PETA – a Pedagogical Embodied Teaching Agent

David Powers, Richard Leibbrandt
AILab, Informatics & Engineering
Flinders University, Adelaide, South Australia
+61-8-8201-3663/3659
{David.Powers, Richard.Leibbrandt}@flinders.edu.au

Martin Luerssen, Trent Lewis
HeadLab, Informatics & Engineering
Flinders University, Adelaide, South Australia
+61-8-8201-3867/3650
{Martin.Luerssen, Trent.Lewis}@flinders.edu.au

Mike Lawson
Centre for Education Futures
Flinders University, Adelaide, South Australia
+61-8-8201-2829
Mike.Lawson@flinders.edu.au
ABSTRACT
We describe a hybrid real and virtual system that monitors and
teaches children in an everyday classroom environment without
requiring any special virtual reality set ups or any knowledge that
there is a computer involved. This system is truly pervasive in
that it interacts with a child who is playing with normal physical
toys using speech. A simulated virtual head provides a focus and
the opportunity for microphonological language teaching, whilst a
simulated world allows the teacher to demonstrate using her set of
blocks – much as a teacher might demonstrate at the front of the
class. However this system allows individual students, or pairs of
students, to interact with the task in a computer-free way and
receive feedback from PETA, the Teaching Head, as if from their
own private tutor.
Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User
Interfaces – evaluation/methodology, graphical user interfaces,
input devices and strategies, interaction styles, natural language,
voice I/O.
General Terms
Design, Experimentation, Human Factors
Keywords
CALL, second language teaching, human-computer interfaces,
affect-awareness, embodied conversational agents.
1. INTRODUCTION
For effective language learning, immersion in a natural language
environment is regarded as highly desirable, if not essential [8,
13, 14, 15]. Normally, for first language or bilingual language
learning, this is provided naturally by living in an area in which
the target language is spoken. Our focus is schools in Australia
where currently children are seldom exposed to native language
speakers as teachers, and where students seldom achieve
internalization of many of the fundamentals of the target language.
In this paper we will assume our target language is German or
English, as it is for our pilot experiments, but in fact we are
looking at a broader range of languages, including also Asian
languages.
As our aim is to create a teaching experience that is as similar as
possible to a natural interaction with a human parent or peer, we
have aimed to minimize the impression of dealing with a
computer and keep the number of peripheral computer
input/output devices employed in the system to a minimum: the
only display devices are a pair of computer screens, and the only
other technical equipment is a set of web cameras.
Communication with the computer is pervasive – rather than
typing at a keyboard children physically manipulate and talk
about real-world objects that are part of a shared reality for both
the computer system and the learner. In this way, the interaction
with the computer tutor is similar to conversing with a human
tutor about things in the immediate “here and now”. This physical
interface constitutes a language-neutral and indirect way of
interacting with a computer, and so may be profitably employed in
other situations where no assumption can be made about the
user’s level of language skill or computer literacy. This is
especially important as we are targeting children at an age where
they are most receptive to language but not necessarily able to
make effective use of a traditional computer interface.
2. THINKING HEAD & ROBOT WORLD
One precursor of this project is the Thinking Head [2], a virtual
anthropomorphic software agent that engages the user in
multimodal dialogue involving a variety of verbal and nonverbal
behaviours. The original Prosthetic Head was commissioned for
performance artist Stelarc, and the latest version is performing at
the National Art Museum of China in association with the Beijing
Olympics. The Head is able to “speak” English sentences via
speech synthesis and facial animation. In addition, the latest
version can display emotion by means of various facial
expressions and through vocal prosody, as well as being capable
of a variety of gestures (see Fig. 1). Users can converse with the
Head by speech or keyboard input; the Head responds by
matching this input against the most appropriate verbal and
nonverbal responses using techniques that are well established for
Embodied Conversational Agents. The Thinking Head initiative
aims for a human-computer interaction that is as natural as a face-
to-face conversation with another person.
Second precursors of the project are the Magrathea and
MicroJaea Robot Worlds [5, 6, 11, 16, 17]. These 3D simulations
were designed to allow a computer to learn linguistic, ontological
and semantic concepts (see Fig. 2) in the context of what has
come to be known as L0 language learning [3].
© ACM, 2008. This is the author's version of the work. It is
posted here by permission of ACM for your personal use. Not
for redistribution. The definitive version was published in
PETRA Workshop on Pervasive Technologies in e/m-Learning
and Internet-based Experiments, {VOL#, ISS#, (DATE)}
http://doi.acm.org/10.1145/nnnnnn.nnnnnn
Figure 1. Thinking Head based on performance artist Stelarc.
Originally designed to explore the role of embodied
conversational agents and virtual reality technology in
performance, it is the forefather of a series of new Thinking Heads
and applications.
3. THE TEACHING HEAD
In the current Teaching Head system, learners interact with the
system by manipulating and speaking about members of a set of
physical objects that are placed within a circumscribed area, and
are surveyed by means of web cameras which track the position of
the objects as they are moved around. Using this video
information, the software system is able to maintain a
dynamically-updated model of where all objects are located in
space. In this way, both learner and teacher are potentially focused
on a shared reality, grounded in the physical world.
The vision system consists of 3 cameras strategically placed to
observe the participant’s interaction with the Teaching Head. One
camera captures the participant’s face and body to gauge
emotional state and focus of attention, and in future versions to
lip-read and track correct articulation. This camera also allows the
Teaching Head to monitor the participant’s current focus, e.g.,
tracking the participant’s gaze and adjusting the gaze of the
Teaching Head to look at either the student or the objects the
student is attending to. The two additional cameras monitor the
Physical Arena from orthogonal perspectives, allowing full 3D
localization. Implementation of the vision software makes use
of state-of-the-art classifier techniques [10, 22] and object
tracking algorithms [1].
At the same time, a Microjaea display of a three-dimensional
virtual world is presented for the purpose of demonstrating
hypothetical situations involving these same manipulable objects.
These situations are then described by the Teaching Head in the
target second language. For instance, one might imagine the
teaching of German prepositions using a number of blocks of
various colours. The Head would utter the sentence “der grüne
Block ist unter dem blauen Block” while the rendered scene in
the imaginary arena shows a green block stacked underneath a
blue one.
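The mapping from a spatial preposition to a demonstrated scene configuration can be sketched in a few lines. The following toy example is illustrative only: the dictionary and function names are invented for this sketch and are not the actual MicroJaea scripting interface.

```python
# Illustrative sketch only: a toy mapping from German spatial prepositions
# to relative block placements in a simple (x, y, z) grid, with z as height.
# Names and offsets are invented for this example, not the MicroJaea API.

PREPOSITIONS = {
    "auf":    (0, 0, +1),   # on top of: one unit above the reference block
    "unter":  (0, 0, -1),   # under: one unit below
    "neben":  (+1, 0, 0),   # beside: one unit to the side
    "hinter": (0, +1, 0),   # behind: one unit back
}

def place_block(reference_pos, preposition):
    """Return the position of a block placed relative to a reference block."""
    dx, dy, dz = PREPOSITIONS[preposition]
    x, y, z = reference_pos
    return (x + dx, y + dy, z + dz)

# "der gruene Block ist unter dem blauen Block":
blue = (0, 0, 1)
green = place_block(blue, "unter")
print(green)  # (0, 0, 0): the green block ends up one unit below the blue one
```

Varying the preposition key while holding the reference block fixed yields the family of demonstration scenes described in Section 5.3.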
In Figure 3 we show a schematic of the teaching workspace
showing one possible configuration of screens and cameras. In
fact the layout is something we are still evaluating/optimizing, and
we expect that our standard configuration will have the two
screens angled at around 120°. This allows the Head to look at
what is happening both on its MicroJaea screen and in the semi-
enclosed workspace bounded by the screens on two sides. This is
true in two senses – the Thinking Head’s simulated eyes can look
at the subject, a physical object, or a simulated object, and the
cameras associated with the workspace can also be focused appropriately –
this can be achieved by fixed cameras with logical panning and
focusing of attention, but we are also evaluating the use of
Logitech QuickCam Sphere webcams that can physically pan as
required.
Figure 3 also illustrates a particular configuration of three
orthogonal cameras, but other configurations are being evaluated.
In particular, with the angled screens each screen can have a
mounted or built in webcam, fixed ahead or panning, and whilst
perfect orthogonality simplifies the mathematics somewhat, there
is no overriding reason to mandate orthogonality and there are
pragmatic advantages in using down looking cameras from the
two monitors (for the workspace) and an upward/forward looking
camera from between the monitors (for face tracking).
Computer input is thus implicit, derived from tracking the user’s
face, movements and actions, as well as their speech.
Figure 2. MicroJaea language teaching scenario. Originally
designed to teach the computer English, this Java 3D world and
scripting language is now being redeployed to teach children
English and German.
4. PEDAGOGICAL APPROACH
Krashen [9] distinguished between acquisition and learning, that
is, acquiring language in a spontaneous, uncritical, unconscious
manner, as opposed to formal language learning, which is concerned
with a conscious inculcation of language rules and facts that are
used by the learner to criticise and correct his/her own
productions. We consider this to be an important distinction and
regard language learning as more akin to a social negotiation [17]
– we learn and indeed invent language and culture in an integrated
fashion that is as much influenced by our interactions with other
learners (peers) as our parents and teachers (models). In a normal
“mother tongue” environment the language of our parents,
teachers and older siblings is similar enough that we do indeed
learn something pretty close to our mother’s language. But where
there is a mix of languages or we have closer association with
peers than models, we tend to learn a creolized amalgam of our
own invention plus multiple linguistic influences.
In a classroom situation, we have relatively few hours available
for language learning each week – sometimes not even one hour
(e.g. one 40 minute lesson a week). This hour is often not well
used from the acquisition perspective. Typical exercises in
primary school are focused on vocabulary learning, drawing
pictures or identifying pictures. At the critical ages for
internalizing the complex morphological and syntactic system of a
language we are simply not providing children with appropriate
models, and we are not requiring them or giving them opportunity
to experiment with language and succeed or fail in expressing
themselves comprehensibly.
A crucial prerequisite for language acquisition is that learners
should be engaged in an activity or discussion with a native
speaker of the target language (German) – this interaction should
ideally be monolingual. Giving away that you can speak their
language (English) reduces the motivation to communicate.
Interspersing English and German into the one conversation
reduces the actual exposure to German and confuses the features
of the languages.
The ideal way of achieving native speaker competence is to have a
particular situation or person where you only communicate in the
target language. In mixed background families or environments
this often relates to a parent, grandparent, governess/au pair, or
other person who naturally speaks a different language. This then
leads to coordinate bilingualism [23] where we see native speaker
competence in the specific contexts where the child was exposed
to that language, and often this is associated with quite different
cultural and personality traits. On the other hand, vocabulary and
registers for other areas of discourse will be lacking – but the
morphological and syntactic infrastructure of language will be
properly in place and exposure to new situations will lead to quick
learning of the appropriate vocabulary and register.
To ensure the required level of motivation and interest on the part
of the learner, as well as the immersion and embodiment that
characterizes learning native level competence, our focus is on
providing intriguing problem solving and collaborative
environments for learning. For this reason, all lessons are
structured as language games, involving physical props, and are
designed to be interesting to learners of the target age.
Because interfacing with the system using a keyboard and mouse
is likely to detract from the naturalness of the interaction, and so
divert learners’ attention from the task at hand, the interface is
designed to be very similar to a normal conversation, with both
parties speaking, and the learner moving physical props around so
as to achieve language game goals such as constructing bridges,
assembling animals or robots from body parts, etc.
The teacher, in this case PETA, the Teaching Head, has an
important role to play in keeping the student interested in the
activity, and gauging understanding by ensuring that the student is
paying attention to the appropriate things. Hence, abilities to
“read” facial expressions and to track visual gaze are
indispensable; these will be determined from video input coming
from a camera directed at the learner during the interaction.
Lastly, the teacher needs to structure the language that he/she uses
in such a way that it is not well beyond the reach of the student,
but also not too simple. The optimal complexity for language
acquisition seems to be language input that is just slightly more
complex than the level that the student has already attained [9,
17]. To this end, our computerized teaching system will be able to
gauge the learner’s current level of development, and to adapt the
complexity of its utterances accordingly.
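One way such "slightly above current level" selection could be sketched is shown below. The complexity scores and the selection rule are invented placeholders for illustration; they are not the system's actual complexity metric.

```python
# Hypothetical sketch of "i+1" utterance selection: prefer the least complex
# candidate strictly above the learner's current level, falling back to the
# most complex one at or below it. Scores are invented placeholders.

def select_utterance(candidates, learner_level):
    """candidates: list of (utterance, complexity) pairs."""
    above = [c for c in candidates if c[1] > learner_level]
    if above:
        return min(above, key=lambda c: c[1])[0]
    return max(candidates, key=lambda c: c[1])[0]

candidates = [
    ("Der Block ist blau.", 1.0),
    ("Der gruene Block ist unter dem blauen Block.", 2.0),
    ("Der Block, den du bewegt hast, liegt unter dem blauen.", 3.5),
]
print(select_utterance(candidates, 1.2))
# "Der gruene Block ist unter dem blauen Block."
```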
Figure 3. The layout of the Teaching Head installation, showing
the Virtual Arena, Teaching Head, Physical Arena, Participant
Camera, and Arena Cameras.
5. SOFTWARE SYSTEM COMPONENTS
5.1 Thinking Head
The Thinking Head is a three-dimensional, animated human head
capable of speaking arbitrary English sentences by synthesizing
the sounds of the sentence while moving its facial effectors in an
appropriate manner. The Head is also capable of a wide range of
emotional facial expressions. Users can converse with the Head
by speech or keyboard input; the Head responds by matching this
input against the most appropriate verbal or nonverbal responses.
The Thinking Head initiative aims for a human-computer
interaction that is as natural as a face-to-face conversation with
another person. Nonverbal signals play an important role here, as
agents that correctly perceive and respond with such signals are
more likely to be accepted as equal partners. The use of a Head
that models natural articulation and expression is a unique facet of
this pedagogical software system. The English-speaking version
of the Head is being developed in Australia in the context of an
ARC (Australian Research Council) Thinking Systems project,
whilst the German-speaking version of the Head is being
developed under a DFG (Deutsche Forschungsgemeinschaft)
project.
Currently, Thinking Head emotions/expressions are primarily
controlled by metadata in the form of tags embedded into the
scripts. However, we are also experimenting with semantic
modeling techniques [25-27] involving corpus-based and
thesaurus-based word similarity techniques. Longer term, we are
building the capability of recognizing expressions and modeling
learner attitudes, although due to the data collection and model
training this will be gradually introduced during the alpha- and
beta-testing phases with actual school students undertaking real
lessons.
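The paper does not specify the metadata syntax for these embedded tags, so the bracketed form below is an assumption made purely for illustration of how expression tags might be separated from the spoken text.

```python
import re

# Illustrative only: expressions are driven by tags embedded in dialogue
# scripts, but the actual syntax is unspecified. We assume a simple
# bracketed form such as "[smile] Gut gemacht! [nod]".

TAG = re.compile(r"\[([a-z]+)\]")

def split_script_line(line):
    """Separate embedded expression tags from the spoken text."""
    tags = TAG.findall(line)
    text = TAG.sub("", line).strip()
    text = re.sub(r"\s+", " ", text)  # collapse whitespace left by removed tags
    return text, tags

text, tags = split_script_line("[smile] Gut gemacht! [nod]")
print(text)  # Gut gemacht!
print(tags)  # ['smile', 'nod']
```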
5.2 Physical arena
In the current system, the Head will become a virtual participant
in the physical world by creating and monitoring an internal
computer representation of a small section of the world, which
will then be the topic of its verbal exchanges with the learner.
This will be achieved by setting aside a circumscribed area of the
real world and placing it under surveillance by video cameras.
Within this ‘physical arena’ will be placed a number of small
objects/”props”, such as blocks/toy animals, which the learner will
be able to manipulate. Information from the cameras will allow
the computer system to build a computer model of the physical
arena, which can be updated dynamically as the learner moves
objects about. In this way, both learner and teacher are focused on
a shared reality, grounded in the physical world.
The physical arena is a unique characteristic of this system in that
users interact as usual with toys and other real-world props. There
is no obvious interaction with a computer, and the user won’t see
anything that looks like a computer (keyboard, mouse, etc.) other
than the two TV-like screens that bound the far sides of the
physical arena. In Wizard-of-Oz training/evaluation versions of
the experiments the same set up can be used for remotely
monitored teaching and collection of training data. In some such
experiments with Embodied Conversational Agents, people have
refused to believe that there wasn’t a Wizard. As Weizenbaum
[24] learned from the Eliza experience, this suspension of
disbelief does not so much result from the “intelligence” of the
computer but its ability to satisfactorily fulfill the role one
expects. In the case of Eliza, playing the part of a psychologist,
all the computer had to do was reflect back what the patient was
saying, and use the occasional keyword match or random change
of subject to keep the conversation moving. This is still the basic
technique in use in the present Thinking Head. However for the
Language Teaching Head, we again have a situation where the
“intelligence” required of the system is minimal, and this mainly
relates to being able to track what the learner is doing in the
physical arena.
The physical arena indeed supplies literal grounding [4], that is
argued to be required for true understanding.
5.3 Virtual arena
In addition to the physical arena, there is a parallel ‘virtual arena’,
consisting of a computer screen display of a three-dimensional
scene. The virtual arena will contain virtual depictions of the same
objects that are in the physical arena. Virtual world scenes will be
created and displayed by means of the MicroJaea virtual world
software, developed in the AI Lab at Flinders for the L0 task of
teaching language to the computer.
MicroJaea (Fig. 2) allows a script writer to define entities that will
populate a virtual world, and to create elementary animated
movies involving these entities. The virtual arena is what allows
the system to ‘teach’ L2 to the learner. The system uses the
MicroJaea rendering capabilities to demonstrate to the learner
situations involving the objects in the physical arena, while at the
same time modeling the descriptions of these situations in L2. So,
for instance, one might imagine the teaching of German
prepositions in a situation where the physical arena contains, say,
a number of blocks of various colours. The Head could then utter
the sentence ‘der grüne Block ist unter dem blauen Block’ while
the rendered scene in the imaginary arena shows a green block
stacked underneath a blue one. The focus of teaching could then
involve variations on this basic theme, varying the preposition
used and the objects involved. The virtual arena will also display
text versions of the sentence currently being spoken.
The virtual arena can also be argued to ground, although [4]
would see this as just simulated grounding. However, given that
the learner successfully makes the connection back to the physical
arena, and the computer understands correctly and reflects it in
manipulating the virtual objects, the grounding cycle can be seen
to be complete.
5.4 Vision System
The vision system for the L2 Project consists of 3 cameras
strategically placed to observe the participant’s interaction with
the Teaching Head. One camera (Participant Camera, Figure 3)
captures the participant’s face and body to gauge emotional state,
focus of attention, and allows the Teaching Head to track the
participant, e.g., tracking the participant’s gaze and adjusting the
gaze of the Teaching Head to look at and to focus the attention of
the participant. The two additional cameras monitor the Physical
Arena from orthogonal perspectives, allowing full 3D
localization. The arena cameras keep track of the current state of
the physical world (e.g. the coloured blocks in Figure 3).
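Under the idealized assumption of calibrated, exactly orthogonal views, combining the two arena cameras into a 3D position is straightforward: one view supplies the table-plane coordinates and the other supplies height, with the shared axis serving as a consistency check. This is a sketch only; real cameras require intrinsic/extrinsic calibration and distortion correction.

```python
# Idealized sketch of 3D localization from two orthogonal arena views.
# Assumes both cameras are calibrated so detections map linearly to arena
# coordinates; the tolerance value is an invented placeholder.

def locate_3d(top_view_xy, side_view_xz):
    """Top view supplies (x, y) on the table plane; side view supplies
    (x, z), i.e. horizontal position and height. The shared x coordinate
    lets us check that the two detections refer to the same object."""
    x1, y = top_view_xy
    x2, z = side_view_xz
    if abs(x1 - x2) > 0.05:  # tolerance for tracking noise (metres)
        raise ValueError("views disagree; association of detections failed")
    return ((x1 + x2) / 2, y, z)

print(locate_3d((0.30, 0.10), (0.31, 0.05)))
```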
The computational aspects of the vision system are quite complex;
however, many of the algorithms required for participant and
object tracking are implemented in the Intel Open Source
Computer Vision (OpenCV) library. Specifically we will make
use of the cascade of boosted classifiers using Haar-like features
[10, 19] and the CAMSHIFT object tracking algorithms [1].
These algorithms have been successfully applied to multiple face-
tracking and arbitrary object detection and tracking. The
complexity of the system is also reduced by having a constrained
physical arena.
For this project, we have available some 3D equipment which
comes into play in ensuring that we can achieve our expectations
in relation to the Vision System. We have 3D cameras/scanners,
printers and autostereoscopic monitors available and the ability to
accurately scan and produce both real and simulated objects is
essential to the project. We can make use of existing libraries of
3D rendered objects, and produce them both on our 3D printer
and on a 2D or 3D screen. We can also visually capture (using
stereo cameras or laser scanners) 3D objects to produce our own
library of objects based on the desired props and toys.
This has several advantages – not only can we ensure arbitrarily
close correspondence between the physical objects and the
simulated objects, we are able to derive significant advantage in
relation to the vision monitoring. In particular we are cheating
slightly by ensuring that we can distinguish different objects and
parts by hue and saturation alone. Note that effects due to shadow
largely do not affect hue and saturation but rather define a set of
narrow colour equivalence classes. Similarly effects due to
changes in lighting are systematic and can be dealt with by white-
balancing. Thus for the early stages, and many of the proposed
scenarios, complex pattern matching is obviated.
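The shadow claim can be illustrated with the standard library's HSV conversion: under a crude shadow model where all RGB channels are scaled by the same factor, hue and saturation are unchanged and only brightness (value) drops, so hue-based object identity survives shading. The colour values are arbitrary examples.

```python
import colorsys

# Crude illustration: scaling all RGB channels uniformly (a simple shadow
# model) leaves hue and saturation unchanged in the HSV colour model,
# so objects distinguished by hue/saturation remain identifiable in shade.

lit = (0.2, 0.8, 0.3)                  # a green block in full light
shaded = tuple(0.5 * c for c in lit)   # same block, half as bright

h1, s1, v1 = colorsys.rgb_to_hsv(*lit)
h2, s2, v2 = colorsys.rgb_to_hsv(*shaded)

print(abs(h1 - h2) < 1e-9 and abs(s1 - s2) < 1e-9)  # True: hue, saturation preserved
print(v2 < v1)                                      # True: only brightness drops
```

Non-uniform lighting changes would shift hue slightly, which is why the text notes that white-balancing is still needed.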
5.5 Language games
Throughout the teaching process, the system will automatically
assess the student’s knowledge in the context of a number of
‘language games’. These activities will be engaging for young
students, and will involve simple tasks such as constructing
towers, bridges, etc. from building blocks, or putting together a
toy animal from detached body parts. During assessment, the
learner is required to demonstrate his/her newly-acquired
knowledge of the second language via manipulation of the
physical objects. In one possible language game, the Head would
speak a sentence in L2, and the learner would be expected to enact
the situation described (e.g. by stacking a blue block on top of a
green one). From the video information obtained from its
cameras, the system will be able to determine whether the correct
situation has been recreated, and will provide appropriate
feedback.
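A check of this kind can be sketched as a predicate over the tracked object positions. The thresholds below are invented placeholders that would need tuning against real tracking data; the function name is illustrative, not part of the actual system.

```python
# Hypothetical check that the learner enacted "X ist auf Y" (X on top of Y),
# given (x, y, z) positions from the vision system's object model, with z
# as height above the table. Tolerances are invented placeholders.

def is_on_top_of(pos_x, pos_y, horiz_tol=0.03, vert_gap=(0.02, 0.06)):
    """True if block at pos_x sits on the block at pos_y: horizontally
    aligned within tolerance, and higher by roughly one block height."""
    dx = abs(pos_x[0] - pos_y[0])
    dy = abs(pos_x[1] - pos_y[1])
    dz = pos_x[2] - pos_y[2]
    return dx < horiz_tol and dy < horiz_tol and vert_gap[0] < dz < vert_gap[1]

blue = (0.30, 0.10, 0.00)
green = (0.31, 0.10, 0.04)          # stacked on the blue block
print(is_on_top_of(green, blue))    # True: feedback can be positive
print(is_on_top_of(blue, green))    # False: stacked the wrong way around
```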
Although this interface is immediately extensible to applications
outside teaching, it is particularly well-suited for application to
teaching a second language, as it allows language learning to take
place in a context of social interaction, which has been shown to
improve acquisition [12]. The approach is compatible with
approaches to language teaching in which language acquisition is
developed within a context of social interaction that is learner-
centred, and promotes complex thinking and critical literacy.
Feedback is provided not by text on a screen, but in spoken form
by an engaging, human-like interlocutor. This simulates provision
of a personal language tutor for every student.
Although we are focused on employing puzzles and games to
teach language, the class of tasks used can be much broader than
this might suggest. It should also be noted that we have defined
between-subjects, within-class control experiments in which part of
the class uses the system in their native language, except for the
final evaluation stage of the experiment. Thus puzzles and games
should be compelling even when undertaken in the native
language. The idea is to keep the student engaged and immersed,
to give them tasks and goals to achieve, but these are not
intrinsically linguistic tasks and goals. Rather the use of language
is incidental but necessary to successful collaboration and
achievement of the goals. To the extent that the required level of
language ability is not present, it will be harder to achieve the
goal, but the goal is also graded and matched to the student’s
cognitive and problem-solving level.
Many well-known games and toys are candidates here, including
both traditional board games and modern computer games,
solitaire games and challenging puzzles. To use the example of
blocks: Tetris, Lego, Tower of Hanoi, Jenga can all be regarded as
examples of goal-oriented block play.
5.6 Empathic and phonological feedback
In teaching a language, a good teacher avoids saying “right” or
“wrong”, and aims to be generally encouraging, allowing students
to safely make mistakes, making the learning experience game-
like and fun so that risk-taking is appropriate, and modelling the
ideal form of response (including correct vocabulary, grammar,
pronunciation, inflection, etc.) In such a response the teacher will
tend to very subtly emphasize particular features of the response
that are currently being targeted, whilst perhaps not drawing
attention to other problems that are not currently in focus.
Generally, the teacher’s voice and face will convey an appropriate
degree of approbation. At this stage of our design and
experimentation, we are focussing on the visual cues to affect.
In relation to learning, our initial focus is on grammatical and
morphological learning, but we envisage that a follow-on trial will
be concerned with phonological and articulatory learning, as we
use the head to model the correct way of holding the face,
utilizing the musculature, and hence articulating the phonology
correctly. We envisage that a future version of the system will be
able to render the child’s own voice and face so that the Head can
mirror the child as s/he talks, and then show exactly how it should
look and sound – we will be looking to see whether there is any
advantage in this being in the child’s own face/voice, a same
gender or similar age persona, etc.
Empathic capabilities are in the initial version solely visual – they
are not reflected in the audio output. These abilities include a
broad range of expressions including concern, approval,
happiness, etc., as well as specific gestures such as nods and
winks. In addition, as mentioned earlier, both the head and
the eyes can be directed either to the student (with automatic
tracking using one of the cameras), or to a physical or virtual
object in the shared attention space, in order to direct or confirm
semantic reference. Phonological capabilities are currently
expressed visually through reasonably accurate lip-syncing.
Generally, the facial articulation can be controlled not only in
terms of specificity, but also in degree or subtlety. For performance
purposes it is usual to exaggerate expressivity, but for educational
purposes this is something that is being evaluated separately and
contrastively. In particular, it is hypothesized, based on teacher
training and practice, that it is of value to emphasize the
particular features of the utterance that are in focus, and to control
the degree of approbation finely.
6. EXPERIMENTAL EVALUATION
During experimental testing with users in schools, we will
determine the extent to which particular aspects or components of
the system are crucial to attaining good pedagogical outcomes.
We will investigate whether having a physically grounded
interface is necessary for teaching success. Likewise, we will
investigate the effect of a reduced display of affect and personal
warmth on the part of the Head, and the use of detailed, verbose
explanations during feedback versus more terse responses.
However a significant feature of this project is that prior to
finalizing the details of the system to be deployed in schools we
are conducting tuning experiments specifically aimed at
determining the effect of affect. This is being measured using
variations of standard techniques for Virtual Reality evaluation,
including VRuse [7] from a general perspective and the
framework of [21] from a pedagogical perspective. These
frameworks do not directly assess the effect of affect, and some
measures are inappropriate due to different assumptions about the
nature of the virtual environments.
Two kinds of initial evaluation are performed: a within-subjects
comparison of performance between two specific affective
conditions, and a between-subjects comparison across conditions
that cannot be tested within subjects due to interference
effects. Two kinds of measures are made in relation to each
condition: performance measures (e.g. using comprehension tests
in our first battery of experiments) and affective measures
(specific Likert questions: I found the Head
Likeable/Engaging/Easy to understand/Life-like/Humorous). In
addition to the evaluation of two conditions per subject (in a
single session with randomized order of conditions and
questions), a number of other questions are asked relating to
background and overall impression to facilitate a proposed factor
analysis.
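The within-subjects side of this design can be sketched as a paired comparison: because each subject experiences both affective conditions, the analysis looks at per-subject score differences rather than group means. The scores below are invented examples, not experimental data.

```python
# Illustrative analysis sketch for the within-subjects comparison: each
# subject is scored under both affective conditions, so we examine the
# per-subject differences. All scores here are invented examples.

def paired_differences(scores_a, scores_b):
    """Per-subject difference (condition A minus condition B)."""
    return [a - b for a, b in zip(scores_a, scores_b)]

def mean(xs):
    return sum(xs) / len(xs)

expressive = [4, 5, 3, 4, 5]   # e.g. comprehension scores, expressive condition
neutral    = [3, 4, 3, 3, 4]   # same subjects, neutral condition

diffs = paired_differences(expressive, neutral)
print(diffs)        # [1, 1, 0, 1, 1]
print(mean(diffs))  # 0.8
```

A positive mean difference would favour the expressive condition; a significance test over the differences would follow in a real analysis.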
The initial pair of conditions being investigated within subject is
the effect of expression in the Head’s speech, whilst the initial
pair of conditions being investigated between subjects is the effect
of eye contact/tracking.
This initial evaluation is being followed by two streams of
classroom evaluation, covering multiple design conditions. We
have two specific aims for the evaluation process.
Aim 1: To evaluate the influence of specific computer-assisted
language learning (CALL) components of the Thinking Head
system on language learning outcomes;
Aim 2: To evaluate the influence of specific aspects of student-
teacher interaction in the Thinking Head on language learning
outcomes.
The first stream of evaluation will be conducted as a broad scale
pilot study encompassing preschool, K-12, and university
evaluation, the major focus of which will be the broad range of
classes offered by the Adelaide School for German. This study is
intended to derive preliminary results about both these questions,
as well as test the age suitability of the various kinds of materials
and scenarios under development. Whilst our overall aim is to
simulate the learning of preschool children, and primary school
students are seen to be the key age groups for our studies, we must
recognize that most language teaching takes place at high school
and as adults. A second feature of this study is that the Adelaide
School for German has a large proportion of students with some
degree of existing German background, so that degree of language
competence can be examined independently of age. Moreover, the
applicability of the approach to students who already have some
kind of natural German-speaking environment will allow exposure
to the specific kinds of teaching scenarios to be analyzed against
general exposure to German – validating the teachers’ assessment
that even these German-background students fail to master the
preposition and case systems and can benefit from our system.
Following tuning of the system in the light of this pilot study, we
will conduct a full study exploring many variants of the system at
the focal year 3-4 (8-9 year old) level, as well as a second series
of age-varying experiments at preschool, first-year high school and
first-year university, where there are contingents of students with
vastly varying degrees of exposure. We will specifically conduct
three experiments in which the Thinking Head system is used to teach
German prepositions, in order to evaluate the extent to which the
system is successful at the teaching task, and to identify crucial
aspects of the system:
Experiment 1 will compare the effects of teaching students using
the full Teaching Head system with teaching the same material in a
number of traditionally structured sessions conducted by an
experienced German teacher. To this end, a number of participating
classes will be randomly split in two, with one group taught German
spatial prepositions in a traditional classroom-based manner, and
the other taught by means of the Head. We will conduct this
experiment using students from each of our four target groups
(preschool, years 3-4 (8-9 years old), first-year secondary school,
first-year university).
Language performance will be assessed by means of a series of
tests of graded difficulty for both receptive and productive
language, including text-based multiple-choice tests and, notably,
by assessing performance on an unseen language game conducted using
the complete Teaching Head system. (For this reason, it is important
that students taught via traditional methods also receive exposure
to the Thinking Head interface, playing the same language games as
the other group, but in English rather than German, to ensure that
they have a comparable level of familiarity with the interface at
test.)
Performance of the two groups on the two assessment measures
will be compared for significant differences by means of t-tests.
For each age group, we will require a minimum of 2 classes
which will be split in this way.
Experiment 2 will be concerned with determining which
components of the Teaching Head system are crucial to success,
in line with our Aim 1. The three conditions of interest are: the
full system, the system without the visual Head interface (i.e. the
sounds of words are produced but no visual animated head is
present; No Head), and the system without the physical interface
with blocks and toys (instead these are virtual only and
manipulated using a touch screen or mouse; No physical arena).
Here we will focus on the main target age of 8/9-year-old
schoolchildren only. Classes will be split and groups randomly
assigned to each of these three conditions for the duration of
teaching. Subsequently, evaluation of learning outcomes will be
conducted on each of the groups using the same methods as for
Experiment 1.
We hypothesize that students exposed to the full system will have
a superior knowledge of prepositions to those exposed to reduced
systems.
If testing were conducted using the full system only, students
taught on that system would enjoy a familiarity advantage at test.
To avoid this, students will be assessed first on the system (full
or reduced) on which they were taught, followed by the one on which
they were not taught. This allows us to distinguish the effects of
the training interface from the effects of the testing interface.
The significance of the effect of the testing interface will be
assessed by paired t-tests within subjects (for each testing
condition individually as well as both combined), and the
significance of the effect of the teaching interface by unpaired
t-tests between subjects, as well as by ANOVAs with time, age and
continuous assessment covariates.
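As a sketch of the two t-test comparisons described above (with invented toy scores standing in for the real graded test results; the function names and data are illustrative, not part of the study protocol), the within-subjects and between-subjects statistics could be computed as:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Within-subjects (paired) t statistic on per-student score differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1  # (t, df)

def unpaired_t(group_a, group_b):
    """Between-subjects (unpaired) t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    va, vb = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / se, na + nb - 2  # (t, df)

# Hypothetical scores: the same students tested on both interfaces (paired) ...
reduced_iface = [12, 14, 11, 15, 13, 14]
full_iface = [14, 16, 12, 18, 15, 17]
t_within, df_w = paired_t(reduced_iface, full_iface)

# ... and two independently taught class halves (unpaired).
taught_full = [15, 17, 14, 18, 16]
taught_reduced = [12, 13, 14, 12, 15]
t_between, df_b = unpaired_t(taught_full, taught_reduced)
```

In practice such statistics would be looked up against the t distribution with the returned degrees of freedom (or computed with a statistics package) to obtain p-values.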
We will require a minimum of 2 split school classes in each
condition and age group.
Experiment 3 will be concerned with identifying the aspects of
the student-tutor interaction itself that are crucial to success (in
line with our Aim 2). In particular, we will evaluate
the effects of having markedly reduced affective behaviour on the
part of the Head (Reduced Affect), and of providing terse
responses with little explanation (Reduced Feedback). The
experimental design and evaluation procedures are exactly the
same as for Experiment 2.
We hypothesize that the full Head system will be more efficacious
in teaching German prepositions to students than the reduced
versions.
Qualitative student and teacher evaluation of the system: In
addition to the quantitative evaluation of the use of the system for
language teaching, both student and teacher views about use of
the system and system features will be gathered in all 3
experiments.
The analysis of this qualitative information will form an important
part of the educational evaluation of the system.
Quantitative evaluation measures: To ensure unbiased evaluation, it
is most important to take into account performance levels that are
achievable without using the Teaching Head system, or without the
specific feature under investigation in any given experiment.
For this reason, careful baselines and controls have been included
in the above experiments, and we will be using Bookmaker
Informedness and Markedness [18-20] to directly determine the
unbiased probabilities of the tested contingencies.
This approach will also directly give us confidence intervals,
significance and power estimates which will be complementary to
the tests used in the outlined experiments and will help determine
the final group sizes.
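For a binary (2x2) contingency table, the Bookmaker Informedness and Markedness statistics of [18-20] reduce to recall plus inverse recall minus one, and precision plus inverse precision minus one, respectively, so that chance-level performance scores zero regardless of bias. A minimal sketch, using an invented contingency table purely for illustration:

```python
def informedness(tp, fn, fp, tn):
    """Bookmaker Informedness: recall + inverse recall - 1.
    Estimates the probability of an informed (better-than-chance) prediction."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

def markedness(tp, fn, fp, tn):
    """Markedness: precision + inverse precision - 1.
    Estimates the probability that the condition marks the prediction."""
    return tp / (tp + fp) + tn / (tn + fn) - 1

# Hypothetical counts: items correctly/incorrectly classified as "mastered".
tp, fn, fp, tn = 40, 10, 20, 30
B = informedness(tp, fn, fp, tn)  # 40/50 + 30/50 - 1 = 0.4
M = markedness(tp, fn, fp, tn)    # 40/60 + 30/40 - 1 = 0.41666...
```

A biased guesser that always answers "mastered" would score tp=50, fn=0, fp=50, tn=0 and hence an Informedness of 0, which is precisely the bias-correction property motivating the choice of these measures.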
7. CONCLUSION
At the time of submission, the formal evaluation is only just
starting, although preliminary experimental results show a
significant effect of the provision of appropriate facial
expression in a verbal comprehension task.
Informally, we have found in our preliminary experiments and
demonstrations that the target age group of around 9 years old
does seem to reflect a peak of interest in, and willingness to
engage in, long conversations with the Thinking Head. Teachers are
very keen on the concept, and have identified the interaction of
prepositions and case in German as a key area which students tend
not to master by the end of high school, even when they have a
German-speaking home background as immigrants to Australia – this
area is thus the focus of our current design and evaluation.
Users of the system, referred to as clients, are generally
impressed by the fact that the head maintains eye contact, and by
the range of expression that is used. The Thinking Head that is the
basis for these experiments is on display at the New Media Arts
Exhibition at the National Art Museum of China in the lead-up to
the Beijing Olympics – for this performance purpose, expressions
tend to be more exaggerated than natural. However, we are still
conducting experiments on subjects blind to our capability of
controlling these features, and expressions are designed to be as
natural and appropriate as possible for the subject matter, which
has been chosen from standard educational materials targeted at
senior primary school students, although for our initial evaluation
our subjects are first-year university students.
The purpose of the present study is to formally confirm that the
Teaching Head is able to successfully employ supralinguistic
features to achieve demonstrable gains in teaching performance –
unfortunately very little educational software seems to be
evaluated formally, and similarly choices in user interfaces and
virtual reality tend to be assessed post hoc rather than developed
based on a systematic psychological evaluation of the effect of
individual features.
Currently, our focus is on ‘workshopping’ with teachers and
educators to develop the teaching materials/scenarios and refine
the approach and experimental methodology. This is a major
project with international cooperation from several countries, and
with strong interest in extension to further languages. In
presenting in this forum, we are specifically inviting feedback on
our approach and encouraging participation in the project – both
in terms of porting to additional languages and supporting
development of additional teaching content, and in the application
and evaluation of the Teaching Head in different countries and
educational contexts.
We will report elsewhere on results obtained using a read-out-
loud task in which the Thinking Head read out a text with and
without expression, and the user completed both Likert ratings of
affect and comprehension questions based on the read material.
Preliminary results show a range of responses to the perceived
difficulty of the material presented; where difficulty is well
matched to the subject, there is a significant positive effect of
the addition of appropriate expression by the Thinking Head, but
this effect is not seen for a text perceived as difficult.
8. ACKNOWLEDGEMENTS
The Thinking Head project is funded under ARC SRI
TS0669874, in the context of the Australian Research Council
and National Health and Medical Research Council joint Special
Research Initiative in Thinking Systems. Australian Partners
include the University of Western Sydney, Flinders University,
Macquarie University and the University of Canberra; Industry
Partners include Seeing Machines Ltd; International Partners
include Carnegie Mellon University, Berlin University of Technology
and the Technical University of Denmark.
Development of the German language Head is supported by the
Deutsche Forschungsgemeinschaft (DFG).
We also acknowledge and appreciate the assistance of the Goethe
Institute and numerous schools, teachers and students.
9. REFERENCES
[1] Bradski, G.R. 1998. Computer Vision Face Tracking for Use
in a Perceptual User Interface. Intel Technology Journal,
Q2 1998, 15.
[2] Davis, C, Kim, J. Kuratate, T. & Burnham, D. 2007. Making
a thinking-talking head. Proceedings of the International
Conference on Auditory-Visual Speech Processing (AVSP
2007). Hilvarenbeek. The Netherlands.
[3] Feldman, J.A., Lakoff, G., Stolcke, A. & Hollbach Weber, S.
1990. Miniature language acquisition: a touchstone for
cognitive science. Technical report, ICSI, Berkeley, CA.
[4] Harnad, S. 1990. The Symbol Grounding Problem, Physica
D 42: 335-346.
[5] Homes, D. A. 1998. Perceptually Grounded Language
Learning. B.Sc. Honours Thesis, Dept of Computer Science,
Flinders University, Adelaide.
[6] Hume, D. 1984. Creating Interactive Worlds with Multiple
Actors, Computer Science Honours Thesis, Electrical
Engineering and Computer Science, Uni. of NSW, Sydney.
[7] Kalawsky, R.S. 1999. VRUSE-a computerized diagnostic
tool: for usability evaluation of virtual/synthetic environment
systems. Applied Ergonomics, 30, 11-25.
[8] Kindler, A. 2002. Survey of the states' limited English
proficient students and available educational programs and
services, 2000-2001 summary report. National Clearinghouse
for English Language Acquisition & Language Instruction
Educational Programs.
[9] Krashen, S. D. 1981. Principles and Practice in Second
Language Acquisition. English Language Teaching series.
London: Prentice-Hall International UK Ltd.
[10] Lienhart, R. & Maydt, J. 2002 An Extended Set of Haar-like
Features for Rapid Object Detection, IEEE ICIP 2002, 1,
900-903.
[11] Li Santi, M., Leibbrandt, R.E. & Powers, D.M.W. 2007.
Developing 3D Worlds for Language Learning, Australian
Society for Cognitive Science Conference, Adelaide,
Australia.
[12] Lucas, T. 1993. Secondary schooling for students becoming
bilingual. In M. B. Arias & U. Casanova (Eds.), Bilingual
education: politics, practice and research pp. 113-143.
Chicago: University of Chicago Press.
[13] MacWhinney, B. 1997. Second language acquisition and the
Competition Model. In A.M.B. de Groot & J. Kroll (Eds.),
Tutorials in bilingualism: Psychological perspectives pp.
169-199. Mahwah, NJ: Erlbaum.
[14] Montessori, M. 1907. House of children: a planet without
schools or teachers. http://www.montessori.edu/maria.html
accessed 15 Feb 2008.
[15] Piaget, J. 1928/2001. Judgment and reasoning in the child.
New York: Harcourt, Brace and Company.
[16] Powers, D.M.W., Leibbrandt, R.E., Li Santi, M. & Luerssen,
M.H. 2007. A multimodal environment for immersive
language learning - space, time, viewpoint and physics, Joint
HCSNet-HxI Workshop on Human Issues in Interaction and
Interactive Interfaces, Sydney, Australia.
[17] Powers, D.M.W. & Turk, C. 1989. Machine Learning of
Natural Language, Research Monograph, New York/Berlin:
Springer-Verlag.
[18] Powers, D M. W. 2003. Recall and Precision versus the
Bookmaker. International Conference on Cognitive Science,
pp529-534. http://david.wardpowers.info/BM/BMPaper.pdf
accessed 11 May 2008.
[19] Powers, D.M.W. 2007. Evaluation: From Precision, Recall
and F-Factor to ROC, Informedness, Markedness &
Correlation. School of Informatics and Engineering, Flinders
University, Adelaide, Australia, Technical Report SIE-07-001,
December 2007.
[20] Powers, D.M.W. (in press), Evaluation Evaluation, ECAI
July 21-25 2008.
[21] Roussou, M., Gillingham, M., Moher, T. 1998. Evaluation of
an Immersive Collaborative Virtual Learning Environment
for K-12 Education. American Educational Research
Association Annual Meeting (AERA), San Diego, CA.
[22] Viola, P. A. & Jones, M. J. 2001. Rapid object detection
using a boosted cascade of simple features. Proceedings of
the Conference on Computer Vision and Pattern Recognition
(CVPR'01), IEEE Computer Society, 511-518.
[23] Weinreich, U. 1953/1974. Languages in Contact, Berlin:
Walter de Gruyter.
[24] Weizenbaum, J. 1976. Computer Power and Human Reason:
From Judgment To Calculation. San Francisco: W. H.
Freeman.
[25] Yang, D. & Powers, D.M.W. 2006. Word sense
disambiguation using lexical cohesion in the context. Joint
Conference of the International Committee on
Computational Linguistics and the Association for
Computational Linguistics (COLING-ACL) 2006, Sydney,
Australia, 929-936.
[26] Yang, D. & Powers, D.M.W. 2005. Measuring semantic
similarity in the taxonomy of WordNet. Australasian Computer
Science Conference (ACSC2005), pp. 315-322.
[27] Yang, D. & Powers, D.M.W. 2008. Automatic Thesaurus
Construction. Australasian Computer Science Conference
(ACSC2009), pp. 147-156.