Accommodation of voice parameters in dialogue
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
Pfaffenwaldring 5b
D-70569 Stuttgart

Accommodation of voice parameters in dialogue

Melanie Hoffmann
Master's thesis

Examiner: Prof. Dr. Grzegorz Dogil
Supervisor: Dr. Natalie Lewandowski

Start of work: 28 February 2014
End of work: 31 July 2014
Declaration of Authorship

I hereby declare

• that I have written the present thesis independently,

• that I have used no sources other than those indicated and have marked as such all statements taken verbatim or in paraphrase from other works,

• that the submitted thesis has not been, in whole or in substantial part, the subject of any other examination procedure,

• that I have not published the thesis, in whole or in part, and

• that the electronic copy is identical to the other copies.

Stuttgart, 31 July 2014

Melanie Hoffmann
Acknowledgements

I would like to thank my supervisor Natalie Lewandowski: for her door always being open when I had questions, for her patience and time, and for her very helpful comments and tips. One could hardly wish for a better supervisor!

A huge thank you goes to Charlie P., who read through my work. I'm sure he made it more joyful to read.

I would also like to thank my parents. Thank you for always believing in me and for your support! Papa, thank you for the many helpful conversations. Mama, thank you for looking after me and for the promise of our trip together! I am very much looking forward to it.

Many thanks also to Jan W. for all the encouragement and good-luck wishes, not only with regard to this thesis. You were a great support.
Contents

Abstract

1 Introduction

2 Accommodation in dialogue
  2.1 Communication Accommodation Theory
  2.2 Phonetic convergence

3 The perception of voice
  3.1 Listeners' judgements from voice perception
    3.1.1 Physical characteristics of the speaker
    3.1.2 Psychological characteristics of the speaker
    3.1.3 Social characteristics of the speaker
  3.2 Voice and exemplar theory

4 Method
  4.1 The Praat Voice Report
  4.2 Corpus analysis

5 Results
  5.1 Analysis 1 - Individual speakers
  5.2 Analysis 2 - Speaker F
  5.3 Analysis 3 - Position in dialogue

6 Discussion
  6.1 Discussion of method
  6.2 Discussion of results
    6.2.1 Discussion of Analysis 1 - Individual speakers
    6.2.2 Discussion of Analysis 2 - Speaker F
    6.2.3 Discussion of Analysis 3 - Position in dialogue
  6.3 Changes in fundamental frequency
  6.4 Temporal aspects
  6.5 Engagement

7 Conclusion and Outlook

Bibliography

Appendices

A Outliers - Individual speakers

B Outliers - Speaker F

C Results of Analysis 1 - Individual speakers
  C.1 Dialogue A-F
  C.2 Dialogue C-F
  C.3 Dialogue H-F
  C.4 Dialogue J-F
  C.5 Dialogue K-F

D Results of Analysis 3 - Position in dialogue
  D.1 Dialogue A-F
  D.2 Dialogue D-F
  D.3 Dialogue H-F
  D.4 Dialogue J-F
  D.5 Dialogue K-F
List of Tables

3.1 Possible judgements from the perception of voice.
3.2 Relations between acoustic parameters and emotions.

4.1 Sum of outliers - Individual speakers.
4.2 Sum of outliers - Speaker F.

5.1 Paired samples statistics, Analysis 1 - Speaker D.
5.2 Paired t-tests, Analysis 1 - Speaker D.
5.3 Paired samples statistics, Analysis 1 - Speaker F.
5.4 Results of the paired t-tests, Analysis 1 - Speaker F.
5.5 Summary of Analysis 1 - Dialogue D-F.
5.6 Summary of Analysis 1.
5.7 Paired samples statistics, Analysis 2 - Speaker F.
5.8 Paired t-tests, Analysis 2 - Speaker F.
5.9 Summary of Analysis 2 - Speaker F.
5.10 Descriptive statistics, Analysis 3, Dialogue C-F, F0 Mean.
5.11 Results of ANOVA of Analysis 3, Dialogue C-F, F0 Mean.
5.12 Descriptive statistics, Analysis 3, Dialogue C-F, F0 SD.
5.13 Results of ANOVA of Analysis 3, Dialogue C-F, F0 SD.
5.14 Descriptive statistics, Analysis 3, Dialogue C-F, F0 Min.
5.15 Results of ANOVA of Analysis 3, Dialogue C-F, F0 Min.
5.16 Descriptive statistics, Analysis 3, Dialogue C-F, F0 Max.
5.17 Results of ANOVA of Analysis 3, Dialogue C-F, F0 Max.
5.18 Descriptive statistics, Analysis 3, Dialogue C-F, jitter.
5.19 Results of ANOVA of Analysis 3, Dialogue C-F, jitter.
5.20 Descriptive statistics, Analysis 3, Dialogue C-F, shimmer.
5.21 Results of ANOVA of Analysis 3, Dialogue C-F, shimmer.
5.22 Descriptive statistics, Analysis 3, Dialogue C-F, HNR.
5.23 Results of ANOVA of Analysis 3, Dialogue C-F, HNR.
5.24 Summary of Analysis 3 - Position in dialogue.

6.1 Mean and standard deviation for jitter, shimmer and HNR.
List of Figures

2.1 Interactive Alignment Model.
2.2 Hybrid model of convergence.

3.1 Perception-production loop in exemplar theory.

4.1 Perturbation: jitter and shimmer.
4.2 Dialogues of speaker F.
4.3 Boxplot, outliers F0 Min.

6.1 Error of measurement for F0 Min in Praat.
6.2 Error of measurement for F0 Max in Praat.
List of Abbreviations

ANOVA            Analysis of variance
CAT              Communication Accommodation Theory
CNN              Cable News Network
F0 Mean          Mean fundamental frequency
F0 Min           Minimum of the fundamental frequency
F0 Max           Maximum of the fundamental frequency
GeCo             German conversations corpus
HNR              Harmonics-to-noise ratio
IAM              Interactive Alignment Model
IQ               Intelligence quotient
MDVP             Multi-Dimensional Voice Program
NHR              Noise-to-harmonics ratio
RP               Received Pronunciation
SAT              Speech Accommodation Theory
SD               Standard deviation
Std. Error Mean  Standard error of the mean
VOT              Voice onset time
Abstract
Communication, according to Communication Accommodation Theory, is dependent on social and situational factors. Depending on their emotional states, the amount of attention paid, their impression of the conversational partner and the situation, speakers can become more similar or more dissimilar to their conversational partner in behaviour and speech, or they can maintain their own style.
The aim of the present thesis was to investigate phonetic convergence, de-
fined as the increasing acoustic similarity of speech, in parameters of voice. The
intention was to identify those voice parameters that are sensitive to changes
due to accommodation towards the conversational partner.
Samples from six different dialogues of the GeCo corpus [SL13, SLD14], con-
sisting of German spontaneous speech of female speakers, were extracted from
the first and the last five minutes. Each sample was analysed for parameters of
voice (mean, minimum, maximum and standard deviation of the fundamental
frequency, jitter, shimmer and harmonics-to-noise ratio) using the Praat Voice
Report [BW14]. Evaluations of the results were made with paired t-tests and
ANOVAs. Additionally, the speakers' ratings of their conversational partners' social attractiveness and competence were included in the evaluation of the outcome.
Results revealed that the individual speakers changed on different voice parameters. Thus it can be assumed that accommodation is not a fully automatic process, but is also dependent on the situation, on the conversational partner and on the speaker's individual characteristics. Furthermore, the analyses indicate that parameters of the fundamental frequency are sensitive to accommodation. Tendencies for convergence, divergence and maintenance in the results were partly confirmed by the speakers' ratings of social attractiveness and competence.
1 Introduction
Whenever we perceive the voice of another person, whether while reading, chatting, screaming or even whispering, we draw information about the speaker. We can make statements about whether the person is male or female, whether he or she is young or old, which mood or emotional state he or she is in, whether he or she is ill and where he or she is from. The impressions we have are not necessarily correct, but they nevertheless lead to the drawing of an "auditory face" [BFB04] that allows us to recognize individuals, emotional states and aspects of personality and heritage.
We can adjust our voice according to the perceived acoustic parameters of our conversational partner's voice and the information that we draw from it.
The Communication Accommodation Theory (Chapter 2.1), referring
to behaviour which includes speech, states that convergence (becoming more
similar to the conversational partner), divergence (becoming more dissimilar to
the conversational partner) and maintenance (persisting in one’s own original
style) are socially motivated. Thereby we can define and express the social
distance to our conversational partner.
Phonetic convergence (Chapter 2.2) then refers to the acoustic characteristics of the increasing similarity of speech. Speakers hereby adopt the acoustic-phonetic features of the conversational partner. When dealing with phonetic convergence, the question arises to what extent convergence is controllable by and conscious to the speaker. The Interactive Alignment Model proposes that perception and production are directly coupled and that convergence is thus mostly automatic. This has been criticized on the grounds that divergence and maintenance would then not be possible, since speakers would only be able to converge to their conversational partner. A hybrid model therefore assumes that convergence is driven automatically, but can be influenced by various factors, including the speaker, the conversational partner and the situation.
Chapter 3, The perception of voice, deals with the different speaker characteristics that listeners can infer from the perception of voices. Their judgements need not be correct, but they are nevertheless made, concerning physical, psychological and also social characteristics of the speaker. These details and pieces of information about the speaker are stored together with acoustic and lexical information in the mind of the listener. In order to converge towards the conversational partner, listeners have to remember and reuse this information. This process can be well explained by Exemplar Theory, which proposes that every stimulus, such as different voice parameters, can be stored in the mind as a detailed trace and can then serve as the basis of recognition and production.
The methods used in the present thesis are presented in Chapter 4, Method. There, the Praat Voice Report is introduced, along with the voice parameters investigated: the mean, standard deviation, minimum and maximum of the fundamental frequency as well as jitter, shimmer and the harmonics-to-noise ratio. The corpus used is the GeCo corpus, which consists of dialogues of about 25 minutes of spontaneous speech of female German native speakers.
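As a rough illustration of the four F0-based parameters, a minimal sketch in plain Python follows; the values are hypothetical, and the thesis itself obtained all measurements from the Praat Voice Report rather than from code like this. The perturbation measures (jitter, shimmer) and HNR require period- and amplitude-level analysis that Praat performs internally.

```python
import statistics

def f0_statistics(pitch_track_hz):
    """Compute F0 Mean, F0 SD, F0 Min and F0 Max from a list of F0
    samples in Hz. Unvoiced frames (here encoded as 0 or None) are
    skipped, mirroring the fact that F0 is only defined for voiced
    frames."""
    voiced = [f for f in pitch_track_hz if f]  # drop 0/None frames
    return {
        "F0 Mean": statistics.mean(voiced),
        "F0 SD": statistics.stdev(voiced),
        "F0 Min": min(voiced),
        "F0 Max": max(voiced),
    }

# Hypothetical five-frame pitch track from one sample (Hz); 0 = unvoiced
track = [210.0, 0, 195.0, 205.0, 230.0]
stats = f0_statistics(track)
```

Each five-minute sample in the thesis yields one such set of values, which then enters the statistical comparisons described below.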
In Chapter 5, which describes the Results, three different analyses were conducted with the help of SPSS, using paired t-tests and ANOVAs. The first analysis investigates the speakers' individual differences within the dialogues, the second investigates differences for one speaker participating in all six dialogues and the third sheds light on differences in the voice parameters when comparing the beginning and end points of the dialogues.
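The paired t-test behind Analyses 1 and 2 can be sketched as follows. This is only an illustration of the statistic with hypothetical numbers; the thesis performed the actual tests in SPSS.

```python
import math
import statistics

def paired_t(before, after):
    """Paired-samples t statistic: t = mean(d) / (sd(d) / sqrt(n)),
    where d are the per-sample differences (e.g. a voice parameter in
    the first vs. the last five minutes of a dialogue). Degrees of
    freedom are n - 1."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1

# Hypothetical F0 Mean values (Hz) for five samples, beginning vs. end
begin = [205.0, 198.0, 210.0, 202.0, 207.0]
end = [210.0, 204.0, 213.0, 208.0, 211.0]
t, df = paired_t(begin, end)
```

The resulting t value is then compared against the t distribution with df degrees of freedom to obtain the significance level reported in the tables.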
In the Discussion (Chapter 6) the results of the individual analyses are discussed. For the first analysis, the analysis of the individual speakers, the speakers' ratings of their conversational partners' social attractiveness and competence were included. Chapter 7, Conclusion and outlook, summarizes the findings and provides an outlook on future investigations concerning phonetic convergence.
2 Accommodation in dialogue

In the interaction of dialogue partners, processes of accommodation can occur, i.e. adjustments in the direction of the conversational partner. Communication Accommodation Theory (CAT), described in Chapter 2.1, provides explanations of speakers' motivations for variation in behaviour and in speech. Phonetic convergence, see Chapter 2.2, refers to listeners' perception of an acoustic signal that can be broken down into a set of features, which can be reused in production. In addition, the degree of automaticity of convergence is examined with the help of the Interactive Alignment Model and a hybrid model.
2.1 Communication Accommodation Theory
Communication is context-sensitive: people tend to speak differently depend-
ing on conversational partners, e.g. colleagues, one’s parents or children. Con-
nections and relationships between humans are very complex and “suited in
an array of dynamic human features and personal attributes, such as experi-
ences, developmental process, and social and personal identity orientations”
[PG08, p. 25]. The assumption, according to Communication Accommodation
Theory (CAT), is that interpersonal and intergroup relationships are mediated
and maintained through communication [GG98].
CAT was proposed in the 1970s under the name Speech Accommodation Theory (SAT) [Gil73]. The theory thereby focussed on the motivations that cause individuals to change their speech behaviour [Gil73, GTB73]. The expansion of its scope, to include relational, contextual and identity processes in interaction, then led to a renaming of the theory as Communication Accommodation Theory, which combines the areas of social psychology, sociology, sociolinguistics and communication [SGLP01]. CAT proposes that speakers
use verbal as well as non-verbal communication to achieve a desired social dis-
tance between themselves and the conversational partners [SGLP01, PG04].
In other words, parameters like use of language, quality of voice, gestures, pos-
ture, body movements, physical proximity, eye contact and facial expressions
can be used to emphasize or reduce the social distance to the conversational
partner and can express social status differences, ethnic and group boundaries
as well as role or norm-specific behaviours [SGLP01, p. 34]. Therefore the
process of accommodation is assumed to be complex and context-sensitive.
Accommodation has been demonstrated for diverse dimensions of verbal and non-verbal behaviour. Individuals' behaviour during interaction becomes similar, for example in foot shaking and face touching [CB99] and in facial expressions, smiling and gestures [GO06, p. 295]. Information density has also been found to become more similar during dialogue [AJL87]. For verbal behaviour, accommodation is evident, for example, in the length, duration and frequency of pauses [GH82, JF70], the duration of utterances [GH82, JF70], accent [BG77] and backchannels [SL12].
The goals of CAT are to explain speakers' linguistic and behavioural choices, the ways in which speakers adjust their speech towards their conversational partner and the ways in which speech is perceived, evaluated and reacted to [GG13]. In order to create, reduce or maintain social distance, speakers use three different strategies, namely convergence, divergence and maintenance.
1. Convergence
Convergence describes a speaker’s adjustment of his or her speaking style
and behaviour to become more similar or synchronous to a conversa-
tional partner [SGLP01]. Therefore individuals adopt each other’s com-
municative behaviour and thus reduce the social distance. Convergence
“is typically associated with affiliation, social approval, compliance, and
communication effectiveness” [PG08, p. 19].
2. Divergence
During the process of divergence, speakers accentuate their individual
distinctiveness [BG77] and emphasize the differences between self and
other [GOG05, SGLP01]. Hereby they display antipathy and social dis-
approval for the conversational partner [SGLP01].
3. Maintenance
The strategy of maintenance, or "attempted non-convergence and non-divergence" [SGLP01, p. 35], describes the behaviour in which a person persists in his or her original style. The communication behaviour of the conversational partner is thereby not taken into account by the speaker [GOG05]. Reasons for this might be the substantiation of the speaker's identity or autonomy without emphasizing it; another possibility is a lack of sensitivity [GO06, p. 297]. Maintenance is often regarded as similar to divergence [GO06].
Like divergence, the process of overaccommodation can have a negative effect on communication [PG08, p. 19]. Overaccommodation is defined as "a category of miscommunication in which a participant perceives a speaker to exceed the sociolinguistic behaviours deemed necessary for synchronized interaction" [SGLP01, p. 38]. An example of overaccommodation is patronizing speech (e.g. nurses' speech to elderly clients [EN93] or medical students talking to patients with disabilities [DBSA11]). Such speech consists of simplified vocabulary and grammar and slow enunciation [PG08, p. 19].
Conversely, underaccommodation "specifies communication environments where speakers do not afford their listeners adequate rights or space in conversational interaction" [CJ97, p. 243]. Such a speaker is perceived as not interested in the conversation or not willing to exert effort for it [SGLP01].
Another strategy similar to divergence is speech complementarity [SGLP01, GCC91]. Here speech is modified in a way that "accentuates valued sociolinguistic differences between interlocutors occupying different roles" [SGLP01, p. 35]. An example is a study by Hogg in which men and women changed their speech behaviour during dialogue with each other [Hog85]. Men were likely to use more masculine-sounding voices and women were likely to use a more female and soft-sounding voice than they did in dialogues in which the conversational partners were of the same sex. This phenomenon can be explained by traditional sex-role ideologies [GCC91]. A woman, for example, might try to gain a man's approval and want to seem attractive to him and thus use a soft voice, yet she might still converge to his dialect and/or other parameters. Thus convergence can be accompanied by speech complementarity.
Accommodation can occur in several dimensions, as unimodal or multimodal accommodation [SGLP01]. Unimodal accommodation describes a change on only one layer, whereas multimodal accommodation occurs on several layers. Additionally, accommodation can occur to different degrees: partial (interaction partners converge slightly to each other) or full (the behaviour of the interaction partners matches exactly) [SGLP01]. Interactions can also be symmetrical or asymmetrical [SGLP01, p. 37]. Symmetrical accommodation describes an interaction in which the interaction partners behave equally (e.g. two previously unknown people who come to be workmates [GP75, p. 177]), whereas in asymmetrical accommodation they do not (e.g. job candidate and interviewer [GP75, p. 176]). Additionally, the direction of accommodation can be described [SGLP01]. Unidirectional accommodation describes the process in which only one interaction partner accommodates in his or her behaviour; mutual accommodation means that both interaction partners accommodate.
Another important distinction is whether convergence or divergence is directed upwards or downwards [GP75, SGLP01, GO06]. Upward convergence occurs when one speaker adopts the prestige speech patterns of his or her conversational partner and thus becomes more similar to him or her. Gregory and Webster, for example, found that the voice pitch of Larry King, the host of the CNN (Cable News Network) talk show Larry King Live, was an indicator of the social status of his guests [GW96]. While Larry King would converge to high-status guests, lower-status guests tended to converge to him. Conversely, downward convergence implies that one conversational partner converges to the other, who possesses less prestigious patterns. Azuma found downward convergence in speeches of Japan's Emperor Hirohito after the Pacific War [Azu97]1. Upward divergence then "can be interpreted as indicating the sender's desire to appear superior to the receiver in social status and competence" [GP75, p. 178]. Hence the speaker wants to be recognized as having a higher social status. Downward movements also exist: downward divergence describes the "emphasis of one's low-prestige minority heritage" [GO06, p. 295] or "down-to-earthness [and] toughness" [GP75, p. 178].
Next to form, degree and direction of accommodation, speakers' motivation plays a role in CAT. Giles and Powesland distinguish between two kinds of factors that might affect the speech behaviour of individuals: endogenous factors and exogenous factors [GP75]. Endogenous factors concern the speaker's "physiological and emotional states at the time of interaction" [GP75, p. 119], i.e. factors that are internal to the speaker. Anxiety, for example, can influence a speaker's speech rate and pronunciation and can cause vocal disturbances as well. Exogenous factors, on the other hand, "are external to the sender but present in the immediate social situation" [GP75, p. 118], such as aspects of topic and context2.

1According to Shepard et al. this result could also reflect overaccommodation on the side of the conversational partners of Hirohito, who showed upward convergence, while Hirohito showed downward convergence [SGLP01, p. 37].
Besides the internal state of an individual and the context and topic of the conversation, attention must also be paid to the conversational partner. Depending on whom individuals are talking to, they adjust their speech and behaviour. When talking to an unknown person, one of the first clues to the characteristics of the conversational partner is his or her physical appearance. Thus one hypothesis is that sex/gender [NNS02, Par06, Lew12, Bab12], race [Bab12] and social status [GW96] might play a crucial role in accommodation [GP75]. Up to now there is no clear evidence that these "macro social factors" [ACG+11, p. 192] are significant. Furthermore, the experiment by Abrego-Collier and colleagues showed that neither the gender nor the sexual orientation of the speaker was significant, but rather the personal opinion about the speaker, which is formed situationally and influences the direction of accommodation [ACG+11]. Related speaker motivations include the one proposed in the Similarity Attraction Hypothesis [Byr71], which states that individuals try to be more similar to people whom they are attracted to. Other proposed motivations are the speaker's need to gain approval from the conversational partner [SG82], the speaker's concern for arranging the conversation unproblematically and smoothly [GGJ+95] and/or the wish to accomplish mutual goals (as a joint project) [Cla96]. It is also possible that a speaker's intelligibility might increase during the interaction [Tri60], as might his or her desire to reduce the social distance [SGLP01] and to achieve interpersonal liking [CB99, p. 901].
In addition, intervisibility influences accommodation/convergence. Differences in speech behaviour occurred when individuals were able to see the person they were listening to. Babel found more convergence in vowel spectra when the participants in her experiment could see a picture of the speaker who produced the signal [Bab12], and Schweitzer and Lewandowski observed a higher articulation rate when individuals were able to see the conversational partner they were talking to [SL13].

In the following chapter phonetic convergence is described, which is "the process in which a talker acquires acoustic characteristics of the individual they are interacting with" [Bab09, p. 3].

2Since the topic can influence the individual's emotional state, the distinction between endogenous and exogenous factors can be indistinct.
2.2 Phonetic convergence
Phonetic convergence is defined as the increase of segmental and suprasegmental similarities in speech [Par06] (it is also called phonetic imitation or phonetic accommodation [Bab09, Bab12]). In this process individuals take on the acoustic characteristics of their conversational partner [Bab09].
When investigating how conversational partners accommodate/converge, one should also consider the link between perception and production [SF97]. A person has to perceive the speech of the conversational partner in order to reuse its features in his or her own. It is an open question which features a person relies on. Several studies have investigated individual features:

• VOT [Nie07, Nie10, SF97]

• speech rate [SL12, Web70]

• amplitude: amplitude envelopes [Lew12] and amplitude contour [Gre90]

• intensity [GH82, Nat75, LH11]

• fundamental frequency [GW96, GDW97, BB11, LH11]

• vowel quality [Bab09, Par10, PGSK12]

• jitter and shimmer [LH11]

• harmonics [LH11]
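Two of the features in this list, jitter and shimmer, are cycle-to-cycle perturbation measures (Chapter 4 describes the versions reported by Praat). A minimal sketch of the "local" form of such a measure, using hypothetical glottal period durations:

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive cycle values,
    divided by the mean value. This is the form of 'local' jitter
    (computed on period durations) and 'local' shimmer (computed on
    peak amplitudes)."""
    diffs = [abs(a - b) for a, b in zip(values[1:], values[:-1])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Hypothetical glottal period durations (seconds) for five cycles
periods = [0.0050, 0.0051, 0.0049, 0.0050, 0.0052]
jitter = local_perturbation(periods)  # dimensionless; often reported in %
```

The same function applied to a sequence of per-cycle peak amplitudes would yield local shimmer; a perfectly periodic voice gives 0 for both.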
It might also be possible that speakers perceive (and reuse) not only single features but also different combinations of features. Speaker-specific differences can also occur. In order to understand how individuals are able to reuse the perceived features, one has to consider models of the link between perception and production. These are discussed in Chapter 3 (The perception of voice).

When dealing with the perception and production of acoustic parameters, an important question arises: to what extent are processes of convergence
controllable by the speaker, to what extent are they automatic and what influence do social factors have? Communication Accommodation Theory (see Chapter 2.1) regards changes in behaviour and speech as socially motivated and thus probably subconscious [Bab09, p. 20], whereas the Interactive Alignment Model by Pickering and Garrod assumes a direct coupling between language perception and language production processes [PG04].
The model is based on the assumption of Dijksterhuis and Bargh that imitation is purely based on perception and that "no motivation is required, nor a conscious decision" [DB01, p. 32]. Imitation is therefore a natural and automatic process, with the caveat that other processes can intervene and inhibit the process of accommodation. Pickering and Garrod adopt this assumption for speech behaviour (in dialogues) and assume that "production and comprehension become tightly coupled in a way that leads to the automatic alignment of linguistic representations at many levels" [PG04, p. 170] (e.g. the syntactic, semantic and phonological levels). Due to the interconnection of these levels, the alignment of one level automatically leads to the alignment of other levels. Agreement on these levels leads to a mutual understanding between the conversational partners. Aligned situation models, which contain information about "space, time, causality, intentionality and currently relevant individuals" [GP04, p. 8], are the result of the interaction of the conversational partners [PG04].
Figure 2.1 displays the proposed channels of alignment (horizontal, dashed arrows). The "channels are direct and automatic" [PG04, p. 177] and operate via priming. Priming describes the process by which the "activation of a representation in one interlocutor leads to the activation of the matching representation in the other interlocutor directly" [PG04, p. 177]. A representation used for the purposes of comprehension can then be reused for production. Pickering and Garrod call this parity [PG04, p. 177]. If parity exists, the neuronal infrastructure for speaking and listening should be the same [MPG12]. Evidence for this has been found by Menenti et al. [MPG12] and by Garnier and colleagues [GLS13]. Proof of automatic and unconscious alignment was also found by Lewandowski [Lew12]. In her experiments with dialogues of native and non-native speakers, the native speakers were explicitly instructed not to converge to the non-native speakers, but the results indicate that they nevertheless did so.
Figure 2.1: Interactive Alignment Model proposed by Pickering and Garrod [MPG12, p. 2]. The different schematic levels of comprehension and production are linked with each other. The dashed lines represent the channels of alignment.

In summary, the IAM proposes that the link of perception and production
is direct and that alignment is an automatic process. The model has been criticized because it contains no processes or steps that could counteract automatic alignment. According to CAT, three different strategies exist: convergence, divergence and maintenance (in order to regulate social distance). If alignment were a fully automatic process, maintenance and divergence would consequently not be possible, because speakers would only be able to converge.

Indeed, evidence has been found that linguistic knowledge has an influence on convergence. Nielsen found that participants in her experiment imitated an extended voice onset time (VOT) after listening to words with extended VOT on the phoneme /p/, but they did not imitate reduced VOT [Nie07, Nie10]. She suggests that imitation of the reduced VOT would have introduced phonological ambiguity with the corresponding voiced plosives, while there were no such ambiguities in imitating extended VOT. Babel also found that speakers did not imitate all vowels heard from a model talker [Bab09]. These findings indicate that phonetic imitation is not an automatic process but a selective one that can be modulated by linguistic features. In addition, ratings of mutual attractiveness and/or liking have been found to be influential [ACG+11, Nat75, PG08, SL13].
Hence Krauss and Pardo emphasized the need for a hybrid model "in which alignment or imitation derives from both the kinds of automatic processes they [Pickering and Garrod] describe and processes that are more direct or reflective" [KP04, p. 203], so that socially motivated and conscious changes in speech can also be considered in combination with the unconscious and automatic aspects of alignment. Lewandowski proposes such a hybrid model [Lew12], which covers different kinds of factors that can influence convergence (see Figure 2.2).
Figure 2.2: Hybrid model of convergence, including automatic and subconscious factors [Lew12, p. 205].
In the proposed hybrid model, automatic processes and those said to be conscious or rather subconscious are merged. In addition, social aspects, the situational context and the speaker's personality and abilities are integrated. The model shows that the convergence mechanism automatically yields convergence, but can be influenced or attenuated through the evaluation of the dialogue partner, the situational context and the speaker himself, including his personality, psychological features, linguistic prerequisites and phonetic talent, as it has been shown that more phonetically talented speakers converge more to their partners than less talented ones (in native-nonnative dialogues) [Lew12]. Also important are the linguistic prerequisites that speakers have concerning different languages and dialects [Lew12, p. 205]; linguistic structures are stored within the speaker as well. Individual differences (personality and psychological features) of the single speakers form the frame for the degree of convergence. The evaluation of the dialogue partner (e.g. social status, attractiveness, friendliness and sympathy), the situational context and social goals like the need for social approval are the parameters on which a speaker
evaluates whether to converge, diverge or maintain.
Another important impact on convergence is attention, which “may adjust the grain of perceptual resolution” [Par06, p. 2389]. If listeners are distracted and/or do not listen carefully to their conversational partner, speech may not be perceived in a detailed way. Memory also plays an important role, as perceived speech modalities and contents are stored there. Experiments by Gordon and colleagues showed that indeed “attention plays a role in the perception of phonetic segments and that the relative importance of acoustic cues depends on the amount of attention that is devoted to the speech stimulus” [GER93, p. 33].
3 The perception of voice
An utterance contains not only linguistic information “but also a great deal
of information for the listener about the characteristics of the speaker himself” [Lav68, p. 43]. Thus utterances convey information from a speaker to a listener on several layers: an abstract form, containing the structures of a language, and a concrete form, the produced sounds. Many theories
therefore distinguish the concepts of language, the abstract system of gram-
mar, and speech, the vocal-motor-performance [KS11, p. 303] 1. The layer of
language contains several levels of linguistic information (e.g. phonological, morphological, syntactic and semantic levels) which then represent the content of an utterance [LP05, p. 203]. In other words, the speaker structures
and arranges the intended information, so that the listener can extract the in-
formation. The layer of speech, on the other hand, conveys information about
the speaker him- or herself (alongside other channels, e.g. facial expressions, gestures, posture, etc.). Abercrombie refers to these conveyed features as indexical properties which “[. . . ] may fulfil other functions which may sometimes
even be more important than linguistic communication, and which can never
be completely ignored” [Abe67, p. 5].
The two concepts of language and speech are realised simultaneously: the speaker generates an acoustic signal, and the listener receives this signal and can then extract the information from both sources. Taking the broad definition of voice2 into account, the term voice includes the details
of the vocal fold motions and “the acoustic results of the coordinated action
of the respiratory system, jaw, lips and soft palate, both with respect to their
average values and the amount and pattern of variability in values over time”
[KS11, p. 6]. In this sense voice shares many similarities with the layer of
speech just described.
1 Ferdinand de Saussure distinguishes langue et parole [Sau01]. John Laver distinguishes between the paralinguistic, extralinguistic and linguistic layers [Lav03].
2 Under the narrow definition, voice is distinguished from speech and is synonymous with the term laryngeal source, relating exclusively to the vocal fold vibrations [KS11, p. 5].
3.1 Listeners’ judgements from voice perception
Based on the perception of the conversational partner’s verbal and non-verbal behaviour, people are “obliged to make a continuous stream of judgements [. . . ] about a wide spectrum of information” [Lav76/97, p. 31]. Listeners are not only able to draw conclusions from the content of what their
conversational partner is conveying, but also get an impression of his or her
personal characteristics - such as his or her physical and psychological char-
acteristics and social attributes. These conclusions about the conversational
partner “shape our own behaviour into an appropriate relationship with him”
[Lav76/97, p. 31].
Table 3.1 shows some judgements that listeners can make when listening
to voices [KS11, p. 2]. Through the perception of voice, listeners get impressions about physiological characteristics of the speaker, such as age and height, psychological characteristics, such as arousal and stress, as well as social characteristics, such as social status. These judgements are not necessarily accurate [KS11, p. 1], but they nevertheless affect subsequent interaction.
In the Chapters 3.1.1-3.1.3 below, characteristics of the speaker that listeners
can perceive are presented.
3.1.1 Physical characteristics of the speaker
Every vocal tract shows individual differences, which lead to acoustically distinctive speech productions. These differences arise from anatomical properties (such as the size and shape of the vocal tract), from which listeners can draw conclusions about the speaker's age, sex, body size and body height. Krauss and colleagues found evidence that listeners can identify a speaker from his or her voice (given two photos to choose from) better than chance (76.5 %) and that they can estimate age and height only slightly less accurately than from a photo [KFM02]. A study by Ryalls and colleagues showed that older people exhibit shorter average positive VOT values for unvoiced plosives [RST04]. They explain these findings by the decreased lung volumes that older speakers often have and the consequently lower speaking rate. Studies also showed the development
Physical characteristics of the speaker:
Age; Appearance (height, weight, attractiveness); Dental/oral/nasal status; Health status; Vocal fatigue; Intoxication; Race, ethnicity; Sex; Smoker/non-smoker

Psychological characteristics of the speaker:
Arousal (relaxed, hurried); Competence; Emotional status/mood; Intelligence; Personality; Psychiatric status; Stress; Truthfulness

Social characteristics of the speaker:
Education; Occupation; Regional origin; Role in conversational setting; Social status; Sexual orientation

Table 3.1: Judgements listeners can make from voice perception [KS11, p. 2], modified.
of fundamental frequency with age in men and women to be caused by different hormone ratios [TE95].
In discriminating race, a distinction is made between racial profiling, which is “based on visual cues that result in the confirmation of or in speculation concerning the racial background of an individual or individuals” [Bau00, p. 363], and linguistic profiling, which is “based upon auditory cues that may be used to identify an individual or individuals as belonging to a linguistic subgroup within a given speech community, including a racial subgroup” [Bau00, p. 396].
Concerning linguistic profiling, evidence has been found that listeners can indeed extract such information from acoustic cues. Newman and Wu asked New Yorkers to identify the race and national heritage of other New Yorkers [NW11]. Participants heard speech samples from Chinese, Korean, European, Latino and African Americans and categorized these as black, white, Hispanic or Asian. Results indicated that these judgements were better than chance. A subsequent phonetic analysis showed that the speakers differed in VOT, breathiness of voice, the production of /E/ and /r/ as well
as in rhythm. Ryalls and colleagues also observed that participants with different ethnic backgrounds (Afro-American and Caucasian-American) differed in their durations of positive and negative VOT [RZB97]. In addition, their study showed that sex also plays a role in the durations of positive and negative VOT. Another typical difference between men and women is the fundamental frequency: typical values are 120 Hz for men and 210 Hz for women [TE95]3.
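Fundamental frequency values like those cited above can be estimated directly from a recording; a minimal sketch of the common autocorrelation method, applied to a synthetic 120 Hz signal (the function name, parameter defaults and test signal are illustrative, not taken from a specific toolkit):

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate fundamental frequency via the autocorrelation method."""
    sig = signal - np.mean(signal)
    # Autocorrelation for non-negative lags only.
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lag_min = int(sample_rate / fmax)  # shortest plausible pitch period
    lag_max = int(sample_rate / fmin)  # longest plausible pitch period
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag

# Synthetic "male" voice: 120 Hz fundamental plus one harmonic.
sr = 16000
t = np.arange(int(0.2 * sr)) / sr
voice = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
print(f"estimated F0: {estimate_f0(voice, sr):.1f} Hz")  # close to 120
```

Restricting the lag search to a plausible pitch range keeps the estimator from locking onto harmonics; real speech would additionally require framewise analysis and voicing detection.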
Along with age, sex, race and ethnicity, the health status of the speaker can be heard, for example organic vocal diseases like polyps (a lesion on the anterior third of the vocal fold [WJM+04, p. 125]) or vocal fold nodules (small, symmetric lesions occurring on both sides of the vocal folds [WJM+04, p. 125]). They can be caused by vocal overuse like singing, shouting and long
and loud usage of voice. Petrovic-Lazic and colleagues found evidence that patients with polyps had different values for jitter, shimmer, variation of the fundamental frequency and harmonics-to-noise ratio (among others) than speakers without polyps, and that these values improved after surgery [PLB+09].
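In their simplest (“local”) form, jitter and shimmer are easy to compute once the individual glottal cycles have been segmented; a sketch using hypothetical cycle-by-cycle measurements (the period and amplitude values below are invented for illustration):

```python
def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, relative to the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """Local shimmer: the same measure applied to peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

# Hypothetical measurements: periods in seconds, amplitudes in arbitrary units.
periods = [0.00830, 0.00838, 0.00827, 0.00835, 0.00831]
amps = [0.81, 0.79, 0.82, 0.80, 0.81]
print(f"jitter  = {local_jitter(periods):.2%}")
print(f"shimmer = {local_shimmer(amps):.2%}")
```

Both measures express short-term cycle-to-cycle irregularity as a percentage of the mean, which is why pathologies such as polyps tend to raise them.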
The dental, oral and nasal status can also be heard from voice, because the place and manner of articulation play a role in the production of (consonant) sounds. Since different sounds are produced through the modification of the articulators (e.g. lips, teeth, parts of the tongue, glottis) and through different manners of articulation (e.g. the production of nasals, plosives, fricatives), it is possible to deduce information about the status of the speech organs. During a cold or an allergic reaction, for example, the mucous membranes of the upper respiratory system are swollen, leading to a nasal voice [Lav68, p. 47].
Vocal fatigue, defined as “negative vocal adaptation that occurs as a con-
sequence of prolonged voice use” [WM03, p. 22] such as singing or reading
out loud, also affects voice. Evidence shows that mean fundamental frequency
increases after long periods of reading out text [SSL94]. Reported effects on
jitter, shimmer and harmonics-to-noise ratio differ between studies and exper-
imental settings.
3 Statements about the fundamental frequency and the standard deviation differ in the literature ([TE95]).
In addition, smoking affects the quality of voice. Several studies showed
that smoking leads to a lowered mean fundamental frequency of speakers
[SH82, MD87] and also differences in harmonics-to-noise ratio and shimmer
[Bra94]. Intoxication can also be perceived by listeners. Alcohol affects hu-
man cognitive, motor and sensory processes and thus leads also to changes
in speech and voice [KS11, p. 357]. Schiel found that listeners can detect al-
coholic intoxication better in female voices than in male voices and better in
read speech than in spontaneous speech [Schi11]. Hollien and colleagues found that a rise in fundamental frequency, a slower speaking rate and nonfluencies in speech can be observed in intoxicated participants [HDJ+01].
3.1.2 Psychological characteristics of the speaker
Studying the effects of emotion on voice is difficult because questions arise as to what exactly emotions are, how many of them exist and what kinds of fine-grained distinctions can be made (e.g. anger can be expressed in a tempered, controlled way or in an uncontrolled way, as in a rage attack). Studies have shown a tendency for the emotions sadness, fear, anger, happiness and boredom to affect acoustic parameters. A summary [KS11, p. 321] can be seen in Table 3.2.
F0 mean: sadness, slightly lower; fear, very much higher; anger, very much higher; joy/happiness, much higher; boredom, lower or normal.
F0 range: sadness, (slightly) more monotone; fear, wider or narrower or normal; anger, much wider; joy/happiness, much wider; boredom, more monotone.
F0 variability/contour: sadness, downward inflections; fear, normal or abrupt changes/upward inflections; anger, abrupt upward changes; joy/happiness, smooth upward changes; boredom, less variability.
Intensity: sadness, quieter; fear, normal; anger, louder; joy/happiness, louder; boredom, quieter.
Speaking rate: sadness, slightly lower; fear, much faster; anger, slightly faster; joy/happiness, faster or slower; boredom, slower.
Spectral slope: sadness, less high-frequency energy; fear, more or less; anger, more; joy/happiness, more; boredom, less.

Table 3.2: Relations between acoustic parameters and emotions, adapted from [KS11, p. 321].
The emotions mentioned in Table 3.2, such as sadness and joy/happiness,
can be divided into more active and more passive emotions which can be
distinguished through the level of arousal. Fear, anger, joy and happiness
can be grouped as activation emotions. They exhibit a higher fundamental
frequency (F0), more fundamental frequency variability, faster speech rate,
increased intensity and increases in high-frequency energy [KS11, p. 325]. It
was concluded that there is a direct relation between arousal and physiological effect [Sch86, WS72]. Emotional arousal causes an increase in respiration rate as well as changes in articulation and phonation [Sch03, p. 229]. This leads to an increase in subglottal pressure, F0 and intensity. Similarly, speech durations between breaths are shortened and the typical speaking rhythm of a speaker may be altered [KS11, p. 325]. Stress, as an emotional load, also exhibits a rise in F0 and its standard deviation [WJE+02] as well as in amplitude [SMA+82]. More relaxed emotions, like sadness or quiet happiness, induce a
decreased motor control and less articulatory precision [WS72, p. 1239], which
could lead to increased jitter and shimmer and changes in the contour of the
fundamental frequency [KS11, p. 325].
It can be assumed that acoustic measures are useful for identifying arousal, but do not distinguish between the different emotions. Evidence for this has
been found by Laukka and colleagues [LJB05]. In their experiment, model
talkers spoke a sentence with different emotions (anger, fear, disgust, happiness
and sadness). Afterwards several vocal cues were measured and additionally
listeners had to rate the samples heard on different dimensions (activation,
valence, potency and emotional intensity). The results of their experiment
indicate that cues of activation and emotional intensity largely overlap (e.g.
for high mean and maximum F0, large variability of F0, mean and standard
deviation of intensity). No relation was found between the listeners' ratings (whether the sentence was positive or negative) and the acoustic cues. Laukka et al. conclude that it is possible that they did
not capture all cues that listeners use for their judgements.
Goudbeek and Scherer found that valence and potency might nevertheless be expressed in voice, although arousal is dominant for many acoustic parameters [GS10]. When the level of arousal was low, valence was reflected in spectral slope and variability of intensity, and potency was expressed through shimmer. At a high level of arousal, valence was reflected in spectral slope and intensity variability, as it was at low arousal, and additionally in intensity level and spectral noise.
Potency in high levels of arousal was expressed through variability of inten-
sity, spectral shape and the level of fundamental frequency. These findings
indicate that valence and potency can be measured, although they are depen-
dent on arousal. Nevertheless individuals also seem to vary in their expression
of emotion [WS72].
Furthermore, there is evidence that verbal and emotional messages interact
in production. In the experiment of Nygaard and Queen participants were
exposed to spoken words with a happy, sad or neutral meaning (e.g. comedy
and cancer) [NQ08]. The tone of voice of the model talker was thereby congruent or incongruent with the word's meaning. Participants were able to repeat the words they heard more quickly when these were spoken in a tone of voice that matched their meaning.
Additionally, there is evidence that the perception of emotions in voice is culture- and language-specific [SBW01]. Scherer et al. let participants from nine
different countries in Europe, North America and Asia listen to speech samples
produced by German actors, spoken in different emotions (anger, sadness, joy,
fear and neutral tone). After listening to a sample, participants assigned up to two emotion labels to it. Overall, participants from this
experiment were able to infer the emotions with a degree of accuracy better
than chance. In addition, accuracy decreased with increasing language dissim-
ilarity from German. From these results, Scherer et al. concluded that there
are culture- and language-specific paralinguistic patterns which influence the
decoding process in the perception of the listeners.
Another judgement listeners make while listening to voices is about the
personality of a person. Personality itself is a broad concept [KS11, p. 343].
In the following the definition of Kreiman and Sidtis will be used, describing
emotion as a transient state and personality as “the more enduring (but by
no means stable or permanent) aspects of the manner in which individuals
respond to stimuli” [KS11, p. 342].
Evidence that listeners make judgements about the speaker’s personality
was found by Pear [Pea31]. He discovered the existence and importance of vo-
cal stereotyping. In his study judgements about personality were made from
voices of nine different speakers over the radio. Listeners were very accurate at guessing the speakers' sex (except for the eleven-year-old child) and age, and partly accurate at guessing their profession (especially for the actor and the judge). Errors in guessing the profession of the speakers were
nevertheless more or less consistent. More than half of the listeners thought
that the first speaker, a police detective, was working on a farm and most of
them believed that the eighth speaker, an electrical engineer, worked in a man-
ual trade. Although the assumptions of the listeners were not always correct,
they drew conclusions from the voices. Pear thus considers stereotyping as an
important aspect of the perception of personality from voice.
Ko and colleagues tried to determine what information about the speaker,
especially about the speaker’s competence and his or her warmth, can be
inferred by listeners [KJS09]. In their experiment male and female speakers
had to read out resumes with stereotypically masculine or feminine content.
Ratings of competence (associated with being assertive and decisive) were solely affected by vocal femininity (not by sex or type of resume); voices rated low in femininity were thereby perceived as more competent. Warmth, associated with being supportive and caring, on the other hand, correlated with highly feminine voices. In addition, Ko et al. tested whether femininity correlated with babyishness and found that vocal femininity had an overlap with vocal cues of babyishness (associated with weakness and incompetence).
In a study by Berry et al., similar results were achieved [BHL+94]. Listeners rated voice samples of five-year-old children counting to ten according to competence, leadership, dominance, honesty, warmth, attractiveness and babyishness. Attractive voices suggested a competent and warm personality, and for boys' voices additionally leadership qualities. Babyish voices of boys and girls were associated with less competence, dominance and leadership, but with honesty, and for boys' voices additionally with warmth.
Competence is also related to speaking rate - faster rates of speech were
associated with higher competence [SBS75] and social attractiveness [SBP83],
although it needs to be remarked that listeners may perceive speech rate relative to their own habitual speech rate [SBP83]. Thus judgements about competence and social attractiveness might depend on both speaker and listener.
In an experiment by Fay and Middleton, listeners judged speakers' intelligence from read speech [FM40]. To relate the judgements to the speakers' actual intelligence, the intelligence quotient (IQ) of each speaker was measured. Results indicate that listeners were fairly reliable in their judgements, but that they could judge the intelligence of some speakers more reliably than that of others, and that they rated more intelligent
speakers as more intelligent in most of the cases. Another result was that listeners seemed to develop stereotypes of superior and inferior intelligence.
The studies described above show that listeners can consistently rate personality traits from voices, although these ratings might not always be accurate, and that listeners derive stereotypes from voice quality. McAleer et al. also found that personality judgements were consistent across the judges of their experiment when listening to the word ’hello’ produced by different speakers [MATB14].
Additionally, there is evidence that psychiatric diseases often influence voice
parameters. Laukka et al. found that speakers who were afraid of speak-
ing in public due to social phobias exhibited a change in acoustic parameters
after treatment (mean and maximum fundamental frequency, high-frequency
components in the spectrum and pauses) [LLA+08]. Speakers with depression tended to exhibit an increase in speech rate and a decrease in pausing with progressing treatment and mood change, as well as a decrease in minimum fundamental frequency for women [ES96]. Listeners in the experiment of Todt
and Howell were able to significantly distinguish schizophrenic voices from
non-schizophrenic voices, describing them as more inefficient, despondent, and
moody [TH80]. The relation between psychiatric disease and voice characteristics is a difficult field of study, as individual differences, inconsistent longitudinal changes and subjective diagnostic criteria complicate such studies [KS11, p. 357].
3.1.3 Social characteristics of the speaker
Analyses showed that voices can be associated with social groups. In an experiment by Moreau and colleagues, Senegalese and European listeners heard recordings of Senegalese speakers of Wolof, a language mainly spoken in northern Senegal, and had to make statements about the speakers' social and caste status [MTHH14]. Both Senegalese listeners and European listeners, who had no prior knowledge of the Wolof language, were able to classify the social status better than chance. The results of this experiment indicate that listeners can make statements about the social status of a speaker largely independently of knowledge about language structures, and thus relying on acoustic parameters.
Esling, who recorded male speakers in Edinburgh, found that vocal settings
(modal, creaky, whispery, breathy, harsh voice and combinations of them),
defined by Laver [Lav68], correlated with the socio-economic status of the
speaker [Esl78]. Creaky voices were associated with a higher social status
whereas harsh voices were associated with lower social status.
Gregory and Webster, who analysed speech with long-term averaged spectra, found that accommodation was dependent on social status [GW96]. Speakers with a lower social status converged to speakers with
a higher social status. Dominance also played a role in the experiment of
Gregory and Gallagher [GG02]. They analysed the fundamental frequency of
US presidential candidates in eight elections and discovered that the candidate who did not converge in his fundamental frequency was likely to win the election. Besides dominance, the role in the conversational setting influences
speech. Pardo found that instruction givers, who had to explain a path in a
map task to another participant of the experiment, converged to the receivers
[Par06]. It is still an open question whether the sex of the speakers plays a role in the perception of speaker role, as inconclusive results have been obtained regarding this matter [NNS02, PJK10, Lew12].
Linville showed in a perception experiment that female listeners were capable of making accurate judgements regarding men's sexual orientation (straight or gay) [Lin98]. The acoustic analysis found evidence that the phoneme /s/ was produced differently by straight and gay men. Although listeners tend to perceive the sexual orientation of their conversational partner (or a model talker), it does not seem to have any influence on accommodation. Abrego-Collier et
al. let participants of their experiment listen to a first-person narrative about
going on a date with different outcomes [ACG+11]. In the negative version the speaker abandons his date and goes home alone; in the positive version he goes on the date and they leave together. Additionally, the narrative differs
in the sexual orientation of the speaker (“straight” and “gay” condition). In
the experiment VOT of plosives was extended in the recordings. The results
of the experiment indicate that participants with a positive opinion about the
speaker showed an increase in VOT after the experiment. Neither the outcome of the date nor the sexual orientation is suggested to have had an influence on the participants.
Regional origins, as for example expressed by dialects or accents, can also be
perceived by the listener. Nasalisation characterizes most speakers of Received
Pronunciation (RP) in England, several accents from the United States of
America and also Australia. Velarisation, on the other hand, functions as a regional marker for speakers from Birmingham, England, and parts of New York [Lav68, p. 50].
3.2 Voice and exemplar theory
As the previous chapters (Chapters 3.1.1-3.1.3) show, listeners are capable of inferring several pieces of information about a speaker from voice perception. The judgements they make are not necessarily accurate, but they nevertheless lead to the formation of an opinion about the model talker or conversational partner, which guides further interaction.
In order to evaluate the conversational partner and to accommodate/converge to his or her speech, listeners have to perceive (parameters of) their conversational partner's speech; converging means that listeners adopt some of the acoustic-phonetic features of their conversational partner. Therefore, there has to be a link between the perception of the characteristics of the speech of
others and one’s own production. Many different models exist that propose
how perception and production are coupled. The present chapter deals with
the perception and storage of voice and voice details. These processes can be
well explained by exemplar theory.
In abstractionist theories the process of normalisation, mapping from speaker-
specific to a speaker-neutral abstraction, is assumed. Thereby voice informa-
tion would be discarded during speech perception and ideal, modality-free units
would be stored in the mental lexicon. In the vocabulary introduced above, this would mean that only the language layer (structure and content) of the perceived acoustic signal would be kept in memory, excluding the speech layer (information about the speaker himself, including voice details). Indeed, some experiments show that talker variability plays a role in memory, which contradicts the process of normalisation. Exemplar theory suggests
that individuals store detailed instances in memory and that they are able to
compare a new stimulus with these stored instances for perception.
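The comparison of a new stimulus against stored traces can be made concrete with a toy computation in the style of similarity-based exemplar models; the feature dimensions, values and the similarity parameter below are invented for illustration and do not come from a specific study:

```python
import math

# Toy exemplar memory: each trace pairs a feature vector (here, a
# hypothetical F0 in Hz and speaking rate in syllables/s) with a label.
memory = [
    ((118.0, 4.1), "speaker_A"),
    ((125.0, 3.9), "speaker_A"),
    ((208.0, 5.2), "speaker_B"),
    ((215.0, 5.0), "speaker_B"),
]

def similarity(x, exemplar, c=0.05):
    """Similarity decays exponentially with distance to a stored trace."""
    return math.exp(-c * math.dist(x, exemplar))

def classify(x):
    """Label a new stimulus by its summed similarity to all stored traces."""
    scores = {}
    for exemplar, label in memory:
        scores[label] = scores.get(label, 0.0) + similarity(x, exemplar)
    return max(scores, key=scores.get)

print(classify((121.0, 4.0)))  # closest to speaker_A's traces
```

Because every stored trace contributes to the score, adding more exemplars of a voice makes matching stimuli easier to categorize, which is the intuition behind frequency effects in exemplar accounts.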
Van Lancker et al. found that listeners were able to recognize voices, even
when played backwards (and thus without any language-specific and phonetic
information) [VLKE85]. Martin et al. had participants try to recall a word
list with monosyllabic English words in the right order [MMPS89]. The word
list was either produced by one speaker or by ten different ones (each word spoken by a different speaker). The result of this experiment was that recall was better for the participants who listened to the word list produced by one speaker, but only for items that occurred early in the word list.
In a further experiment, Martin et al. had participants run through an additional preload memory task. Participants were shown digits on a screen and afterwards had to listen to the word list from the previous experiment in one of the different conditions. The result of this experiment was that recall of the digits was better if the word list heard afterwards was produced by one speaker. Martin et al. suggested that the processing of words produced by different speakers requires more working-memory resources than that of words produced by one speaker.
Goldinger et al. were able to replicate some of the results of Martin et al. [GPL91]. Participants in their experiments also had to recall a word list with ten items in the right order. The word lists were likewise produced either by one speaker or by different speakers (as in the experiment by Martin et al.). The words were then presented to the participants at varying speeds. At relatively fast presentation rates, the results were comparable to those of Martin et al.: recall was better for the word lists produced by one speaker than for those produced by different speakers. In contrast, recall was better for word lists produced by multiple speakers when they were played at slow presentation rates. Goldinger et al. concluded that voice information (along
with lexical information) is retained in long-term memory when participants
are given sufficient time during rehearsal and that it facilitates the retrieval of
words.
These results were corroborated by Lightfoot, who conducted experiments on the familiarity of speakers' voices [Lig89]. Participants in her experiments were trained to recognize the voices of different speakers, with fictional names associated with the voices. Word lists produced by different speakers were then recalled better than word lists produced by single speakers, even at relatively fast presentation rates. Nygaard et al. also showed that familiarity with speakers' voices facilitates the recognition of novel words produced by familiar voices [NSP94]. They concluded that speaker-specific information was encoded and retained in long-term memory.
Palmeri et al. also investigated the relationship between word recognition
and the memory of voices [PGP93]. In their experiment, old and new words
were presented to the participants, who decided if the word they heard was
new (played for the first time) or old (repetition of an already heard word).
The lag between repetitions was manipulated: up to 64 intervening words occurred between the repetitions of a word. The words in the presented lists were also uttered by different speakers - participants heard 2, 6, 12 or 20 different voices in one trial (half male and half female). As in the previously mentioned experiments, repetitions of words uttered by the same voice were better recognized than those uttered by different voices (independent of sex). Increasing the number of different voices up to 20 had no effect on the participants' performance either. Thus it can be supposed that listeners do not strategically encode voices, because otherwise the increase from 2 to 20 voices should have impaired their ability to do so. Palmeri et al. thus propose
automatic voice encoding and that detailed voice information is part of the
representations of spoken words that are retained in long-term memory.
Goldinger then investigated how long voice-specific details remain in memory [Gol96]. To this end, he tested whether participants could still distinguish old and new words after delays of five minutes, one day and one week, and whether the advantage of same-voice repetitions over different-voice repetitions persisted over time. The results indicate that voice effects were retained for up to one week, but that they decreased over time (7.5 % after five minutes, 4.1 % after one day, 1.6 % after one week). These findings indicate that episodic traces not only affect memory, but also influence later perception. Goldinger defines these episodic traces as “complex perceptual-cognitive objects, jointly specified by perceptual forms and linguistic functions” [Gol96, p. 1179]. He thus supposes that words are recognized against a background of detailed traces and that the mental lexicon can be viewed as episodic, with memories that decay over time.
Goldinger then tested whether these traces affect not only later perception but also speech production [Gol98]. Participants were asked to produce words that were shown to them on a computer screen. The words differed in frequency (high, medium-high, medium-low and low frequency), following the assumption that low-frequency words are affected more easily because they are represented by fewer exemplars than high-frequency words [Pie01, Hin86]. Afterwards the participants listened to different speakers who produced the same words several times (2, 6 or 12 repetitions). Participants then heard the words again and repeated them in an immediate or delayed shadowing task. In a subsequent AXB perception test the baseline productions (A) were compared against the shadowed productions (B): listeners had to judge which of the two sounded more similar to the stimulus word from the shadowing task (X). The results indicate that participants sounded more like the samples they had been exposed to, and that low-frequency words heard with a high repetition rate indeed evoked strong imitation in the immediate shadowing condition. Goldinger thus supposes that there are different degrees of imitation and that variables such as word frequency, number of exposures and response timing play a role.
It is worth noting that a distinction should be made between imitation
and convergence. “Imitation is a fully conscious and controlled action in a
controlled setting, whereas convergence happens rather naturally and without
full awareness or control” [Lew12, p. 79]. Nevertheless, one can assume that
similar tendencies exist for both.
Furthermore, Goldinger explored the relationship between speech imitation and attention [Gol13]. He had participants record baseline productions of words. Afterwards they heard a speaker utter one of these words and had to click on the corresponding picture from a collection displayed on a screen. The experiment had different conditions: in the first, the competitor objects on the screen were dissimilar in visual and phonological form; in the second, the objects were phonologically similar (e.g. beetle, beater, beaker, beachball); in the third, the objects were visually similar (e.g. all objects rounded, such as cookie, coin, pizza). In a subsequent AXB perception test, listeners rated whether the baseline production or the re-recorded production was more similar to the target word. The results indicate that imitation increased when competitors (visual or phonological) were present and that this increase was even stronger for phonologically similar objects. Goldinger supposes that attention to the speech signal was modulated by the difficulty of the search task: if participants needed to monitor speech carefully to locate the appropriate targets, they created episodic traces very rich in detail that supported better imitation.
The findings described above support the idea that the mental lexicon is composed of detailed traces of past experiences. Thus, not abstract units but rather individual exemplars are stored, which then serve as the basis of recognition and also of production (see Figure 3.1). Each exemplar has an associated strength: exemplars from recent experiences and those that occurred more frequently are more vivid in memory than exemplars that were perceived longer ago and less frequently [Pie01].

Figure 3.1: Perception-production loop in exemplar theory [Schw12, p. 52]. When a stimulus is perceived, it is associated with similar productions already stored (which can be regarded as a category). In production, a set of stored exemplars is used to generate the production target.
For production, Goldinger proposes that the mean of an activated set is selected, which creates a “generic echo” [Gol97, p. 42]. Speaker-specific information is encoded in episodic traces (alongside lexical information), including details about physical, psychological and social characteristics (see Chapters 3.1.1-3.1.3). These traces are perceived and stored in the memory of the listener and can be reused for production. The degree to which this process is automatic, conscious and controllable is still an open question (see Chapter 2.2). Moreover, a “perfect” phonetic imitation of the conversational partner's speech is impossible: speakers are limited mainly by the configuration of their articulatory tract, and even two productions by a single talker are not acoustically identical [KP04, FBSW03].
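The perception-production loop described above can be illustrated with a toy model: stored traces are activated according to their similarity to a stimulus and their recency, and the production target is the activation-weighted mean of the set. The function name, the distance measure and the exponential decay below are illustrative assumptions, not part of the cited models:

```python
import numpy as np

def generic_echo(exemplars, ages, stimulus, decay=0.1):
    """Toy exemplar model: the production target is the
    activation-weighted mean of stored traces (a 'generic echo')."""
    exemplars = np.asarray(exemplars, dtype=float)
    ages = np.asarray(ages, dtype=float)
    stimulus = np.asarray(stimulus, dtype=float)
    # Similarity: traces close to the stimulus are activated more strongly.
    dist = np.linalg.norm(exemplars - stimulus, axis=1)
    similarity = np.exp(-dist)
    # Recency: older traces are less vivid (exponential decay).
    recency = np.exp(-decay * ages)
    weights = similarity * recency
    weights /= weights.sum()
    # Weighted mean of the activated set = production target.
    return weights @ exemplars

# Two clusters of one-dimensional traces (e.g. F0 values in Hz); the
# stimulus activates the nearby, recent cluster much more strongly.
traces = [[200.0], [205.0], [300.0], [310.0]]
echo = generic_echo(traces, ages=[1, 2, 1, 50], stimulus=[210.0])
```

With these numbers the echo lands close to 205 Hz, because the second trace is both similar to the stimulus and recent; the remote cluster contributes almost nothing.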
According to CAT (see Chapter 2.1), accommodation is thus a highly speaker- and context-dependent process. Cooperative and social motivations, such as seeking the approval of the conversational partner [SG82] or keeping the conversation unproblematic and smooth [GGJ+95], lead to more similar (speech) behaviour of the conversational partners. The usage-based account of exemplar theory provides an explanation for the perception and production of imitative phenomena: the perception of familiar voices and more frequent words leads to more strongly activated traces in memory, and these sets of traces can then be used for production [Gol97].
The following chapter describes the parameters of voice investigated in this thesis. For this purpose, the Praat Voice Report was used on samples extracted from the GeCo corpus of spontaneous German conversations. The leading hypothesis of the study was that conversational partners exhibit changes in voice parameters over the course of a dialogue: the speakers were expected to react to the conversational partner and to situational aspects, and these reactions can be mirrored in voice quality. The study was supposed to show on which voice parameters (parameters of the fundamental frequency, jitter, shimmer, harmonics-to-noise ratio) such changes occur.
4 Method
In Chapter 4.1 the Praat Voice Report and the voice parameters of fundamental
frequency, perturbation and harmonicity are described. Further, in Chapter
4.2, the GeCo corpus and the analysis of the corpus are presented.
4.1 The Praat Voice Report
The voice analyses in this thesis were done with Praat [BW14]. The first step was to extract samples from each speaker of the dialogues in which speaker F was involved. For this purpose, a minimum of twenty samples per speaker was manually extracted from both the first and the last five minutes of each dialogue. The duration of the samples depended on the duration of the utterance and ranged from 1.401 to 6.997 seconds; most samples were between 3 and 4 seconds long.
Afterwards, a Praat Voice Report was produced for every single sample. In order to compare the individual samples with each other, the pitch settings were kept uniform: the pitch range was set to 100-450 Hz, a typical range for female voices. Additionally, the analysis method was set to cross-correlation to avoid measurement errors [May13, p. 142]. The Praat Voice Report offers 26 different (calculated) voice parameters in the areas of pitch, pulses, voicing, jitter, shimmer and harmonicity (of the voiced parts). Seven of them were included in the voice analysis of this thesis; they are described below. The values for pulses and voicing were discarded because the analysis was performed on spontaneous speech, which includes phonation breaks, i.e. stretches in which voiceless sounds were produced or the speaker paused briefly while speaking.
Fundamental frequency (F0)
When sounds in human speech are voiced, the vocal folds in the larynx vibrate. These vibrations can be described as a complex quasi-periodic wave [Joh03]. The fundamental frequency (F0), measured in Hertz (Hz), is the number of repetitions of this complex wave per second; it is the first and lowest frequency in the signal. The wave is called quasi-periodic because, in the strict mathematical sense, it is not perfectly periodic: the individual periods are not exactly identical. Nevertheless, the wave is periodic enough for the perception of a clear sound and the identification of the fundamental frequency [May13, p. 144]. Perceptually, the fundamental frequency correlates with the pitch of the voice. The mean fundamental frequency (F0 Mean) was included in the analysis.
The standard deviation of the fundamental frequency (F0 SD) is the second value taken from the analysis of the fundamental frequency; it shows how much the fundamental frequency deviates from the mean (F0 Mean). The minimum (F0 Min) and the maximum (F0 Max), also included in the present analyses, describe the speaker's voice range. A small range stands for little variation, i.e. monotonous speech, and a large range represents more variation, i.e. lively speech. The standard deviation is linked to the minimum and the maximum.
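Given a frame-wise pitch contour, the four F0 values are plain descriptive statistics over the voiced frames. A minimal sketch, assuming the contour is already available as an array with unvoiced frames marked as NaN (the Praat Voice Report computes these values internally; this fragment only illustrates the definitions):

```python
import numpy as np

def f0_statistics(pitch_track):
    """F0 Mean, SD, Min and Max over the voiced frames of a
    frame-wise pitch contour (unvoiced frames are NaN)."""
    f0 = np.asarray(pitch_track, dtype=float)
    voiced = f0[~np.isnan(f0)]
    return {
        "F0 Mean": voiced.mean(),
        "F0 SD": voiced.std(ddof=1),   # sample standard deviation
        "F0 Min": voiced.min(),
        "F0 Max": voiced.max(),
    }

# A short illustrative contour in Hz; the NaN frame is unvoiced.
track = [210.0, 215.0, float("nan"), 205.0, 230.0]
stats = f0_statistics(track)
```

The range F0 Max minus F0 Min and the standard deviation then quantify the "monotonous versus lively" distinction described above.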
Perturbation
Due to the quasi-periodicity of the sound waves, individual periods commonly deviate in duration, frequency and amplitude from their neighbouring periods (perturbations). Deviations in the fundamental frequency are called jitter and deviations in the amplitude are called shimmer; both are given in percent (%). To a certain extent, jitter and shimmer are normal in the human voice. They can also be affected by influences on the vocal folds (e.g. smoking affects shimmer values [Bra94]). Very high values, as observed in pathological voices, lead to the impression of a breathy, rough or hoarse voice [FHE07].
Figure 4.1: Shimmer: perturbation of amplitude; jitter: perturbation of frequency [May13, p. 144] (modified).

Different algorithms exist in Praat and other programs to calculate values for jitter and shimmer. In general, the smaller the values for jitter and shimmer, the better [May13, p. 145]. The Multi-Dimensional Voice Program (MDVP) measures jitter and shimmer with the same algorithms as Praat, except for differences in the preceding step, the detection of periods in the acoustic signal: Praat uses waveform matching, while MDVP uses peak picking [Boe09]. This difference leads to dissimilar results. MDVP proposes a jitter threshold (Jitt) of 1.04 % for the classification of healthy and pathological voices. This threshold is not valid for calculations done with Praat; here the value for healthy voices (Jitter (local)) probably has to be lower than 1.0 % [May13, p. 145].1
The MDVP threshold value for shimmer (Shim) of 3.81 % can also be cautiously applied to the Praat value Shimmer (local) [May13, p. 156]; Nawka et al. propose a value under 2.5 % [NFG06, p. 18]. Other factors that may strongly affect jitter and shimmer values are the recording hardware (e.g. the microphone) and the sampling rate of the signal, both of which can influence the calculation of the acoustic parameters [May13, p. 140].
These thresholds usually apply to clear, well-articulated speech, e.g. a vowel held for a few seconds. Since the participants of the experiment produced spontaneous speech, the jitter and shimmer values are not expected to stay below the thresholds for pathological voices, but to exceed them.
The Praat Voice Report offers five different values for jitter (Jitter (local),
Jitter (local, absolute), Jitter (rap), Jitter (ppq5), Jitter (ddp)) and six different
values for shimmer (Shimmer (local), Shimmer (local, dB), Shimmer (apq3),
Shimmer (apq5), Shimmer (apq11), Shimmer (dda)). The difference between
1 Mayer proposes values for healthy voices between 0.5 and 1 % [May13, p. 145]; Nawka et al. propose values between 0.1 and 1 % [NFG06, p. 18].
these values stems from their calculation with different algorithms. It is unclear which value describes the phenomenon best. For this thesis the values of Jitter (local) and Shimmer (local) were chosen. Jitter (local) is the average absolute difference between consecutive periods, divided by the average period [Boe11]; Shimmer (local) is the average absolute difference between the amplitudes of consecutive periods, divided by the average amplitude [Boe03].
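The two chosen measures can be computed directly from the sequence of period durations and peak amplitudes. A minimal sketch of the definitions just quoted (the period-detection step, which Praat performs by waveform matching, is assumed to have been done already; function names are illustrative):

```python
import numpy as np

def jitter_local(periods):
    """Average absolute difference between consecutive periods,
    divided by the average period, in percent."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_local(amplitudes):
    """Average absolute difference between the amplitudes of
    consecutive periods, divided by the average amplitude, in percent."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Period durations in seconds (~200 Hz voice) and peak amplitudes.
periods = [0.005, 0.0051, 0.0049, 0.005]
amps = [0.80, 0.78, 0.82, 0.80]
# jitter_local(periods) is about 2.67 %, shimmer_local(amps) about 3.33 %;
# a perfectly periodic signal would yield 0 % for both.
```

Note that these toy values already exceed the 1.0 % jitter threshold discussed above, as expected for anything but clean sustained phonation.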
Harmonicity
Harmonicity is defined as the proportion of harmonic and non-harmonic (noisy) parts of a signal [May13, p. 145]. Based on the assumption that an acoustic signal can be divided into harmonic and non-harmonic parts, the noise-to-harmonics ratio (NHR, the proportion of non-harmonics to harmonics) and the harmonics-to-noise ratio (HNR, the proportion of harmonics to non-harmonics) can be calculated. A larger non-harmonic share is often perceived as aspiration and hoarseness [YSO84, YWB82].
In addition to NHR and HNR, the Praat Voice Report gives the value of the mean autocorrelation, which measures the similarity between neighbouring periods and returns the probability of their agreement [May13, p. 145]. In this thesis the value of HNR (in decibels, dB) is used, i.e. the degree of periodicity. The threshold value for the classification of healthy and pathological voices lies at about 20 dB, with pathological voices falling below it [MS08, p. 26]. A disadvantage of the HNR measure is its dependence on minimal perturbations of frequency and amplitude, i.e. jitter and shimmer [Mur99, FMK98].2
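HNR in dB can be related to the normalized autocorrelation r at the pitch period, i.e. the fraction of the signal's energy that is periodic: HNR = 10 log10(r / (1 - r)). Assuming this relation (it underlies Praat's autocorrelation-based harmonicity), a minimal sketch:

```python
import math

def hnr_db(r):
    """Convert the normalized autocorrelation r at the pitch period
    (the periodic fraction of the energy) into HNR in dB."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return 10.0 * math.log10(r / (1.0 - r))

# r = 0.99 (99 % of the energy periodic) gives just under 20 dB,
# i.e. around the healthy/pathological threshold cited above;
# r = 0.5 (equal harmonic and noise energy) gives 0 dB.
```

This also makes the dependency on jitter and shimmer plausible: any perturbation lowers the autocorrelation r and therefore the HNR.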
4.2 Corpus analysis
The German conversations corpus (GeCo) was recorded by Schweitzer and Lewandowski at the Institute for Natural Language Processing of the University of Stuttgart in 2013 [SL13, SLD14]. Previously unacquainted women were asked to have a conversation with each other about topics of their choosing. All of them were native speakers of German. The dialogues were recorded for about 25 minutes in a sound-attenuated room under two different
2 Yumoto et al. suggested that jitter contributes to the magnitude of the noise components in the harmonics-to-noise ratio (HNR) [YSO84].
conditions: a unimodal and a multimodal condition. In the unimodal condition the participants could not see each other, but they could listen and talk to their conversational partner via headset microphones. In the second round, the multimodal condition, the participants could see each other through a transparent screen. The analyses for this thesis were done on the multimodal condition, since multimodality is known to increase convergence even in the speech modality [DR10, SL13, Bab12]. Each of the eight participants had a total of six dialogues with other participants, so that a total of 24 dialogues was produced. For the analyses of this thesis the dialogues between speaker F and her conversational partners were chosen (six dialogues, see Figure 4.2).
Figure 4.2: Dialogues of speaker F with six different conversational partners.
After the recordings the participants filled in a questionnaire in which they rated their conversational partners [SL13]. They made statements about their impression of the partner's social attractiveness (likeable, kind, social, relaxed) and her competence (intelligent, competent, successful, self-confident) on a 5-point Likert scale. These ratings were then transformed to values between -2 and 2 and afterwards added up to a composite overall likeability score with a range from -8 to 8. The values for social attractiveness and competence are included in the discussion of the results (Chapter 6.2.1).
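Assuming the transformation simply centres each 5-point rating at the scale midpoint (an assumption about the scoring, not the original script), the composite score can be sketched as follows:

```python
def composite_score(ratings):
    """Centre each 5-point Likert rating (1-5) at the midpoint 3,
    giving values from -2 to 2, and sum them. Four items then
    yield a composite between -8 and 8."""
    assert all(1 <= r <= 5 for r in ratings)
    return sum(r - 3 for r in ratings)

score = composite_score([4, 4, 5, 3])  # -> 4
```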
For the analysis, samples of each speaker were manually extracted from each of the chosen dialogues (every dialogue with speaker F): a minimum of twenty samples from the first five minutes of each dialogue as well as from the last five minutes. This was done in order to compare the voice quality at the beginning and at the end of the dialogue. Samples
which contained laughter, breathing or strongly glottalized speech were discarded. Afterwards a Praat Voice Report was generated for each sample. The data was then entered into IBM SPSS Statistics 21 [Inc12].
In order to detect outliers, boxplots were generated (see Figure 4.3). Outliers are defined as values that fall between 1.5 and 3 box lengths from the upper or lower hinge of the box; they are marked by a circle (◦). Extreme outliers are values that lie more than 3 box lengths away from either hinge and are marked by a star (∗) [JL13, p. 242]. Table 4.1 shows the number of outliers and extreme outliers found in the beginnings and ends for the individual speakers.
Figure 4.3: The boxplot represents the values of the minimum fundamental frequency from the beginning of the dialogue of speaker F with speaker J. It includes sample 2 with the measurement error (see Figure 6.1), which is classified as an extreme outlier.
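The boxplot criterion corresponds to Tukey's fences on the interquartile range. A sketch of the classification (note that SPSS computes the hinges slightly differently from plain quartiles, so counts may differ at the margins):

```python
import numpy as np

def classify_outliers(values):
    """Label each value as 'normal', 'outlier' (between 1.5 and 3 box
    lengths outside a hinge) or 'extreme' (more than 3 box lengths)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1  # one "box length"
    labels = []
    for v in x:
        if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr:
            labels.append("normal")
        elif q1 - 3.0 * iqr <= v <= q3 + 3.0 * iqr:
            labels.append("outlier")
        else:
            labels.append("extreme")
    return labels

# The last two values fall outside the inner and outer fences.
labels = classify_outliers([1, 2, 3, 4, 5, 6, 7, 8, 20, 35])
```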
Extreme outliers were generally discarded from the analyses. Outliers were sorted out separately for the sample collection of each speaker in the individual dialogues (about 20 samples per speaker for beginning/end) and for the overall sample collection of speaker F (about 130 samples for beginning/end). This means that some samples discarded for speaker F in an individual dialogue are included in her overall sample collection across all dialogues. The reason for discarding outliers separately was that speaker F might show an abnormal value for one of the parameters in a single dialogue even though this value matches the values she produced across all dialogues; treating the collections separately should thus yield more accurate analyses. An overview of the outliers is given in Table 4.1 and Table 4.2; detailed tables are shown in Appendix A (Outliers - Individual speakers) and Appendix B (Outliers - Speaker F).
Voice parameter       F0 Mean  F0 SD  F0 Min  F0 Max  Jitter  Shimmer  HNR
Outliers (◦)             15       8     17      20      21      17      11
Extreme outliers (∗)      5       1     11       1       3       1       2
Total                    20       9     28      21      24      18      13

Table 4.1: Number of outliers (identified by boxplots) in the sample collections from the beginnings and ends of each individual speaker. A detailed table can be seen in Appendix A.
Voice parameter       F0 Mean  F0 SD  F0 Min  F0 Max  Jitter  Shimmer  HNR
Outliers (◦)              6       5      0      11       8       2       3
Extreme outliers (∗)      0       0      0       4       1       0       0
Total                     6       5      0      15       9       2       3

Table 4.2: Number of outliers (identified by boxplots) in the sample collections from the beginnings and ends of speaker F. Samples were taken from all six dialogues in which speaker F was involved. A detailed table can be seen in Appendix B.
5 Results
Three statistical analyses were done with IBM SPSS Statistics 21 [Inc12]. Differences were calculated with paired t-tests and ANOVAs (analyses of variance); p-values at or below 0.05 are considered significant. The goals of the analyses were the following:
• Analysis 1: Individual speakers
Samples from the beginning and the end for each speaker in each dialogue were compared with each other using paired t-tests. Parameters with significant differences were assumed to have changed over the course of the dialogue (in accommodation to the conversational partner). The focus of this analysis was on the individual speakers in the different dialogues; the goal was to find out which voice parameters changed in each dialogue for each speaker.
• Analysis 2: Speaker F
All samples that speaker F uttered in the beginnings of the six dialogues
were compared with all samples from the ends with the help of paired t-
tests. The goal of this analysis was to identify those parameters speaker
F changed in all dialogues she had with different conversational partners.
• Analysis 3: Position in dialogue
Samples from the conversational partners in the individual dialogues were compared between the beginning and the end of the conversation using ANOVAs. It was expected that some parameters changed between the beginning and the end of the dialogue as a result of accommodation.
In the following chapters (Chapters 5.1-5.3) the results of the analyses are
described.
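The two test types can be sketched with scipy.stats; the arrays below are illustrative values, not the thesis measurements:

```python
import numpy as np
from scipy import stats

# Paired t-test: the same speaker's samples at the beginning vs. the
# end of a dialogue (Analyses 1 and 2); illustrative F0 Mean values.
f0_begin = np.array([225.1, 231.4, 219.8, 228.3, 222.7, 230.2])
f0_end   = np.array([218.9, 224.5, 215.2, 221.8, 217.3, 223.6])
t, p_paired = stats.ttest_rel(f0_begin, f0_end)

# One-way ANOVA: a voice parameter as dependent variable and the
# position in the dialogue as fixed factor (Analysis 3).
f_stat, p_anova = stats.f_oneway(f0_begin, f0_end)

# Values at or below alpha = 0.05 are treated as significant.
significant = p_paired <= 0.05
```

With two groups, the one-way ANOVA is equivalent to an unpaired t-test; the paired test exploits the within-speaker pairing and is therefore more sensitive to a consistent shift, which is why it is used for the speaker-internal comparisons.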
5.1 Analysis 1 - Individual speakers
In this section, samples from the beginnings and ends of the dialogues are compared for every individual speaker within the dialogues. Paired t-tests were used to show which parameters changed for the individual speakers during the dialogues. In the following, the dialogue between speakers F and D is analysed; the results of the other five dialogues are summarized in Table 5.6.
Speaker D
Table 5.1 shows the descriptive statistics for the mean values of speaker D in
the dialogue with F. In Table 5.2 the results of the paired t-tests for speaker D
at the beginning and the end of the dialogue with speaker F are presented.
Parameter        Position   Mean       N   SD         St. Error Mean
Pair 1 F0 Mean   beginning  231.44841  22  19.494032   4.156142
                 end        232.93950  22  12.807842   2.730641
Pair 2 F0 SD     beginning   20.93773  22   6.026298   1.284811
                 end         28.94427  22  12.411122   2.646060
Pair 3 F0 Min    beginning  182.06523  22  18.481850   3.940344
                 end        154.76977  22  50.631920  10.794762
Pair 4 F0 Max    beginning  313.69659  22  31.624643   6.742397
                 end        330.85964  22  31.889633   6.798893
Pair 5 Jitter    beginning    1.83905  22   0.375192   0.079991
                 end          1.85718  22   0.377235   0.080427
Pair 6 Shimmer   beginning    7.05686  22   1.272993   0.271403
                 end          6.75695  22   0.938097   0.200003
Pair 7 HNR       beginning   18.75591  22   1.531032   0.326417
                 end         18.39091  22   1.756843   0.374560

Table 5.1: Mean, standard deviation (SD) and standard error of the mean for the comparison of the samples of speaker D from the beginning and the end of the dialogue with speaker F.
The results show that speaker D changed in the variability of F0 (see Table 5.2), as significant changes were found for F0 SD (0.021), F0 Min (0.029) and F0 Max (0.049). The mean values (see Table 5.1) show that F0 Min decreased between beginning and end (from 182.07 Hz to 154.77 Hz) and F0 Max increased (from 313.70 Hz to 330.86 Hz). As a consequence, the value of F0 SD increased (from 20.94 Hz to 28.94 Hz).
                 Mean        SD         Std. Error Mean   t       df  Sign. (2-tailed)
Pair 1 F0 Mean   -1.491091   18.110810   3.861238         -0.386  21  0.703
Pair 2 F0 SD     -8.006545   15.100058   3.219343         -2.487  21  0.021
Pair 3 F0 Min    27.295455   54.519820  11.623665          2.348  21  0.029
Pair 4 F0 Max   -17.163045   38.508415   8.210022         -2.090  21  0.049
Pair 5 Jitter    -0.018136    0.422668   0.090113         -0.201  21  0.842
Pair 6 Shimmer    0.299909    1.602436   0.341641          0.878  21  0.390
Pair 7 HNR        0.365000    2.283732   0.486893          0.750  21  0.462

Table 5.2: Results of the paired t-tests, comparing samples of speaker D from the beginning and the end. Significant changes were found for F0 SD, F0 Min and F0 Max.
Speaker F
The results for speaker F, comparing samples from the beginning and the end of the conversation with speaker D, are shown in Table 5.4. As for speaker D, the values of F0 SD (0.000) and F0 Min (0.001) proved to be highly significant. The mean values in Table 5.3 show that F0 Min increased from the beginning to the end (from 97.03 Hz to 133.15 Hz) and, as a consequence, F0 SD decreased (from 39.26 Hz to 22.53 Hz).
Parameter        Position   Mean       N   SD         St. Error Mean
Pair 1 F0 Mean   beginning  217.70765  20  19.166114   4.285673
                 end        210.76340  20  21.706531   4.853728
Pair 2 F0 SD     beginning   39.25525  20  11.712071   2.618899
                 end         22.53040  20   8.312417   1.858713
Pair 3 F0 Min    beginning   97.02540  20   6.125833   1.369778
                 end        133.15010  20  37.985685   8.493857
Pair 4 F0 Max    beginning  321.69335  20  60.458571  13.518947
                 end        295.60210  20  61.244963  13.694790
Pair 5 Jitter    beginning    1.84015  20   0.447757   0.100121
                 end          1.94725  20   0.447375   0.100036
Pair 6 Shimmer   beginning    7.18625  20   1.548080   0.346161
                 end          6.81760  20   1.302867   0.291330
Pair 7 HNR       beginning   20.02490  20   1.841146   0.411693
                 end         20.26540  20   1.821685   0.407341

Table 5.3: Mean, standard deviation (SD) and standard error of the mean for the comparison of the samples of speaker F from the beginning and the end of the dialogue with speaker D.
                 Mean         SD         Std. Error Mean   t       df  Sign. (2-tailed)
Pair 1 F0 Mean     6.944250   22.846937   5.108730          1.359  19  0.190
Pair 2 F0 SD      16.724850   13.373057   2.990306          5.593  19  0.000
Pair 3 F0 Min    -36.124700   41.604984   9.303157         -3.883  19  0.001
Pair 4 F0 Max     26.091250   69.397917  15.517846          1.681  19  0.109
Pair 5 Jitter     -0.107100    0.600917   0.134369         -0.797  19  0.435
Pair 6 Shimmer     0.368650    2.094771   0.468405          0.787  19  0.441
Pair 7 HNR        -0.240500    2.701613   0.604099         -0.398  19  0.695

Table 5.4: Results of the paired t-tests, comparing samples of speaker F from the beginning and the end within the dialogue with speaker D. Significant changes of F0 SD and F0 Min were found.
Table 5.5 summarizes the results of the paired t-tests comparing beginning and end samples for speaker D and speaker F. Both exhibited significant changes in F0 values, namely F0 SD and F0 Min; speaker D additionally showed significant changes in F0 Max. While F0 variation increased for speaker D, it decreased for speaker F.
Speaker  F0 Mean  F0 SD  F0 Min  F0 Max  Jitter  Shimmer  HNR
D                   ∗      ∗       ∗
F                   ∗      ∗

Table 5.5: Significant changes of voice correlates for the speakers D and F in their dialogue. Significant values are marked by ∗.
Summary of the results
Table 5.6 shows a summary of the results of the paired t-tests for all dialogues.
Samples from the beginning and the end for each speaker within the single
dialogues were compared. Detailed results of the analyses can be found in
Appendix C.
Speaker  F0 Mean  F0 SD  F0 Min  F0 Max  Jitter  Shimmer  HNR
A                                                  (∗)      ∗
F                   ∗      ∗       ∗
C
F          (∗)             ∗
D                   ∗      ∗       ∗
F                   ∗      ∗
H
F                   ∗
J          (∗)     (∗)     ∗       ∗
F                          (∗)
K
F                   ∗                                       ∗

Table 5.6: Significant changes of voice correlates for each speaker within the single dialogues (each pair of rows is one dialogue with speaker F). Significant values are marked by ∗, tendencies are marked by (∗).
In the dialogue between speakers A and F, HNR proved significant (0.026) for speaker A: from the beginning to the end of the dialogue it increased from 15.71 dB to 16.79 dB. Shimmer exhibited a tendency towards significance (0.066); it decreased from 9.50 % to 8.71 %. Speaker F had significant changes in the values of F0, namely F0 SD (0.004), F0 Min (0.020) and F0 Max (0.007): as F0 Min increased (from 129.02 Hz to 150.67 Hz) and F0 Max decreased (from 299.24 Hz to 266.68 Hz), F0 SD decreased (from 27.24 Hz to 16.95 Hz). The conversational partners did not exhibit changes on the same parameters. Detailed results of the paired t-tests can be seen in Appendix C.1.
Speaker C showed no significant changes at all in the dialogue with speaker F. Speaker F changed on values of F0: F0 Min changed significantly (0.004), decreasing from 131.65 Hz to 101.66 Hz, and F0 Mean showed a tendency towards significance (0.061), also decreasing (from 203.54 Hz to 194.65 Hz). Detailed results are presented in Appendix C.2.
In the dialogue of speakers H and F, only speaker F changed significantly, on the parameter F0 SD (0.037), which decreased (from 15.51 Hz to 14.28 Hz). Detailed results of the paired t-tests are shown in Appendix C.3.
Speaker J, in the dialogue with F, exhibited changes on the values of F0. F0 Mean (0.054) and F0 SD (0.053) showed tendencies towards significance, while F0 Min (0.007) and F0 Max (0.032) proved significant. As F0 Min increased (from 121.85 Hz to 147.47 Hz) and F0 Max decreased (from 352.53 Hz to 324.47 Hz), F0 SD decreased (from 35.58 Hz to 29.16 Hz); F0 Mean also decreased (from 229.59 Hz to 219.89 Hz). Speaker F, in the same dialogue, exhibited a tendency towards a significant change in F0 Min (0.056), which decreased (from 163.30 Hz to 141.98 Hz). Appendix C.4 shows detailed results for the values of speakers J and F.
In the dialogue between speakers K and F, only speaker F changed significantly, on F0 SD (0.014) and HNR (0.044): F0 SD decreased (from 24.11 Hz to 18.62 Hz) and HNR increased (from 19.18 dB to 20.46 dB). Detailed results are presented in Appendix C.5.
The results of the analyses of the single speakers within the dialogues, summarized in Table 5.6, show an overall tendency for F0 parameters to change significantly, especially F0 SD and F0 Min (each was significant five times and showed a tendency towards significance once). There were also significant changes in F0 Max (three times) and HNR (two times). F0 Mean and Shimmer showed a few tendencies towards significance (F0 Mean twice, Shimmer once), whereas jitter was never significant. Conversational partners did not change on the same voice parameters (e.g. in the dialogue of speakers A and F). In addition, speaker F, involved in all six dialogues, mainly changed on F0, but did not change in every dialogue.
5.2 Analysis 2 - Speaker F
The second analysis was also done with paired t-tests. All beginning and end samples of speaker F from all dialogues were taken (137 samples for the beginnings and 129 for the ends, after discarding extreme outliers). The descriptive statistics in Table 5.7 show the mean values of each voice parameter for the beginnings and ends of all six dialogues in which speaker F was involved. The results of the paired t-tests can be seen in Table 5.8 and a summary of the results in Table 5.9. Calculating the differences between the voice parameters at the beginning and the end showed on which voice parameters speaker F changed.
Parameter        Position   Mean       N    SD         St. Error Mean
Pair 1 F0 Mean   beginning  209.54150  129  17.539920  1.544300
                 end        202.31087  129  17.387400  1.530875
Pair 2 F0 SD     beginning   26.48720  129  12.521970  1.102500
                 end         19.70573  129   8.460571  0.744912
Pair 3 F0 Min    beginning  133.61740  129  34.144490  3.006250
                 end        140.74980  129  33.644801  2.962260
Pair 4 F0 Max    beginning  293.98020  125  45.595800  4.078210
                 end        267.53874  125  27.025453  2.417230
Pair 5 Jitter    beginning    2.07380  128   0.516610  0.045660
                 end          2.04703  128   0.547375  0.048382
Pair 6 Shimmer   beginning    7.49310  129   1.512790  0.133190
                 end          7.29301  129   1.617199  0.142386
Pair 7 HNR       beginning   19.34510  129   2.173080  0.191330
                 end         19.77159  129   2.184534  0.192338

Table 5.7: Mean, standard deviation (SD) and standard error of the mean for the comparison of the samples of speaker F from the beginnings and the ends of the dialogues. All values from all six dialogues were taken.
                 Mean        SD         Std. Error Mean   t       df   Sign. (2-tailed)
Pair 1 F0 Mean    7.230597   23.929435  2.106869           3.432  128  0.001
Pair 2 F0 SD      6.781465   13.449280  1.184143           5.727  128  0.000
Pair 3 F0 Min    -7.132411   47.969186  4.223451          -1.689  128  0.094
Pair 4 F0 Max    26.441448   49.541430  4.431120           5.967  124  0.000
Pair 5 Jitter     0.026789    0.735625  0.065021           0.412  127  0.681
Pair 6 Shimmer    0.200124    2.199984  0.193698           1.033  128  0.303
Pair 7 HNR       -0.426535    3.063233  0.269703          -1.582  128  0.116

Table 5.8: Results of the paired t-tests, comparing parameters from samples from the beginning and end of speaker F of all dialogues. F0 Mean, F0 SD and F0 Max proved significant.
Speaker  F0 Mean  F0 SD  F0 Min  F0 Max  Jitter  Shimmer  HNR
F        ∗        ∗              ∗
Table 5.9: Significant changes of voice correlates for speaker F. Significant values are marked by ∗.
The overall analyses for speaker F showed that she changed significantly on
F0, namely F0 Mean (0.001), F0 SD (0.000) and F0 Max (0.000) (all values
decreased). The parameters of F0 Min, Jitter, Shimmer and HNR did not
prove to be significant.
5.3 Analysis 3 - Position in dialogue
The following analysis examines the influence of the position in the dialogue
(beginning vs. end). For this purpose, one-way ANOVAs (analyses of variance)
were conducted. For each voice parameter, a separate analysis was run in which
the voice parameter was the dependent variable and the position (beginning or
end of the dialogue) the fixed factor. Detailed results are shown for the dialogue
between the speakers C and F; the remaining results are then presented in
summary form. Detailed values of the ANOVAs are given in Appendix D.
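This design can be sketched by computing the one-way ANOVA F statistic by hand; the thesis analyses were run in SPSS, and the samples below are made-up placeholders with position as the only factor:

```python
import numpy as np

# Placeholder values of one voice parameter (e.g. F0 Mean in Hz),
# grouped by the fixed factor "position"; NOT the thesis data.
beginning = np.array([216.7, 210.2, 204.5, 218.9, 209.3])
end       = np.array([205.1, 199.8, 212.4, 201.7, 203.6])
groups = [beginning, end]

grand_mean = np.concatenate(groups).mean()

# Partition the variance: between-groups vs. within-groups sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1                       # factor levels minus 1
df_within = sum(len(g) for g in groups) - len(groups)

F = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {F:.3f}")
```

With a two-level factor such as position, the resulting F equals the square of the corresponding independent-samples t statistic, and the significance value is read from the F(df_between, df_within) distribution, as in the SPSS output reported in the tables of this section.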
Dialogue C-F
Table 5.10 shows the mean values (Mean), the standard deviation (SD) and the
Standard Error Mean (St. Error Mean) for the mean fundamental frequency
measurements of the dialogue of the speakers C and F.
F0 Mean
Position   Speaker  Mean       N   SD         Std. Error Mean
beginning  C        216.67610  29  16.455194  3.055653
beginning  F        203.86271  21  15.634375  3.411700
ending     C        218.16552  29  15.574690  2.892147
ending     F        194.73200  20  9.603569   2.147423
Table 5.10: Mean, Standard deviation (SD) and Standard Error Mean for the mean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers C and F at the beginning and at the end of the dialogue.
The mean values of F0 Mean for both speakers, speaker C and speaker F, did
not vary much between the beginning and the end of their dialogue. Speaker C
showed an F0 Mean of 216.68 Hz at the beginning and a value of 218.17 Hz at
the end, which is a minimal change of 1.49 Hz. Speaker F had an F0 Mean
of 203.86 Hz at the beginning and a value of 194.73 Hz at the end of the
dialogue; the difference was 9.13 Hz.
Source     Type III Sum of Squares  df  Mean Square  F          Sig.
Intercept  4321380.850              1   4321380.850  14043.360  0.000
Position   189.073                  1   189.073      0.614      0.435
Error      29540.834                96  307.717
Table 5.11: Results of the ANOVA for F0 Mean for the speakers C and F at the beginning and at the end of the dialogue.
The results of the ANOVA in Table 5.11 show that the values for F0 Mean
did not change significantly between the beginning and the end of the dialogue
(0.435).
F0 SD
Table 5.12 shows the mean values (Mean), the standard deviation (SD) and the
Standard Error Mean (St. Error Mean) for the standard deviation of the
fundamental frequency (F0 SD).
Position   Speaker  Mean      N   SD         Std. Error Mean
beginning  C        33.42803  29  11.400413  2.117004
beginning  F        21.17795  22  7.457711   1.589989
ending     C        32.05234  29  10.207541  1.895493
ending     F        21.98543  21  8.624827   1.882092
Table 5.12: Mean, Standard deviation (SD) and Standard Error Mean for the standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers C and F at the beginning and at the end of the dialogue.
Speaker C showed a mean value for F0 SD of 33.43 Hz at the beginning of
the dialogue and a value of 32.05 Hz at the end; the difference of 1.38 Hz was
minimal. Speaker F showed a value of 21.18 Hz at the beginning and a value
of 21.99 Hz at the end of the dialogue; here, too, the difference of 0.81 Hz was
minimal.
Source     Type III Sum of Squares  df  Mean Square  F        Sig.
Intercept  78248.202                1   78248.202    622.294  0.000
Position   2.209                    1   2.209        0.018    0.895
Error      12322.672                98  125.742
Table 5.13: Results of the ANOVA for F0 SD for the speakers C and F at the beginning and at the end of the dialogue.
Table 5.13 shows the results of the ANOVA for the dialogue of C and F.
No significant changes were found for F0 SD between the beginning and the end
of the dialogue (0.895).
F0 Min
Table 5.14 shows the mean values (Mean), the standard deviation (SD) and
the Standard Error Mean (Std. Error Mean) for F0 Min of the speakers C and
F at the beginning and end of their dialogue.
Position   Speaker  Mean       N   SD         Std. Error Mean
beginning  C        125.65093  29  30.399702  5.645083
beginning  F        129.93255  22  37.79275   8.057442
ending     C        113.03931  26  27.499529  5.393101
ending     F        101.65800  18  15.192292  3.580858
Table 5.14: Mean, Standard deviation (SD) and Standard Error Mean for the minimal fundamental frequency (F0 Min) in Hertz (Hz) of the speakers C and F at the beginning and end of the dialogue.
At the beginning of the dialogue, speaker C exhibited a mean value of
125.65 Hz and at the end a value of 113.04 Hz; the value decreased by
12.61 Hz. Speaker F showed a value of 129.93 Hz at the beginning and a value
of 101.66 Hz at the end of the dialogue, a decrease of 28.27 Hz.
Source     Type III Sum of Squares  df  Mean Square  F         Sig.
Intercept  1280665.126              1   1280665.126  1647.763  0.000
Position   11479.017                1   11479.017    14.769    0.000
Error      71503.743                92  777.215
Table 5.15: Results of the ANOVA for F0 Min for the speakers C and F at the beginning and at the end of the dialogue.
Table 5.15 shows the results of the ANOVA. The parameter F0 Min has
proven to be highly significant in the dialogue of the speakers C and F
(0.000).
F0 Max
Table 5.16 shows the mean values (Mean), the standard deviation (SD) and the
Standard Error Mean (Std. Error Mean) for F0 Max.
Position   Speaker  Mean       N   SD         Std. Error Mean
beginning  C        321.76286  28  37.976402  7.176865
beginning  F        285.37145  22  40.874082  8.714383
ending     C        319.56734  29  33.092452  6.145114
ending     F        266.61476  21  28.084610  6.128564
Table 5.16: Mean, Standard deviation (SD) and Standard Error Mean for the maximal fundamental frequency (F0 Max) in Hertz (Hz) of the speakers C and F at the beginning and end of the dialogue.
Speaker C’s mean F0 Max decreased minimally, by 2.20 Hz, from 321.76 Hz
at the beginning to 319.57 Hz at the end of the dialogue. Speaker F showed
a value of 285.37 Hz at the beginning of the dialogue and a value of 266.61 Hz
at the end, a decrease of 18.76 Hz.
Source     Type III Sum of Squares  df  Mean Square  F         Sig.
Intercept  9092573.837              1   9092573.837  5213.073  0.000
Position   1773.833                 1   1773.833     1.017     0.316
Error      170930.316               98  1744.187
Table 5.17: Results of the ANOVA for F0 Max for the speakers C and F at the beginning and at the end of the dialogue.
Table 5.17 shows the results of the ANOVA for F0 Max for the speakers C
and F at the different positions in the dialogue. The parameter did not prove
to be significant (0.316).
Jitter
Table 5.18 shows values for jitter at the beginning and end of the dialogue
of the speakers C and F. Values are shown for the mean (Mean), standard
deviation (SD) and standard error mean (Std. Error Mean).
Position   Speaker  Mean     N   SD        Std. Error Mean
beginning  C        2.26893  28  0.419790  0.079333
beginning  F        2.00895  22  0.557339  0.118825
ending     C        2.46824  29  0.813099  0.150989
ending     F        2.21362  21  0.450859  0.098386
Table 5.18: Mean, Standard deviation (SD) and Standard Error Mean for the jitter in percent of the speakers C and F at the beginning and end of the dialogue.
Speakers C and F both exhibited mean jitter values around 2.2 %. Speaker
C showed a value of 2.27 % at the beginning of the dialogue and a small increase
of 0.2 % towards the end, leading to a value of 2.47 %. Speaker F also showed
a small increase of 0.2 % in the mean value, from 2.01 % at the beginning to
2.21 % at the end of the dialogue.
Source     Type III Sum of Squares  df  Mean Square  F         Sig.
Intercept  499.668                  1   499.668      1415.236  0.000
Position   1.300                    1   1.300        3.682     0.058
Error      34.247                   97  0.353
Table 5.19: Results of the ANOVA for jitter for the speakers C and F at the beginning and at the end of the dialogue.
Table 5.19 shows the results of the ANOVA conducted for jitter in the dialogue
of the speakers C and F. The result shows a tendency towards significance
for the changes in jitter (0.058).
Shimmer
Table 5.20 shows the mean values (Mean), the standard deviation (SD) and
the standard error mean (Std. Error Mean) for shimmer for the speakers C
and F at the beginning and end of their dialogue.
Position   Speaker  Mean     N   SD        Std. Error Mean
beginning  C        7.67610  29  1.395318  0.259104
beginning  F        7.69191  22  1.956215  0.417067
ending     C        8.37276  29  2.160529  0.401200
ending     F        7.31910  21  1.224337  0.267172
Table 5.20: Mean, Standard deviation (SD) and Standard Error Mean for the shimmer in percent of the speakers C and F at the beginning and end of the dialogue.
Both speakers had shimmer values between 7 and 8 %. Speaker C showed
an increase of 0.69 %, from 7.68 % at the beginning of the dialogue to 8.37 %
at the end. Speaker F’s shimmer value decreased minimally, by 0.37 %, from
7.69 % to 7.32 %.
Source     Type III Sum of Squares  df  Mean Square  F         Sig.
Intercept  6052.451                 1   6052.451     1965.702  0.000
Position   2.264                    1   2.264        0.735     0.393
Error      301.745                  98  3.079
Table 5.21: Results of the ANOVA for shimmer for the speakers C and F at the beginning and at the end of the dialogue.
Table 5.21 shows the results of the ANOVA for shimmer in the dialogue of the
speakers C and F. Shimmer did not prove to be significant (0.393).
HNR
Table 5.22 shows the mean values (Mean), standard deviation (SD) and stan-
dard error mean (St. Error Mean) of HNR for the speakers C and F at the
beginning and end of their dialogue.
Position   Speaker  Mean      N   SD        Std. Error Mean
beginning  C        19.11282  28  2.104718  0.397754
beginning  F        20.21164  22  2.538209  0.541148
ending     C        18.95059  29  3.150715  0.585073
ending     F        20.03333  21  1.729479  0.377403
Table 5.22: Mean, standard deviation and standard error mean for the harmonics-to-noise ratio (HNR) in decibel (dB) of the speakers C and F at the beginning and end of the dialogue.
Speaker C showed a small decrease of 0.16 dB, from 19.11 dB at the beginning
to 18.95 dB at the end of the dialogue. Speaker F also exhibited a small
decrease of 0.18 dB, from 20.21 dB at the beginning to 20.03 dB at the end
of the dialogue.
Source     Type III Sum of Squares  df  Mean Square  F         Sig.
Intercept  38028.198                1   38028.198    5993.214  0.000
Position   0.912                    1   0.912        0.144     0.705
Error      621.831                  98  6.345
Table 5.23: Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers C and F at the beginning and at the end of the dialogue.
Table 5.23 shows the results of the ANOVA for HNR at the beginning and end
of the dialogue of the speakers C and F. HNR did not prove to be significant
(0.705).
Summary of the results
A summary of the ANOVAs for all dialogues is given in Table 5.24. Detailed
results can be found in Appendix D.
Dialogue partners  F0 Mean  F0 SD  F0 Min  F0 Max  Jitter  Shimmer  HNR
A & F                       ∗
C & F                              ∗               (∗)
D & F
H & F
J & F              (∗)                     ∗
K & F
Table 5.24: Results of the ANOVAs for all dialogues at the beginning and end for the voice parameters F0 Mean, F0 SD, F0 Min, F0 Max, Jitter, Shimmer and HNR. Significant values are marked by ∗, tendencies towards significance by (∗).
In the dialogue between speakers A and F, F0 SD proved to be significant
(0.036). At the beginning of the dialogue, speaker A had a value of 40.96 Hz,
which decreased minimally by 3.41 Hz to 37.54 Hz at the end of the dialogue.
Speaker F showed a strong decrease of 10.36 Hz, from 27.31 Hz at the beginning
to 16.95 Hz at the end of the dialogue. Detailed results of the analyses can be
found in Appendix D.1.
No significant changes of parameters were found for the dialogues of speaker
F with the speakers D, H and K. Appendices D.2, D.3 and D.5 show detailed
results of the analyses.
In the dialogue between the speakers J and F, F0 Mean showed a tendency
towards significance (0.062). Speaker J showed a value of 229.51 Hz at the
beginning of the dialogue and a decrease of 9.62 Hz to 219.89 Hz at the end of
the dialogue. Speaker F showed a small decrease of 1.01 Hz, from 211.08 Hz at
the beginning to 210.07 Hz at the end of the dialogue. The parameter
F0 Max also proved significant (0.041). At the beginning of the dialogue, speaker
J showed a value of 349.95 Hz and a strong decrease of 25.48 Hz to 324.47 Hz
at the end. Speaker F exhibited a small decrease of 3.49 Hz, from 276.54 Hz
at the beginning to 273.05 Hz at the end. Detailed results can be found in
Appendix D.4.
6 Discussion
In the present chapter, the method used (Chapter 6.1) and the results (Chapter
6.2) are discussed. Afterwards, F0 is considered, as most of the significant
changes in the analyses occurred on features of F0 (Chapter 6.3). Since other
voice parameters were found to be significant in other studies, the influence of
temporal aspects is discussed (Chapter 6.4). Finally, the influence of engagement
is examined, as engagement rather than rapport has been presumed to have an
influence on accommodation/convergence (Chapter 6.5).
6.1 Discussion of method
The Praat Voice Report analyses various parameters, including those of F0,
perturbation and harmonicity. For F0, the mean, the standard deviation, the
minimum and the maximum were taken for the analysis, which gives a global
view of the average and variation of the fundamental frequency. Jitter and
shimmer were taken for the analysis of perturbation in the voice, so that
irregularities of the frequency and the amplitude can be determined. From the
different available algorithms, Jitter (local) and Shimmer (local) were chosen.
As a measure of harmonicity, the harmonics-to-noise ratio (HNR) was taken,
which mirrors the degree of acoustic periodicity.
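To illustrate what the local perturbation measures capture, they can be approximated from a sequence of glottal cycle durations and peak amplitudes. This is a simplified sketch of the standard Jitter (local) and Shimmer (local) definitions; Praat additionally extracts the underlying point process from the waveform and applies period-search constraints, which are omitted here:

```python
import numpy as np

def jitter_local(periods):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period (reported in percent)."""
    periods = np.asarray(periods, dtype=float)
    return np.abs(np.diff(periods)).mean() / periods.mean() * 100.0

def shimmer_local(amplitudes):
    """Mean absolute difference between consecutive cycle peak
    amplitudes, divided by the mean amplitude (reported in percent)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.abs(np.diff(amplitudes)).mean() / amplitudes.mean() * 100.0

# Made-up cycle durations (seconds) and peak amplitudes for illustration.
periods = [0.00500, 0.00510, 0.00495, 0.00505, 0.00498]
amps = [0.80, 0.78, 0.82, 0.79, 0.81]
jit = jitter_local(periods)    # about 2.09 %
shim = shimmer_local(amps)     # 3.4375 %
```

HNR, the third measure, relates the energy of the periodic part of the signal to the noise energy (10 · log10 of their ratio, in dB); Praat derives it from the autocorrelation of the signal.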
With the help of boxplots, generated in SPSS [Inc12], extreme outliers were
discarded from the analyses. Notably, most extreme outliers occurred for the
minimum of the fundamental frequency of the individual speakers (see Table 4.1;
for a detailed table see Appendix A). An explanation could be measurement
errors, as happened, for instance, with sample 2 from speaker F (see Figure 6.1).
For this sample the Praat Voice Report gave an F0 Min of 88.68 Hz: due to the
glottalization of some vowels in her utterance, the pitch was calculated
incorrectly, which led to false values that would have negatively influenced
the analysis.
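The discarding step can be sketched with the interquartile-range rule, where k = 3 matches the "extreme" (starred) points in SPSS boxplots. The F0 Min values below are made up to echo the kind of error just described:

```python
import numpy as np

def extreme_outlier_mask(values, k=3.0):
    """Flag values lying more than k * IQR outside the quartiles.
    k = 3.0 corresponds to the 'extreme' (starred) points in SPSS
    boxplots; k = 1.5 would also flag the ordinary outlier circles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up F0 Min sample (Hz) with one measurement error of the kind
# caused by glottalized vowels.
f0_min = np.array([133.2, 135.7, 130.9, 138.4, 132.1, 88.7])
kept = f0_min[~extreme_outlier_mask(f0_min)]   # drops the 88.7 Hz value
```

Since the quartiles themselves are robust against a few aberrant values, such a rule removes gross measurement errors without discarding legitimately low or high readings.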
Figure 6.1: Error of measurement that caused an extreme outlier in F0 Min. Sample 2 from speaker F from the beginning of the dialogue with J. (Utterance: “Ja, nee, ich hab hospitiert, also hinten drin beobachtet und vorbereitet und so, aber man redet ja immer trotzdem ziemlich viel.”)
Figure 6.2: Error of measurement in Praat that caused a wrong calculation of F0 Max (and also F0 Min). Utterance: “Ja, genau, da wurd ich, wo ich umgekippt bin.”
The values of the maximum fundamental frequency should also be regarded
cautiously as errors in measurements can occur. In the boxplots made for
speaker F (for the beginnings and ends of all the dialogues she was involved
in) most outliers occurred for the maximum of the fundamental frequency
(see Table 4.2, for a detailed table see Appendix B). An example of a false
calculation can be seen in Figure 6.2, in which a false maximum fundamental
frequency was calculated (379.18 Hz) due to glottalization.
Therefore, values for the minimum and maximum fundamental frequency
should be read with caution. Overall, it can be assumed that measurement
errors were excluded from the analyses by discarding extreme outliers.
Additionally, voice analyses are usually done on held vowels [May13,
p. 169], mostly for the analysis of pathological voices. The present analyses
were done on samples of spontaneous speech; it can thus be expected that the
values for jitter, shimmer and harmonics-to-noise ratio do not stay beneath the
threshold values separating healthy from pathological voices for these parameters.
Nevertheless, the measured values were consistent across the study, as can be
seen in Table 6.1.
The values for jitter vary between 1.857 % and 2.533 % (see Table 6.1), which is
higher than the 1 % threshold for pathological voices [May13, p. 145],
and the values for shimmer lie between 6.757 % and 9.512 %, which is also
much higher than the thresholds of 2.5 % [NFG06, p. 18] and 3.81 % [May13,
p. 156]. The values for the harmonics-to-noise ratio vary between 15.715 dB
and 19.772 dB and lie beneath the 20 dB threshold for pathological voices
[MS08, p. 26].
As the measured values for jitter, shimmer and HNR were consistent across
the analyses (see Table 6.1), although they cannot be compared to those
measured on held vowels, it can be assumed that the Praat Voice Report is an
appropriate tool for spontaneous speech, even if not all parameters are useful.
Voiceless stretches and pauses occur more often in spontaneous speech than
in held vowels, for instance through plosives in the speech stream or short
pauses, e.g. for breathing.
A parameter that would also be interesting to include in the analysis of voice
is vocal intensity, i.e. the analysis of amplitudes (as partly done with the
parameter of shimmer). As vocal intensity is perceived as loudness, the
question arises whether speakers converge in the height or shape of amplitudes.
In the experiment by De Looze and colleagues [LORC11], voice intensity
was found to become more similar between conversational partners (among
other parameters). Analyses have also been done for individual words using
amplitude envelopes [Lew12] as a measure of spectral similarity.
Parameter  Speaker  Position   Mean    Standard deviation  N
Jitter     F        beginning  2.077   0.503               136
                    end        2.047   0.545               129
           A        beginning  2.517   0.590               24
                    end        2.315   0.378               23
           C        beginning  2.269   0.420               28
                    end        2.468   0.813               29
           D        beginning  1.910   0.414               25
                    end        1.857   0.377               22
           H        beginning  2.247   0.574               25
                    end        2.415   0.708               22
           J        beginning  2.302   0.509               26
                    end        2.173   0.292               23
           K        beginning  2.280   0.510               22
                    end        2.533   0.539               21
Shimmer    F        beginning  7.482   1.497               137
                    end        7.293   1.617               129
           A        beginning  9.504   1.457               24
                    end        8.079   1.416               24
           C        beginning  7.676   1.395               29
                    end        8.372   2.161               29
           D        beginning  7.088   1.194               25
                    end        6.757   0.938               22
           H        beginning  7.102   1.296               25
                    end        7.732   1.709               22
           J        beginning  9.291   1.350               26
                    end        9.512   1.721               24
           K        beginning  7.164   1.376               22
                    end        7.484   1.097               21
HNR        F        beginning  19.354  2.129               137
                    end        19.772  2.185               129
           A        beginning  15.715  1.725               24
                    end        16.789  1.438               24
           C        beginning  19.113  2.105               28
                    end        18.951  3.151               29
           D        beginning  18.450  1.683               25
                    end        18.391  1.757               22
           H        beginning  19.063  1.634               25
                    end        18.905  1.526               22
           J        beginning  16.459  1.905               26
                    end        16.726  2.049               24
           K        beginning  18.487  1.899               22
                    end        17.849  1.584               21
Table 6.1: Mean and standard deviation values for jitter, shimmer and harmonics-to-noise ratio of speaker F and her conversational partners, measured from samples from the beginnings and ends of the six dialogues. Jitter and shimmer are measured in percent (%), the harmonics-to-noise ratio (HNR) in decibel (dB).
6.2 Discussion of results
In the following, the results of the three analyses are discussed (Chapters
5.1-5.3). For the first analysis (Individual speakers), the speakers' ratings
of their conversational partner are included. Afterwards, the results of Analysis
2 (Speaker F) and Analysis 3 (Position in dialogue) are treated.
6.2.1 Discussion of Analysis 1 - Individual speakers
Analysis 1 (Individual speakers) revealed that speakers mainly changed on F0
and that speakers behaved differently: for example, speaker F did not exhibit
changes in every dialogue, and speakers seem to change or maintain parameters
independently of whether their conversational partner does. In the following,
the results of Analysis 1 are discussed. The speakers' ratings of their
conversational partner's social attractiveness and competence (each on a scale
from -8 to 8) are included.
Dialogue A-F
In the dialogue between the speakers A and F, both showed significant changes
on different parameters: speaker F changed on the values of F0 SD, F0 Min
and F0 Max (which led to little variation and thus more monotonous speech),
while speaker A changed on HNR (changing the perceived hoarseness of her
voice) and exhibited a tendency towards significance for shimmer (leading to a
less breathy or rough voice). As the significant values of speaker F departed
from those of speaker A (see Appendix C.1), divergence can be assumed for
speaker F. Speaker A's values, on the other hand, approached those of speaker
F, so for her convergence can be assumed.
The ratings from speaker F about speaker A do not agree with the parameter
values described above, as the ratings are quite high (both 6). Speaker A's
values for HNR and shimmer, on the other hand, approached those of speaker
F, which leads to the assumption that she converged towards speaker F. The
high rating for social attractiveness (8) confirms this result.
Dialogue C-F
In the dialogue between the speakers C and F, only speaker F changed:
significant results were found for F0 Min, along with a tendency towards
significance for F0 Mean. For the minimum of F0, both speakers lowered their
values (which led to more variability in the voice), speaker F more than speaker C
(see Appendix C.2). This can be interpreted in different ways: since both
speakers show the same behaviour in decreasing their values, they are acting
synchronously, which can be interpreted as convergence. Another explanation
might be that both speakers diverged, since the distance between the
values from the beginning and those from the end increased. The values for
F0 Mean of speaker F can be interpreted as divergence, as the lowering of the
value (perceived as a deeper voice) increases the distance between the
conversational partners.
As both speakers rated each other quite positively on social attractiveness
and competence (F about C: 5 and 5; C about F: 4 and 5), convergence
for F0 Min through synchronous behaviour can be assumed. Another explanation
might be that both largely maintained, as speaker F changed significantly on
only one parameter.
Dialogue D-F
Significant parameters in the dialogue of the speakers D and F were F0 SD
and F0 Min for both speakers, and additionally F0 Max for speaker D. The
values of F0 Min can be interpreted as convergence, as speaker D decreased
her values towards speaker F (thus exhibiting more variance in her speech)
while speaker F increased her values (less variance) towards speaker D (see
Chapter 5.1).
The values for F0 Max, on the other hand, can be interpreted as divergence
as they drift apart at the end of the dialogue. The values for speaker D
increased (which leads to more variance), while those of speaker F decreased
(leading to less variance).
F0 SD, as a value reflecting F0 Min and F0 Max and thus the variation of F0,
can be interpreted in different ways. The values for speaker D at the beginning
were smaller than those of speaker F and increased towards the end, whereas
the opposite holds for speaker F. An interpretation of the results for F0 SD is thus
difficult: one explanation might be that both tried to converge and thereby
overshot, increasing or decreasing too much, so that the two speakers did not
succeed in finding an exact balance. The fact that the difference in F0 SD
strongly decreased from the beginning to the end supports this explanation:
at the end their values were not identical, but at least they had approached
each other. Another possibility is that both speakers diverged and showed
asynchronous behaviour, as speaker D's value increased while that of speaker
F decreased.
Yet another explanation might be that their convergence or divergence
depended on the context: they may have had different opinions about certain
topics, and the voice parameters may have varied according to those opinions
and possibly subsequent changes in emotion and arousal. Feeling more excited
about a topic or being more involved in the conversation might have caused the
value changes at the end for speaker F. The ratings of speaker D of speaker F's
social attractiveness and competence were high (6 and 7); those of speaker
F about speaker D were lower, but still positive (4 and 6). This might indicate
that convergence is more likely than divergence and that the values
for F0 SD and F0 Min could rather be interpreted as convergence.
Overall, convergence can be observed for F0 Min and divergence for F0
Max. As Pardo points out, “convergence does not result in exact matching
in all parameters at all times for all interlocutors” [Par12, p. 763]. Thus it
might be possible that not all parameters (including syntactic, lexical
and sublexical aspects) have to be similar at a certain point in time.
Dialogue H-F
In the dialogue of speakers H and F, only speaker H changed significantly,
namely on F0 SD. As she lowered her value towards that of speaker F (thus
exhibiting less variation, leading to more monotonous speech), convergence can
be assumed (see Appendix C.3). Another explanation might be maintenance,
as she changed on only one parameter. For the values of speaker F,
maintenance is the reasonable interpretation.
The ratings of speaker H about speaker F were neutral (social attractiveness
and competence both 0); thus maintenance can be assumed. The ratings of
speaker F about speaker H were positive (5 and 4); nevertheless, she might
have maintained as well.
Dialogue J-F
In the dialogue of speakers J and F, speaker J was the only one who changed
on voice parameters: she increased in F0 Min and decreased in F0 Max
(leading to less variation and thus more monotonous speech). Additionally, F0
Mean and F0 SD showed tendencies towards significance (both lowered,
leading to a deeper voice and more monotonous speech). All values can be
interpreted as convergence, as they all approached those of speaker F (see
Appendix C.4). Speaker F, on the other hand, exhibited no significant changes
at all, only a tendency for F0 Min in the direction of speaker J's value (speaker
F lowered her F0 Min, leading to more variation in speech). Thus it can be
assumed that both speakers converged.
The ratings of speaker J about her conversational partner agree with the
observed F0 values, as they are (slightly) positive (social attractiveness: 2,
competence: 3). The ratings from speaker F about speaker J are both positive;
the value for social attractiveness is quite positive (4) and that for competence
very high (8). Thus the changes in F0 Min can be interpreted as convergence.
Dialogue K-F
In the dialogue between the speakers K and F, speaker F was the only one who
exhibited significant changes, namely on the parameters F0 SD and HNR.
The lowering of her F0 SD value (leading to less variation in speech) and the
raising of her HNR value (resulting in a less noisy voice) can be interpreted as
divergence, as the values drift apart from those of speaker K. Maintenance can
be assumed for speaker K (see Appendix C.5).
These findings contradict the rating of speaker K about speaker F, as it was
quite positive (social attractiveness: 4, competence: 3). The ratings of speaker
F about speaker K were also positive (5 and 4); nevertheless, divergence can
be assumed.
6.2.2 Discussion of Analysis 2 - Speaker F
The second analysis dealt with changes of voice parameters for speaker F in all
six dialogues she had with different conversational partners. Results indicate
that she changed significantly on F0, mainly F0 Mean, F0 SD and F0 Max.
Compared to the first analysis, in which the voice parameters were analysed
within the individual dialogues, differences can be observed: although speaker
F tended to change on values of F0 in the first analysis, no significant changes
for F0 Mean were found there (only one tendency, see Table 5.6). In Analysis 2
(Speaker F), conducted over all six dialogues, significant changes for F0 Mean
were found (see Table 5.9). These differing results might be caused by discarding
outliers separately for each sample collection: outliers of the single dialogues
(see Table 4.1 and Appendix A) on the one hand, and of the sample collection
for speaker F over all dialogues (see Table 4.2 and Appendix B) on the other.
Thus the union of the samples from the single dialogues for speaker F is not
identical to the sample collection of all dialogues of speaker F. This distinction
was made in order to avoid sifting out samples that are abnormal for speaker
F in a single dialogue but normal for her across all dialogues.
Overall, the results confirm those of Analysis 1 (Individual speakers, see
Chapter 5.1), as the changes there also mainly occurred for F0. Analysis 1 also
revealed that speakers behaved differently: speaker F changed parameters in
some dialogues (e.g. dialogue D-F) and maintained in others (e.g. dialogue
H-F). This implies that speaker F reacted to different conversational partners
and accordingly changed parameters of the fundamental frequency.
No significant changes for jitter, shimmer or HNR were found.
6.2.3 Discussion of Analysis 3 - Position in dialogue
The third analysis (Position in dialogue) concerned changes of voice parameters
between the beginnings and ends of the individual dialogues. Only a few
significant results were obtained, all within F0 (namely F0 SD, F0 Min,
F0 Max and a tendency towards significance for F0 Mean); in addition, there
was a tendency towards significance for the changes in jitter (see Table
5.24). These results are similar to those of the first (see Chapter 5.1, Individual
speakers) and the second analysis (see Chapter 5.2, Speaker F), in which
changes of F0 were also the (mainly) significant ones.
6.3 Changes in fundamental frequency
Most changes of voice parameters were related to F0 in all three analyses
(Chapters 5.1-5.3). Gregory and colleagues found that information about, for
instance, social status [GW96] and dominance [GG02] can be conveyed through
the fundamental frequency (or rather frequency bands below 500 Hz). In an
experiment by Gregory et al., participants heard their conversational partner
either unaltered or filtered (low-pass filtered at 550 Hz or high-pass filtered at
1000 Hz) [GDW97]. Convergence was found for the unaltered and
the low-pass filtered conditions, but not for the high-pass filtered condition,
in which the regions of F0 had been cut out. Additionally, ratings of
the conversation were more negative for the filtered conditions, slightly more
so for the high-pass filtered condition. Gregory and colleagues concluded that
F0 plays a significant role, as it transfers social information and allows for
accommodation.
In a similar experiment, Babel and Bulatov high-pass filtered recordings of
words uttered by a male speaker at 300 Hz (thus cutting out the regions of F0)
[BB11]. Participants then heard either the unaltered or the filtered version and
were told to shadow the words they heard. Analysis of F0 revealed that
repetitions in the unaltered condition were more similar to the target word than
in the filtered condition. An additional AXB perception test confirmed this
result, as listeners judged repetitions in the unaltered condition to be
more similar to the target word than repetitions in the filtered condition. As
Babel and Bulatov point out, the two measures of accommodation did not
correlate. They conclude that F0 is a parameter that can be accommodated,
but that it is not the only one, as signals are complex and multiple acoustic
features are available, e.g. VOT [Nie07, Nie10] and vowel quality [Bab09, Bab12].
The present thesis confirmed that speakers change most, possibly due to
convergence or divergence, on values of F0; almost no significant changes
were found for the other voice parameters. Thus it can be concluded that
across the dialogues (and thus a large time span) and across the examined
voice parameters, convergence of voice mostly occurs for F0.
6.4 Temporal aspects
One explanation for why changes in voice parameters occurred mainly for
values of F0 and not for other parameters might be the extraction
of the samples from the first and the last five minutes of the dialogues. It might
be that accommodation effects occur very early in the dialogue, as Schweitzer
and Lewandowski also supposed [SL13]; nevertheless, they could not find an
effect of time in their analyses of articulation rate.
De Looze and colleagues suppose that “mimicry [similar to convergence]
is not a linear phenomenon but rather dynamic” [LORC11, p. 1297]. This
would imply that phases of convergence and non-convergence could occur
within a spontaneous dialogue.
In the experiment of Levitan and Hirschberg, different parameters turned
out to be significant depending on the time span from which mean values
were extracted [LH11]. In their experiment, participants, working in pairs,
played three computer games without being able to see each other. Results
indicate that during the first game participants changed significantly on mean
intensity, shimmer and NHR. Over the whole session, including all three
computer games, participants changed on jitter and F0 mean. At the turn
level, even more changes could be found (F0 mean, F0 max, shimmer, NHR).
Thus it can be concluded that accommodation/convergence is sensitive to
temporal aspects (and to the means of measurement), as the results of Levitan
and Hirschberg also differ from the findings of the present thesis. It can be
assumed that some voice parameters tend to change early in conversation and
that others require more time for the participants to converge. Additionally,
it could be possible that some voice parameters change on the global level,
i.e. over the whole conversation, and others on local levels, i.e. from turn to
turn [LH11].
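The distinction between global and local (turn-level) measurement can be made concrete with a small sketch; the per-turn F0 values and the difference-of-means similarity measure are invented for illustration and do not reproduce either study's procedure:

```python
import numpy as np

def convergence_distance(param_a, param_b):
    """Distance between two speakers on one parameter: the absolute
    difference of their means. Smaller values mean more similar."""
    return abs(np.mean(param_a) - np.mean(param_b))

# Hypothetical per-turn F0 means (Hz) for two speakers over a dialogue.
f0_a = np.array([210.0, 208.0, 205.0, 202.0, 200.0, 198.0])
f0_b = np.array([180.0, 184.0, 188.0, 190.0, 193.0, 196.0])

# Global level: compare the first and the second half of the session.
early = convergence_distance(f0_a[:3], f0_b[:3])
late = convergence_distance(f0_a[3:], f0_b[3:])
globally_converging = late < early  # speakers drift together over time

# Local level: turn-by-turn distances, e.g. to trace dynamic phases
# of convergence and non-convergence within the dialogue.
local_distances = np.abs(f0_a - f0_b)
```

Depending on which of these levels is inspected, different parameters may appear to converge, which is one way the divergent findings across studies can arise.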
As the dialogues analysed in the present thesis lasted about 25 minutes,
F0 could possibly be a global parameter (changes of F0 mean were also
significant for the whole session in Levitan and Hirschberg's experiment
[LH11]). Future research could provide more evidence on voice parameters
and their temporal scope in speech.
6.5 Engagement
De Looze and colleagues revealed that becoming more similar on prosodic
parameters (median F0, F0 SD, pauses (number and duration), voice inten-
sity) is correlated with involvement in the interaction rather than with agreement1
[LORC11]. This finding agrees with the coordination-engagement hypothesis
of Niederhoffer and Pennebaker, which states that “the more that two peo-
ple in a conversation are actively engaged with one another - in a positive
or even negative way - the more verbal and nonverbal coordination [is ex-
pected]” [NP02, p. 358]. The hypothesis assumes that engagement, whether
positive or negative, might influence accommodation/convergence, rather than
rapport [NP02, p. 358]. Attention also plays a major role, as speakers do not
engage if they are not listening to their conversational partner and/or if they
are distracted. The coordination-engagement hypothesis is compatible with CAT,
as speakers can still converge to achieve social goals or to make the conversation
run smoothly, or diverge/maintain to emphasize the differences between
themselves and others.
In the study of Schweitzer and Lewandowski, speakers' local articulation
rates were influenced by the preceding articulation rate of the conversational
partner and by social factors, i.e. mutual liking [SL13]. Results of other studies
also revealed that ratings of mutual attractiveness and/or liking were influ-
ential [PG08, Nat75, ACG+11]. Thus it can be assumed that social factors
have an influence on speech. As the present ratings for social attractiveness
and competence agreed partially with the observed changes in voice parameters,
it could be assumed that both the speakers' engagement, possibly stimulated
by the discussed topics, and the evaluation of the dialogue partner influenced
the speakers and thus convergence. Additional dimensions, including phonetic
talent, which has been shown to be influential [Lew12], should also be considered.
In addition, future research could reveal which factors are influential, how
distinct they are and how they are related.
1Speakers’ agreement was annotated for disagreement, neutral speech and agreement. In-volvement was annotated for a group of speakers (not for individual speakers) on a scalefrom 0 to 10.
7 Conclusion and Outlook
In the present thesis, voice parameters were investigated at the beginnings and
ends of dialogues. Three different analyses were conducted in order to identify
which voice parameters are most prone to accommodation. For this purpose,
values for the fundamental frequency (mean, standard deviation, minimum,
maximum), jitter, shimmer and harmonics-to-noise ratio were considered.
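Among these parameters, jitter and shimmer are perturbation measures; a minimal sketch of their common local (cycle-to-cycle) definitions, using invented cycle data rather than values from the analysed corpus, looks like this:

```python
import numpy as np

def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, relative to the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Local shimmer: mean absolute difference between consecutive
    peak amplitudes, relative to the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Invented cycle data: periods in seconds (roughly a 200 Hz voice)
# and the corresponding peak amplitudes.
periods = [0.0050, 0.0051, 0.0049, 0.0050, 0.0052]
amps = [0.80, 0.78, 0.81, 0.79, 0.80]

jit = local_jitter(periods)    # a small fraction, often reported in percent
shim = local_shimmer(amps)
```

A perfectly periodic voice would yield zero for both measures; irregular vocal fold vibration raises them, which is why they are treated as voice quality parameters alongside the harmonics-to-noise ratio.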
Results showed that the most significant changes of voice parameters occur for
the fundamental frequency. Thus it can be concluded that accommodation of
voice parameters, at least over the global time span of the dialogues, is expressed
through the average and the variance of the fundamental frequency.
Additionally, Analysis 1, concerning the individual speakers, showed that
voice changes along different parameter combinations, which indicates that ac-
commodation of voice is probably not a fully automatic process as proposed by
Pickering and Garrod [GP04, PG04, PG06, MPG12]. Instead, social variables,
for instance the wish to gain the approval of the conversational partner [SG82]
or to make the conversation run smoothly [GGJ+95], as proposed in the
Communication Accommodation Theory, might influence the (speech) behaviour
of the conversational partners, as might engagement, as proposed in the
coordination-engagement hypothesis, and phonetic talent.
In future work, more detailed analyses could reveal further details about con-
vergence of voice parameters. To this end, voice parameters could be evaluated
on different temporal levels, e.g. over the whole conversation or over turns, as
some parameters might change on a global level and others on local levels.
Additionally, detailed analyses could bring new insights into the influence of
context and topic, as well as of engagement, attention and emotions, on voice
parameters. A detailed questionnaire revealing speakers' opinions, impressions
and attitudes towards the discussed topics, the conversational partner and the
whole situation during the dialogue could show which factors are most influential
on convergence of voice and to what extent. This might bring
new insights into the causes and motivations for speakers to change along
different voice parameters.
A possible next step is the evaluation of the engagement of conversational
partners (in the GeCo corpus). To this end, listeners could rate the degree of
engagement; this factor might then prove influential for accommodation/convergence.
Bibliography
[Abe67] Abercrombie, D. (1967). Elements of general phonetics. Volume 203.
Edinburgh: Edinburgh University Press.
[ACG+11] Abrego-Collier, C., Grove, J., Sonderegger, M., Yu A.C.L.
(2011). Effects of speaker evaluation on phonetic convergence. In
Proceedings of the 17th International Congress of Phonetic Sci-
ences (ICPhS XVIII), Hong Kong, China, (192-195). Retrieved
2014, July 9 from http://www.icphs2011.hk/resources/OnlineProceedings/
RegularSession/Abrego-Collier/Abrego-Collier.pdf
[AJL87] Aronsson, K., Jonsson, L., Linell, P. (1987). The courtroom hearing
as a middle ground: Speech accommodation by lawyers and defendants.
Journal of Language and Social Psychology, 6(2), 99-115.
doi:10.1177/0261927X8700600202
[Azu97] Azuma, S. (1997). Speech accommodation and Japanese
Emperor Hirohito. Discourse and Society, 8(2), 189-202.
doi:10.1177/0957926597008002003
[Bab09] Babel, M.E. (2009). Phonetic and Social Selectivity in Speech Ac-
commodation (Doctoral Dissertation), University of California, Berkeley.
Retrieved 2014, July 9 from http://linguistics.berkeley.edu/dissertations/
Babel dissertation 2009.pdf
[Bab12] Babel, M. (2012). Evidence for phonetic and social selectivity in
spontaneous phonetic imitation. Journal of Phonetics, 40(1), 179-188.
doi:10.1016/j.wocn.2011.09.001
[Bau00] Baugh, J. (2000). Racial identification by speech. American Speech,
75(4), 362-364. doi:10.1215/00031283-75-4-362
[BB11] Babel, M., Bulatov D. (2011). The role of fundamental frequency
in phonetic accommodation. Language and Speech, 55(2), 231-248.
doi:10.1177/0023830911417695
[BFB04] Belin, P., Fecteau, S., Bedard, C. (2004). Thinking the voice: neural
correlates of voice perception. Trends in Cognitive Sciences, 8(3), 129-135.
doi:10.1016/j.tics.2004.01.008
[BG77] Bourhis, R.Y., Giles, H. (1977). The language of intergroup distinc-
tiveness. In H. Giles (Ed.), Language, Ethnicity and Intergroup Relations
(119-135). London: Academic Press.
[BHL+94] Berry, D.S., Hansen, J.S., Landry-Pester, J.C., Meier, J.A. (1994).
Vocal determinants of first impressions of young children. Journal of Non-
verbal Behavior, 18(3), 187-197. doi:10.1007/BF02170025
[Boe03] Boersma, P. (2003, May 21). Voice 3. Shimmer. Retrieved
2014, June 14 from http://www.fon.hum.uva.nl/praat/manual/Voice 3
Shimmer.html
[Boe09] Boersma, P. (2009). Should jitter be measured by peak picking or
by waveform matching? Folia Phoniatrica et Logopaedica, 61(5), 305-308.
doi:10.1159/000245159
[Boe11] Boersma, P. (2011, March 2). Voice 2. Jitter. Retrieved 2014, June 14
from http://www.fon.hum.uva.nl/praat/manual/Voice 2 Jitter.html
[Bra94] Braun, A. (1994). The effect of cigarette smoking on vocal parameters.
In ESCA Workshop on Automatic Speaker Recognition, Identification and
Verification, Martigny, Switzerland (161-164). Retrieved 2014, July 9 from
www.isca-speech.org/archive open/archive papers/asriv94/sr94 161.pdf
[BW14] Boersma, P., Weenink, D. (2014). Praat: doing phonetics by computer
(Version 5.3.63) [Computer program]. Retrieved 2014, May 18. Available
from http://www.praat.org/
[Byr71] Byrne, D. (1971). The attraction paradigm. New York: Academic
Press.
[CB99] Chartrand, T.L., Bargh, J.A. (1999). The chameleon effect: the
perception-behavior link and social interaction. Journal of Personality
and Social Psychology, 76(6), 893-910. Retrieved 2014, June 9 from http:
//www.yale.edu/acmelab/articles/chartrand bargh 1999.pdf
[CJ97] Coupland, N., Jaworski, A. (1997). Relevance, accommodation and
conversation: Modeling the social dimension of communication. Multilin-
gua, 16(2-3), 233-258. doi:10.1515/mult.1997.16.2-3.233
[Cla96] Clark, H.H. (1996). Using language. Cambridge: Cambridge Univer-
sity Press.
[DB01] Dijksterhuis, A., Bargh, J.A. (2001). The perception-behavior expressway: Automatic effects of social perception on social behavior. In M.P.
Zanna (Ed.), Advances in Experimental Social Psychology, Volume 32,
(pp. 1-40). Retrieved 2014, January 1 from http://www.yale.edu/acmelab/
articles/Dijksterhuis Bargh 2001.pdf
[DBSA11] Duggan, A.P., Bradshaw Y.S., Swergold N., Altman W. (2011):
When rapport building extends beyond affiliation: communication overac-
commodation toward patients with disabilities. The Permanente Journal,
15(2), 23-30. Retrieved 2014, June 9 from http://www.ncbi.nlm.nih.gov/
pmc/articles/PMC3140744/pdf/i1552-5775-15-2-23.pdf
[DR10] Dias, J.W. and Rosenblum, L.D. (2010). Visual influences on interac-
tive speech alignment. Perception, 40(12), 1457-1466. doi:10.1068/p7071
[EN93] Edwards, H., Noller, P. (1993). Perceptions of overaccommodation
used by nurses in communication with the elderly. Journal of Language
and Social Psychology, 12(3), 207-223. doi:10.1177/0261927X93123003
[ES96] Ellgring, H., Scherer, K.R. (1996). Vocal indicators of mood
change in depression. Journal of Nonverbal Behavior, 20(2), 83-110.
doi:10.1007/BF02253071
[Esl78] Esling, J. (1978). The identification of features of voice quality in social
groups. Journal of the International Phonetic Association, 8(1-2), 18-23.
doi:10.1017/S0025100300001699
[FBSW03] Fowler, C.A., Brown, J.M., Sabadini, L., Weihing, J. (2003). Rapid
access to speech gestures in perception: Evidence from choice and simple
response time tasks. Journal of Memory and Language, 49(3), 396 - 413.
doi:10.1016/S0749-596X(03)00072-X
[FHE07] Farrus, M., Hernando, J., Ejarque, P. (2007). Jitter and shimmer
measurements for speaker recognition. In Proceedings of the 8th Annual
Conference of the International Speech Communication Association (In-
terspeech 2007), Antwerp, Belgium (778-781). Received 2014, July 18 from
http://nlp.lsi.upc.edu/papers/far jit 07.pdf
[FM40] Fay, P.J., Middleton, W.C. (1940). Judgement of intelligence from the
voices transmitted over a public address system. Sociometry, 3(2), 186-191.
doi:10.2307/2785442
[FMK98] Frohlich, M., Michaelis, D., Kruse, E. (1998). Objektive Beschreibung der Stimmgute unter Verwendung des Heiserkeits-Diagramms. HNO, 46(7), 684-689. doi:10.1007/s001060050295
[GA04] Goldinger, S.D., Azuma, T. (2004). Episodic memory reflected in
printed word naming. Psychonomic: Bulletin and Review, 11(4), 716-722.
doi:10.3758/BF03196625
[GCC91] Giles, H., Coupland, J., Coupland, N. (1991). Accommodation the-
ory: Communication, context and consequence. In H. Giles, J. Coupland,
N. Coupland (Eds.), Contexts of accommodation. Developments in applied
sociolinguistics (1-68). Westport: Greenwood Publishing Group.
[GDW97] Gregory, S.W., Dagan, K., Webster, S. (1997). Evaluating the re-
lation of vocal accommodation in conversation partner’s fundamental fre-
quencies to perceptions of communication quality. Journal of Nonverbal
Behavior, 27(1), 23-43. doi:10.1023/A:1024995717773
[GER93] Gordon, P.C., Eberhardt, J.L., Rueckl, J.G. (1993). Attentional
modulation of phonetic significance of acoustic cues. Cognitive Psychology,
25(1), 1-42. doi:10.1006/cogp.1993.1001
[GG98] Gallois, C., Giles, H. (1998). Accommodating mutual influence in in-
tergroup encounters. In M.T. Palmer, G.A. Barnett (Eds.), Progress in
communication sciences, Volume 14, (135-162).
[GG02] Gregory, S.W., Gallagher, T.J. (2002). Spectral analysis of candidates’
nonverbal vocal communication: Predicting U.S. presidential election out-
comes. Social Psychology Quarterly, 65(3), 298-308. Retrieved 2014, July 9
from http://www.jstor.org/stable/3090125
[GG13] Giles, H., Gasiorek, J. (2013). Parameters of nonaccommodation:
Refining and elaborating Communication Accommodation Theory. In J.P.
Forgas, O. Vincze, J. Laszlo (Eds.), Social cognition and communication.
The Sydney Symposium of Social Psychology (155-172). New York: Psychology Press.
[GGJ+95] Gallois, C., Giles, H., Jones, E., Cargile, A.C., Ota, H. (1995).
Accommodating intercultural encounters: Elaborations and extensions. In
R. Wiseman (Ed.), Intercultural communication theory (115-147). Thousand Oaks: Sage publications.
[GH82] Gregory, S.W, Hoyt, B.R. (1982). Conversation partner mutual adap-
tation as demonstrated by fourier series analysis. Journal of Psycholinguis-
tic Research, 11(1), 35-46. doi:10.1007/BF01067500
[Gil73] Giles, H. (1973). Accent mobility: A model and some data. Anthro-
pological Linguistics, 15(2), 87-109. Retrieved 2014, July 9 from http:
//www.jstor.org/stable/30029508
[GLS13] Garnier, M., Lamalle, L., Sato, M. (2013). Neural correlates in pho-
netic convergence and speech imitation. Frontiers in Psychology, 4(600),
15 pages. doi:10.3389/fpsyg.2013.00600
[GO06] Giles, H., Ogay, T. (2006). Communication accommodation theory. In
B. B. Whaley and W. Samter (Eds.), Explaining Communication: Con-
temporary theories and exemplars (293-310). Mawah: Lawrence Erlbaum
Assosiates.
[Gol96] Goldinger, S.D. (1996). Words and voices: Episodic traces in spoken
identification and recognition memory. Journal of Experimental Psychol-
ogy: Learning, Memory and Cognition, 22(5), 1166-1183. Retrieved 2014,
July 9 from http://www.public.asu.edu/∼sgolding/docs/pubs/Goldinger
JEPLMC 96.pdf
[Gol97] Goldinger, S.D. (1997). Perception and production in an episodic lexicon. In K. Johnson, J.W. Mullennix (Eds.), Talker variability in speech
processing (33-66). San Diego: Academic Press.
[Gol98] Goldinger, S.D. (1998). Echoes of echoes? An episodic the-
ory of lexical access. Psychological Review, 105(2), 251-279. Retrieved
2014, July 9 from http://www.cog.brown.edu/courses/cg195/pdf files/
Goldinger%201998.pdf
[Gol13] Goldinger, S.D. (2013). The cognitive basis of spontaneous imitation:
Evidence from the visual world. In Proceedings of Meetings in Acoustics,
Volume 19, Montreal, Canada (6 pages). doi:10.1121/1.4800039
[GOG05] Gallois, C., Ogay, T., Giles, H. (2005): Communication Accommo-
dation Theory, In W.B. Gudykunst (Ed.), Theorizing about intercultural
communication (121-148). Thousand Oaks: Sage publications.
[GP04] Garrod, S., Pickering, M.J. (2004). Why is conversation so easy? Trends
in Cognitive Sciences, 8(1), 8-11. doi:10.1016/j.tics.2003.10.016
[GP75] Giles, H., Powesland, F.P. (1975). Speech style and social variation,
Volume 7 of European monographs in social psychology. London: Academic
Press.
[GPL91] Goldinger, S., Pisoni, D., Logan, J. (1991). On the nature of talker
variability effects on recall of spoken word lists. Journal of Experimen-
tal Psychology: Learning, Memory and Cognition, 17(1), 152-162. Re-
trieved 2014, July 9 from http://www.public.asu.edu/∼sgolding/docs/
pubs/Goldinger etal JEPLMC 91.pdf
[Gre90] Gregory, S. W. (1990). Analysis of fundamental frequency reveals co-
variation in interview partners’ speech. Journal of Nonverbal Behavior,
14(4), 237-251. doi:10.1007/BF00989318
[GS10] Goudbeek, M., Scherer, K. (2010). Beyond arousal: Valence and po-
tency/control cues in the vocal expression of emotion. Journal of the Acous-
tical Society of America, 128(3), 1322-1336. doi:10.1121/1.3466853
[GTB73] Giles, H., Taylor, D.M., Bourhis, R. (1973). Towards a theory of interpersonal accommodation through language: Some Canadian data. Language in Society, 2(2), 177-192. doi:10.1017/S0047404500000701
[GW96] Gregory, S.W., Webster, S. (1996). A nonverbal signal in voices of in-
terview partners effectively predicts communication accommodation and
social status perception. Journal of Personality and Social Psychology,
70(6), 1231-1240. Retrieved 2014, July 9 from http://www.columbia.edu/
∼rmk7/HC/HC Readings/Gregory.pdf
[HDJ+01] Hollien, H., DeJong, G., Martin, C.A., Schwartz, R., Liljegren, K. (2001). Effects of ethanol intoxication on speech suprasegmentals. Journal of the Acoustical Society of America, 110(6), 3198-3206. doi:10.1121/1.1413751
[Hin86] Hintzman, D. (1986). Schema abstraction in a multiple-trace
memory model. Psychological Review, 93(4), 411-428. Retrieved
2014, July 9 from http://www.sfs.uni-tuebingen.de/∼gjaeger/lehre/ss08/
exemplarBased/hintzman86.pdf
[Hog85] Hogg, M. (1985). Masculine and feminine speech in dyads and groups:
A study of speech style and gender salience. Journal of Language and So-
ciety, 4(2), 99-112. doi: 10.1177/0261927X8500400202
[Inc12] IBM Corporation (2012): IBM SPSS Statistics for Windows (Version
21.0) [Computer program]. New York: IBM Corporation.
[JF70] Jaffe, J., Feldstein, S. (1970). Rhythms of Dialogue. New York: Aca-
demic Press.
[JL13] Janssen, J., Laatz, W. (2013). Statistische Datenanalyse mit SPSS.
Eine anwendungsorientierte Einfuhrung in das Basissystem und das Modul
Extakte Tests. Berlin, Heidelberg: Springer Gabler Verlag.
[Joh03] Johnson, K. (2003). Acoustic and auditory phonetics. 2nd edition. Ox-
ford: Blackwell Publishing.
[KFM02] Krauss, R.M., Freyberg, R., Morsella, E. (2002). Inferring speakers' physical attributes from their voices. Journal of Experimental Social Psychology, 38(6), 618-625. doi:10.1016/S0022-1031(02)00510-3
[KJS09] Ko, S.J., Judd, C.M., Stapel, D.A. (2009). Stereotyping based on
voice in the presence of individuating information: Vocal femininity affects
perceived competence but not warmth. Personality and Social Psychology
Bulletin, 35(2), 198-211. doi:10.1177/0146167208326477
[KP04] Krauss, R. M. and Pardo, J. S. (2004). Commentary on Pickering and
Garrod. Is alignment always the result of automatic priming? Behavioral
and Brain Sciences, 27(2), 203-204. doi:10.1017/S0140525X0436005X
[KS11] Kreiman, J., Sidtis, D. (2011). Foundations of voice studies.
An interdisciplinary approach to voice production and perception. Oxford:
Wiley-Blackwell.
[Lav68] Laver, J.M.D. (1968). Voice quality and indexical information. British
Journal of disorders of Communication, 3(1), 43-54. Retrieved 2014, July
10 from http://vambo.cent.gla.ac.uk/media/media 200297 en.pdf
[Lav76/97] Laver, J. (1997). Language and non-verbal communication. In J.
Laver (Ed.), The gift of speech: readings in the analysis of speech and voice
(131-146). Edinburgh: Edinburgh University Press.
(Reprinted from Language and Speech, Volume 7 of Handbook of Percep-
tion, 345-363, by E.C. Carterette, M.P. Friedmann, Eds., 1976, New York:
Academic Press)
[Lav03] Laver, J. (2003). Three semiotic layers of spoken communication.
Journal of Phonetics, 31(3-4), 413-415. doi:10.1016/S0095-4470(03)00034-2
[Lew12] Lewandowski, N. (2012): Talent in nonnative phonetic dialogue (Doc-
toral Dissertation). University of Stuttgart, Stuttgart. Retrieved 2014,
July 10 from http://elib.uni-stuttgart.de/opus/volltexte/2012/7402/pdf/
Lewandowski.pdf
[LH11] Levitan, R., Hirschberg, J. (2011). Measuring acoustic-prosodic en-
trainment with respect to multiple levels and dimensions. In Proceed-
ings of the 12th Annual Conference of the International Speech Com-
munication Association (Interspeech 2011), Florence, Italy (3081-3084).
Retrieved 2014, July 10 from http://www.cs.columbia.edu/∼julia/papers/
levitan&hirschberg11.pdf
[Lig89] Lightfoot, N. (1989). Effects of familiarity on serial recall for spoken
word lists. Research on speech perception, Progress report No. 15., 421-443.
Retrieved 2014, July 10 from http://files.eric.ed.gov/fulltext/ED318074.
[Lin98] Linville, S. (1998). Acoustic correlates of perceived versus actual sexual
orientation in men's speech. Folia Phoniatrica et Logopaedica, 50(1), 35-48.
doi:10.1159/000021447
[LJB05] Laukka, P., Juslin, P.N., Bresin, R. (2005). A dimensional approach
to vocal expression of emotion. Cognition and Emotion, 19(5), 633-653.
doi:10.1080/02699930441000445
[LLA+08] Laukka, P., Linnman, C., Ahs, F., Pissiota, A., Frans, O., Faria,
V., Michelgard, A., Appel, L., Frederikson, M., Furmark, T. (2008). In a
nervous voice: acoustic analysis and perception of anxiety in social phobic’s
speech. Journal of Nonverbal Behavior, 32(4), 195-214. Retrieved 2014,
July 10 from http://www.ohio.edu/people/leec1/documents/sociophobia/
Laukka Petri.pdf
[LORC11] de Looze, C., Oertel, C., Rauzy, S., Campbell, N. (2011). Measuring dynamics of mimicry by means of prosodic cues in conversational speech. In Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS XVII), Hong Kong, China (1294-1297). Retrieved
2014, July 9 from http://www.icphs2011.hk/resources/OnlineProceedings/
RegularSession/de%20Looze/de%20Looze.pdf
[LP05] Levi, S., Pisoni D. P. (2005). Indexical and Linguistic Channels
in Speech Perception: Some Effects of Voiceovers on Advertising Out-
comes. Research on spoken language processing, Progress Report No. 27,
65-80. Retrieved 2014, July 10 from http://www.iu.edu/∼srlweb/pr/27/
65-Levi-Pisoni.pdf
[MATB14] McAleer, P., Todorov, A., Belin, P. (2014). How do you say
’Hello’? Personality Impressions from brief novel voices. PLoS ONE, 9(3),
doi:10.1371/journal.pone.0090779
[May13] Mayer, J. (2013). Phonetische Analysen mit Praat. Ein Handbuch
fur Ein- und Umsteiger. Retrieved 2014, May 18 from http://praatpfanne.
lingphon.net/downloads/praat manual.pdf
[MD87] Murphy, C.H., Doyle P.C. (1987). The effects of cigarette smoking on
voice-fundamental frequency. Otolaryngol Head Neck Surgery, 97(4), 376-
380. Retrieved 2014, July 10 from http://oto.sagepub.com/content/97/4/
376
[MMPS89] Martin, C.S., Mullennix, J.W., Pisoni, D.B., Summers, W.V.
(1989). Effects of talker variability on recall of spoken word lists. Jour-
nal of Experimental Psychology: Learning, Memory and Cognition, 15(4),
676-684. Retrieved 2014, July 10 from http://www.ncbi.nlm.nih.gov/pmc/
articles/PMC3510481/pdf/nihms418731.pdf
[MPG12] Menenti, L., Pickering, M.J., Garrod, S.C. (2012). Toward a neural
basis of interactive alignment in conversation. Frontiers in Human Neuro-
science, 6(185), 9 pages. doi:10.3389/fnhum.2012.00185
[MS08] Minnema, W., Stoll, H.-C. (2008). Objektive computergestutzte Stimmanalyse mit Praat. Forum Logopadie, 4(22), 24-29. Retrieved 2014, July
10 from https://www.wevosys.com/knowledge/ data knowledge/13.pdf
[MTHH14] Moreau, M.L., Thiam, N., Harmegnies, B., Huet, K.(2014). Can
listeners assess the sociocultural status of speakers who use a language they
are unfamiliar with? A case study of Senegalese and European students
listening to Wolof speakers. Journal of Language in Society, 43(3), 333-348.
doi:10.1017/S0047404514000220
[Mur99] Murphy, P.J. (1999). Perturbation-free measurement of the
harmonics-to-noise ratio in voice signals using pitch synchronous harmonic
analysis. Journal of the Acoustical Society of America, 105(5), 2866-2881.
doi:10.1121/1.426901
[Nat75] Natale, M. (1975). Convergence of mean vocal intensity in dyadic
communication as a function of social desirability. Journal of Personality
and Social Psychology, 32(5), 790-804. doi:10.1037/0022-3514.32.5.790
[NFG06] Nawka, T., Franke, I., Galkin, E. (2006). Objektive Messverfahren in der Stimmdiagnostik. Forum Logopadie, 4(20), 14-21. Retrieved 2014, July
10 from https://www.wevosys.com/knowledge/ data knowledge/7.pdf
[NP02] Niederhoffer, K.G., Pennebaker, J.W. (2002). Linguistic style match-
ing in social interaction. Journal of Language and social psychology, 21(4),
337-360. doi:10.1177/026192702237953
[Nie07] Nielsen, K.Y. (2007). Implicit phonetic imitation is constrained by
phonemic contrast. In Proceedings of the 16th International Congress of
Phonetic Sciences (ICPhS XVI), Saarbrucken, Germany (1961-1964). Re-
trieved 2014, July 10 from http://www.icphs2007.de/conference/Papers/
1641/1641.pdf
[Nie10] Nielsen, K.Y. (2010): Specificity and abstractness of VOT imitation.
Journal of Phonetics, 39(2), 132-142. doi:10.1016/j.wocn.2010.12.007
[NNS02] Namy, L.L., Nygaard, C. L., Sauerteig D. (2002). Gender differences
in vocal accommodation: The role of perception. Journal of Language and
Social Psychology, 21(4), 422-432. doi:10.1177/026192702237958
[NQ08] Nygaard, L.C., Queen, J.S. (2008). Communicating emotion: Linking
affective prosody and word meaning. Journal of Experimental Psychology:
Human Perception and Performance, 34(4), 1017-1030. doi: 10.1037/0096-
1523.34.4.1017
[NSP94] Nygaard, L.C., Sommers, M.S., Pisoni, D.B. (1994). Speech per-
ception as a talker-contingent process. Psychological Science, 5(1), 42-46.
doi:10.1111/j.1467-9280.1994.tb00612.x
[NW11] Newman, M., Wu, A. (2011). Do you sound Asian when you speak English? Racial identification and voice in Chinese and Korean Americans' English. American Speech, 86(2), 152-178. doi:10.1215/00031283-1336992
[Par06] Pardo, J. (2006). On phonetic convergence during conversational in-
teraction. Journal of Acoustical Society of America, 119(4), 2382-2393.
doi:10.1121/1.2178720
[Par10] Pardo, J. (2010). Expressing oneself in conversational interaction. In E.
Morsella (Ed.), Expressing oneself/expressing one's self: communication,
cognition, language, and identity (183-196). London: Taylor and Francis.
[Par12] Pardo, J. (2012). Reflections on phonetic convergence: Speech percep-
tion does not mirror speech production. Language and Linguistics Compass,
6(12), 753-767. doi:10.1002/lnc3.367
[Pea31] Pear, T.H. (1931). Voice and personality as applied to radio broadcast-
ing. New York: Wiley.
[PG04] Pickering, M.J., Garrod S. (2004). Toward a mechanistic psy-
chology of dialogue. Behavioral and brain sciences, 27(2), 169-190.
doi:10.1017/S0140525X04000056
[PG06] Pickering, M.J., Garrod, S. (2006). Alignment as the Basis for Suc-
cessful Communication. Research on Language and Computation, 4(2-3),
203-228. Retrieved 2014, July 10 from http://www.speech.kth.se/∼edlund/
bielefeld/references/pickering-and-garrod-2006.pdf
[PG08] Pitts, M.J., Giles, H. (2008). Social psychology and personal relation-
ships: Accommodation and relational influences across time and contexts.
In: G. Antos, V. Ventola (Eds.), Handbook of Interpersonal Communica-
tion, Volume 2 of Handbook of Applied Linguistics (15-31). Berlin: Mouton
de Gruyter.
[PGP93] Palmeri, T.J., Goldinger, S.D., Pisoni, D.B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory and Cognition, 19(2), 309-328.
Retrieved 2014, July 10 from http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3499966/pdf/nihms418709.pdf
[PGSK12] Pardo, J., Gibbons, R., Suppes, A., Krauss, R.M. (2012). Phonetic
convergence in college roommates. Journal of Phonetics, 40(1), 190-197.
doi:10.1016/j.wocn.2011.10.001
[Pie01] Pierrehumbert, J. (2001). Exemplar dynamics: word frequency, leni-
tion and contrast. In Bybee, J. and Hopper P. (Eds.), Frequency effects
and the emergence of linguistic structure, Volume 45 of Typological studies of language (137-157). Amsterdam: John Benjamins Publishing.
[PJK10] Pardo, J., Jay, I.C., Krauss, R.M. (2010). Conversational role influ-
ences speech imitation. Attention, Perception and Psychophysics, 72(8),
2254-2264. doi:10.3758/BF03196699
[PLB+09] Petrovic-Lazic, M., Babac, S., Vukovic, M., Kosanovic, R., Ivankovic, Z. (2009). Acoustic voice analysis of patients with vocal fold polyp. Journal of Voice, 25(1), 94-97. doi:10.1016/j.jvoice.2009.04.002
[RST04] Ryalls, J., Simon, M., Thomason, J. (2004). Voice Onset Time pro-
duction in older Caucasian- and African-Americans. Journal of multilin-
gual communication disorders, 2(1), 61-67. Retrieved 2014, July 10 from
http://informahealthcare.com/doi/pdf/10.1080/1476967031000090980
[RZB97] Ryalls, J., Zipprer, A. Baldauff, P. (1997). A preliminary investiga-
tion of the effects of gender and race on Voice Onset Time. Journal of
Speech, Language and Hearing Research, 40(3), 642-645. Retrieved 2014,
July 10 from http://jslhr.pubs.asha.org/article.aspx?articleid=1781846
[Sau01] de Saussure, F. (2001). Grundfragen der allgemeinen Sprachwis-
senschaft (3rd ed.). (H. Lommel, Transl.). Berlin: de Gruyter. (Original
work published 1916)
[SBS75] Smith, B.L., Brown, B.L., Strong, W.J., Rencher, A.C. (1975). Effects
of speech rate on personality perception. Language and Speech, 18(2), 145-
152. Retrieved 2014, July 10 from http://las.sagepub.com/content/18/2/
145.full.pdf
[SBP83] Street, R.L., Brady, R.M., Putman, W.B. (1983). The influence
of speech rate stereotypes and rate similarity on listeners' evaluations
of speakers. Journal of Language and Social Psychology, 2(1), 37-56.
doi:10.1177/0261927X8300200103
[SBW01] Scherer, K.R., Banse, R., Wallbott, H.G. (2001). Emotion inferences
from vocal expression correlate across languages and cultures. Journal of
cross-cultural psychology, 32(1), 76-92. doi: 10.1177/0022022101032001009
[Sch86] Scherer, K.R. (1986). Vocal affect expression: a review and a
model for future research. Psychological Bulletin, 99(2), 143-165. Retrieved 2014, July from http://www.affective-sciences.org/system/files/biblio/
1986 Scherer PsyBull.pdf
[Sch03] Scherer, K.R. (2003). Vocal communication of emotion: A re-
view of research paradigms. Speech Communication, 40(1), 227-256.
doi:10.1016/S0167-6393(02)00084-5
[Schi11] Schiel, F. (2011). Perception of alcoholic intoxication in speech. In
Proceedings of the 12th Annual Conference of the International Speech
Communication Association (Interspeech 2011), Florence, Italy (3281-
3284). Retrieved 2014, July 10 from http://www.phonetik.uni-muenchen.
de/forschung/publikationen/Schiel-IS2011.pdf
[Schw12] Schweitzer, K. (2012). Frequency effects on pitch accents: Towards
an exemplar-theoretic approach to intonation (Doctoral Dissertation). Uni-
versity of Stuttgart, Stuttgart. Retrieved 2014, July 10 from http://elib.
uni-stuttgart.de/opus/volltexte/2013/7997/pdf/Schweitzer2012.pdf
[SF97] Sancier, M.L., Fowler, C.A. (1997). Gestural drift of a bilingual speaker
of Brazilian Portuguese and English. Journal of Phonetics, 25(4), 421-436.
doi:10.1006/jpho.1997.0051
[SG82] Street, R.L., Giles, H. (1982). Speech Accommodation Theory: A so-
cial cognitive approach to language use and speech behavior. In M. Roloff,
C. Berger (Eds.), Social cognition and communication. Beverly Hills: Sage.
[SGLP01] Shepard, C.A., Giles, H., Le Poire, B.A. (2001). Communication Ac-
commodation Theory. In W.P. Robinson, H. Giles (Eds.), The New Hand-
book of Language and Social Psychology (33-56). Chichester: Wiley.
[SH82] Sorensen, D., Horii, Y. (1982). Cigarette smoking and voice funda-
mental frequency. Journal of communication disorders, 15 (2), 135-144.
doi:10.1016/0021-9924(82)90027-2
[SL12] Schweitzer, A., Lewandowski, N. (2012). Accommodation of backchan-
nels in spontaneous speech (Abstract). Book of Abstracts of the In-
ternational Symposium on Imitation and Convergence in Speech, Aix-
en-Provence, France, 2012, September 3-5. Retrieved 2014, July
9 from http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/
docs/SchweitzerLewandowski2012.pdf
[SL13] Schweitzer, A., Lewandowski, N. (2013). Convergence of articulation rate in spontaneous speech. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), Lyon, France (525-529). Retrieved 2014, July 10 from http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/docs/SchweitzerLewandowski2013.pdf
[SLD14] Schweitzer, A., Lewandowski, N., Dogil, G. (2014). Ad-
vancing corpus-based analyses of spontaneous speech: Switch to
GECO! In Proceedings of the 14th conference on Laboratory Phonol-
ogy (LabPhon), Tokyo, Japan (1 page). Retrieved 2014, July
10 from http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/
docs/SchweitzerLewandowskiDogil2014.pdf
[SMA+82] Streeter, L.A., Macdonald, N.H., Apple, W., Krauss, R.M., Ga-
lotti, K.M. (1982). Acoustic and perceptual indicators of emotional
stress. Journal of the Acoustical Society of America, 73(4), 1354-1360.
doi:10.1121/1.389239
[SSL94] Stemple, J.C., Stanley, J., Lee, L. (1994). Objective measures of voice
production in normal subjects following prolonged voice use. Journal of
voice, 9(2), 127-133. doi:10.1016/S0892-1997(05)80245-0
[TE95] Traunmüller, H., Eriksson, A. (1995). The frequency range of the voice fundamental in the speech of male and female adults. Retrieved 2014, July 8 from http://www2.ling.su.se/staff/hartmut/f0 m&f.pdf
[TH80] Todt, E.H., Howell, R.J. (1980). Vocal cues as indices of
schizophrenia. Journal of Speech and Hearing Research, 23(3), 517-526.
doi:10.1044/jshr.2303.517
[Tri60] Triandis, H.C. (1960). Cognitive similarity and communication in a
dyad. Human Relations, 13(2), 175-183. doi:10.1177/001872676001300206
[VLKE85] Van Lancker, D., Kreiman, J., Emmorey, K. (1985). Familiar voice recognition: Patterns and parameters. Part I: Recognition of backward voices. Journal of Phonetics, 13(1), 19-38. Retrieved 2014, July 10 from http://www.surgery.medsch.ucla.edu/glottalaffairs/papers/van%20lancker%20-%20kreiman%20-%20emmorey%201985.pdf
[Web70] Webb, J.T. (1970). Interview synchrony. In A.W. Siegman, B. Pope (Eds.), Studies in dyadic communication: Proceedings of research conference on interview (115-133). New York: Pergamon.
[WJE+02] Wittels, P., Johannes, B., Enne, R., Kirsch, K., Gunga, H.C. (2002).
Voice monitoring to measure emotional load during short-term stress. Eu-
ropean Journal of Applied Physiology, 87(3), 278-282. doi:10.1007/s00421-
002-0625-1
[WM03] Welham, N.V., Maclagan, M.A. (2003). Vocal fatigue: Current knowledge and future directions. Journal of Voice, 17(1), 21-30. doi:10.1016/S0892-1997(03)00033-X
[WJM+04] Wallis, L., Jackson-Menaldi, C., Holland, W., Giraldo, A. (2004). Vocal fold nodule vs. vocal fold polyp: Answer from surgical pathologist and voice pathologist point of view. Journal of Voice, 18(1), 125-129. doi:10.1016/j.jvoice.2003.07.003
[WS72] Williams, C.E., Stevens, K.N. (1972). Emotions and speech: Some
acoustical correlates. Journal of the Acoustical Society of America, 52(4),
1238-1250. doi:10.1121/1.1913238
[YWB82] Yumoto, E., Wilbur, W.J., Baer, T. (1982). Harmonics-to-noise ratio as an index of the degree of hoarseness. Journal of the Acoustical Society of America, 71(6), 1544-1550. Retrieved 2014, July 10 from http://web.haskins.yale.edu/Reprints/HL0373.pdf
[YSO84] Yumoto, E., Sasaki, Y., Okamura, H. (1984). Harmonics-to-noise ratio and psychophysical measurement of the degree of hoarseness. Journal of Speech and Hearing Research, 27(1), 2-6. Retrieved 2014, July 10 from http://web.haskins.yale.edu/Reprints/HL0373.pdf
A Outliers - Individual speakers
Speaker & part of dialogue | Speakers in dialogue | F0 | F0 SD | F0 Min | F0 Max | Jitt | Shim | HNR
A-beg A & F ∗16 ◦3, ◦9,◦13
◦3
A-end A & F ∗4, ◦7 ∗14
F-beg A & F ◦2, ◦10 ◦14 ◦14 ◦4
F-end A & F ◦1 ∗9,◦1, ◦2,◦12,◦15,◦20
C-beg C & F ◦4 ∗12 ◦ 12
C-end C & F ∗2,∗21,∗23,◦7, ◦19
◦13 ◦8, ◦26
F-beg C & F ∗9, ◦1 ◦8 ◦21 ◦1,◦18,◦22
◦12
F-end C & F ∗16,◦2,◦17,◦20
◦4 ∗10,∗14,∗16,◦12
◦16 ◦11
D-beg D & F ◦15 ∗23, ◦5 ◦7
D-end D & F
F-beg D & F ◦4, ◦3 ◦16
F-end D & F ◦2, ◦11 ◦3,◦11,◦17
◦4, ◦18
H-beg H & F ∗6 ◦6 ◦3, ◦6 ◦18 ◦4, ◦6 ∗6
H-end H & F ∗4, ∗13 ∗13, ◦3 ◦13
F-beg H & F ◦9 ◦7 ◦7, ◦10 ◦4, ◦15 ◦4
F-end H & F ∗15,◦5, ◦11
◦10 ◦7, ◦10
J-beg J & F ◦3, ◦4,◦10
◦2, ◦15 ◦5, ◦8
J-end J & F ∗11 ◦16
F-beg J & F ◦1 ∗2, ◦1,◦19,◦20,◦22
◦1, ◦3 ◦3, ◦18 ∗16 ◦7
F-end J& F ◦14 ◦14 ∗14,◦1, ◦2
K-beg K & F ◦1 ◦1 ◦10,◦13
◦2
K-end K & F ◦21
F-beg K & F ◦8 ◦1, ◦5,◦7, ◦8
F-end K & F ◦15, 20 ◦8 ◦5
Outliers for the beginnings and ends for the individual speakers of each dialogue, identified with the help of boxplots generated with SPSS [Inc12]. Outliers marked by ∗ are extreme values, defined as scores that are more than 3 box lengths away from the upper or lower hinge of the box [JL13, p. 242]; these are discarded from the analyses. Outliers marked by ◦, defined as scores between 1.5 and 3 box lengths away from the upper or lower hinge of the box [JL13, p. 242], are included in the analyses. The numbers next to the outlier marks identify the corresponding samples.
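The boxplot criterion described above can be illustrated in code. The following Python sketch (not part of the original analyses, which were run in SPSS; the data are hypothetical) approximates the hinges by the quartiles and classifies each sample as an outlier (between 1.5 and 3 box lengths from a hinge) or an extreme value (more than 3 box lengths from a hinge):

```python
import numpy as np

def classify_outliers(samples):
    """Classify samples by the boxplot criterion used in the tables:
    values 1.5-3 box lengths (IQRs) beyond a hinge are outliers (kept),
    values more than 3 box lengths beyond a hinge are extreme (discarded)."""
    q1, q3 = np.percentile(samples, [25, 75])  # lower and upper hinge (approx.)
    iqr = q3 - q1                              # box length
    outliers, extremes = [], []
    for i, x in enumerate(samples, start=1):   # 1-based sample numbers
        if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
            extremes.append(i)                 # marked * in the tables
        elif x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
            outliers.append(i)                 # marked ° in the tables
    return outliers, extremes

# Hypothetical F0 samples (Hz): sample 5 is an outlier, sample 6 extreme
samples = [200, 205, 198, 202, 212, 260, 201, 199, 203, 200]
print(classify_outliers(samples))  # → ([5], [6])
```

Note that SPSS computes its hinges slightly differently from NumPy's default quartiles, so borderline samples may be classified differently.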
B Outliers - speaker F
Speaker & part of dialogue | F0 | F0 SD | F0 Min | F0 Max | Jitt | Shim | HNR
F-beg ◦52,◦125
◦54, ◦55 ◦2, ◦10,◦38,◦45,◦52,◦54,◦55,◦67, ◦68
∗35,◦76,◦85, ◦93
◦35
F-end ◦38,◦45,◦54,◦129
◦26,◦104,◦129
∗46,∗54,∗60,∗129,◦9, ◦45
◦75,◦77,◦85,◦101,◦115
◦75,◦101
◦75,◦101
Outliers for the sample collections of the beginning and end of speaker F in all six dialogues, identified with the help of boxplots generated with SPSS [Inc12]. Outliers marked by ∗ are extreme values, defined as scores that are more than 3 box lengths away from the upper or lower hinge of the box [JL13, p. 242]; these are discarded from the analyses. Outliers marked by ◦, defined as scores between 1.5 and 3 box lengths away from the upper or lower hinge of the box [JL13, p. 242], are included in the analyses. The numbers next to the outlier marks identify the corresponding samples.
C Results of Analysis 1 - Individual speakers
C.1 Dialogue A-F
Speaker A
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 198.8632 24 11.60327 2.36851
                end 202.67775 24 13.600890 2.776270
Pair 2 F0 SD    beginning 40.9575 24 13.74137 2.80495
                end 37.54388 24 14.190584 2.896641
Pair 3 F0 Min   beginning 97.5000 23 5.72999 1.19479
                end 99.30291 23 19.203485 4.004203
Pair 4 F0 Max   beginning 305.5100 23 52.01498 10.84587
                end 306.25752 23 55.679642 11.610008
Pair 5 Jitter   beginning 2.5478 23 0.58423 0.12182
                end 2.31452 23 0.378722 0.078969
Pair 6 Shimmer  beginning 9.5035 24 1.45664 0.29734
                end 8.70875 24 1.415823 0.289004
Pair 7 HNR      beginning 15.7147 24 1.72543 0.35220
                end 16.78892 24 1.438183 0.293568

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker A from the beginnings and ends of the dialogue with speaker F.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean -3.814542 19.328867 3.945488 -0.967 23 0.344
Pair 2 F0 SD 3.413625 19.856913 4.053275 0.842 23 0.408
Pair 3 F0 Min -1.802957 20.017503 4.173938 -0.432 22 0.670
Pair 4 F0 Max 2.112417 87.497135 17.860278 0.118 23 0.907
Pair 5 Jitter 0.233261 0.624782 0.130276 1.791 22 0.087
Pair 6 Shimmer 0.794750 2.020564 0.412446 1.927 23 0.066
Pair 7 HNR -1.074208 2.213673 0.451864 -2.377 23 0.026

Results of the paired t-tests, comparing samples of speaker A from the beginning and the end. The change in HNR is significant; shimmer shows a tendency toward significance.
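The comparison reported in these tables is a two-tailed paired t-test over sample pairs from the beginning and the end of a dialogue. A minimal sketch of such a test in Python (using scipy rather than the SPSS procedure of the thesis; the HNR values below are hypothetical, not the thesis data):

```python
from scipy import stats
import numpy as np

# Hypothetical HNR values (dB) for the same speaker at the beginning
# and the end of a dialogue -- paired observations, same sample order.
beginning = np.array([15.2, 16.1, 14.8, 15.9, 16.4, 15.1, 15.7, 16.0])
end       = np.array([16.8, 17.0, 15.9, 16.5, 17.2, 16.1, 16.6, 17.1])

# Two-tailed paired t-test, as reported in the tables (t, df, p)
t, p = stats.ttest_rel(beginning, end)
print(f"t = {t:.3f}, df = {len(beginning) - 1}, p = {p:.3f}")
```

A negative t here means the values rose from beginning to end, mirroring the sign convention of the difference columns (beginning minus end) in the tables above.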
Speaker F
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 205.4800 22 14.22599 3.03299
                end 204.61132 22 13.096973 2.792284
Pair 2 F0 SD    beginning 27.2404 22 12.59454 2.68516
                end 16.94927 22 6.143384 1.309774
Pair 3 F0 Min   beginning 129.0219 22 29.69783 6.33160
                end 150.67495 22 29.876391 6.369668
Pair 4 F0 Max   beginning 299.2405 21 49.68683 10.84255
                end 266.68143 21 22.423106 4.893123
Pair 5 Jitter   beginning 1.9809 22 0.45973 0.09802
                end 1.94086 22 0.440796 0.093978
Pair 6 Shimmer  beginning 8.5525 22 1.52587 0.32532
                end 8.20800 22 1.314742 0.280304
Pair 7 HNR      beginning 18.0233 22 2.05151 0.43738
                end 18.39323 22 1.868531 0.398372

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker F from the beginnings and ends of the dialogue with speaker A.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 0.868682 21.487454 4.581141 0.190 21 0.851
Pair 2 F0 SD 10.291136 14.985259 3.194868 3.221 21 0.004
Pair 3 F0 Min -21.653045 40.206783 8.572115 -2.526 21 0.020
Pair 4 F0 Max 32.559095 49.352425 10.769582 3.023 20 0.007
Pair 5 Jitter 0.040045 0.587328 0.125219 0.320 21 0.752
Pair 6 Shimmer 0.344455 2.167322 0.462075 0.745 21 0.464
Pair 7 HNR -0.369909 2.474797 0.527629 -0.701 21 0.491

Results of the paired t-tests, comparing samples of speaker F from the beginning and the end within the dialogue with speaker A. Significant parameters are F0 SD, F0 Min and F0 Max.
Speaker | F0 Mean | F0 SD | F0 Min | F0 Max | Jitter | Shimmer | HNR
A | - | - | - | - | - | (∗) | ∗
F | - | ∗ | ∗ | ∗ | - | - | -

Significant changes of voice correlates for the speakers A and F in their dialogue. Significant values are marked by ∗, tendencies are marked by (∗).
C.2 Dialogue C-F
Speaker C
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 216.76824 29 16.455194 3.055653
                end 218.16552 29 15.574690 2.892147
Pair 2 F0 SD    beginning 33.42803 29 11.400413 2.117004
                end 32.05234 29 10.207541 1.895493
Pair 3 F0 Min   beginning 125.65093 29 30.399702 5.645083
                end 113.03931 26 27.499529 5.393101
Pair 4 F0 Max   beginning 321.62786 29 37.299171 6.926282
                end 319.56734 29 33.092452 6.145114
Pair 5 Jitter   beginning 2.26893 28 0.419790 0.079333
                end 2.46711 28 0.827996 0.156476
Pair 6 Shimmer  beginning 7.676710 29 1.395318 0.259104
                end 8.37276 29 2.160529 0.401200
Pair 7 HNR      beginning 19.11282 29 2.104718 0.397754
                end 18.95059 29 3.150715 0.585073

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker C from the beginnings and ends of the dialogue with speaker F.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean -1.397276 22.832798 4.239944 -0.330 28 0.744
Pair 2 F0 SD 1.375690 15.269180 2.835416 0.485 28 0.631
Pair 3 F0 Min 12.332654 45.460089 8.915457 1.383 25 0.179
Pair 4 F0 Max 2.060517 48.476788 9.001914 0.229 28 0.821
Pair 5 Jitter -0.198179 0.981307 0.185450 -1.069 27 0.295
Pair 6 Shimmer -0.696655 2.591666 0.481260 -1.448 28 0.159
Pair 7 HNR 0.034393 3.961333 0.748622 0.046 27 0.964

Results of the paired t-tests, comparing samples of speaker C from the beginning and the end. None of the parameters is significant.
Speaker F
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 203.54158 19 16.156682 3.706597
                end 194.64532 19 9.858687 2.261738
Pair 2 F0 SD    beginning 21.15971 21 7.641377 1.667485
                end 21.98543 21 8.624827 1.882092
Pair 3 F0 Min   beginning 131.64883 18 40.341146 9.508499
                end 101.65800 18 15.192292 3.580858
Pair 4 F0 Max   beginning 282.84862 21 40.089906 8.748335
                end 266.61476 21 28.084610 6.128564
Pair 5 Jitter   beginning 1.97938 21 0.553131 0.120703
                end 2.21362 21 0.450859 0.098386
Pair 6 Shimmer  beginning 7.50048 21 1.780894 0.388623
                end 7.31910 21 1.224337 0.267172
Pair 7 HNR      beginning 20.26786 21 2.586816 0.564490
                end 20.03333 21 1.729479 0.377403

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker F from the beginnings and ends of the dialogue with speaker C.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 8.896263 19.409138 4.452762 1.998 18 0.061
Pair 2 F0 SD -0.825714 11.686230 2.550144 -0.324 20 0.749
Pair 3 F0 Min 29.990833 37.749193 8.897570 3.371 17 0.004
Pair 4 F0 Max 16.233857 52.550037 11.467358 1.416 20 0.172
Pair 5 Jitter -0.234238 0.782830 0.170828 -1.371 20 0.186
Pair 6 Shimmer 0.181381 2.025149 0.441924 0.410 20 0.686
Pair 7 HNR 0.234524 3.193329 0.696842 0.337 20 0.740

Results of the paired t-tests, comparing samples of speaker F from the beginning and the end within the dialogue with speaker C. F0 Min changed significantly; changes in F0 Mean showed a tendency toward significance.
Summary
Speaker | F0 Mean | F0 SD | F0 Min | F0 Max | Jitter | Shimmer | HNR
C | - | - | - | - | - | - | -
F | (∗) | - | ∗ | - | - | - | -

Significant changes of voice correlates for the speakers C and F in their dialogue. Significant values are marked by ∗, tendencies are marked by (∗).
C.3 Dialogue H-F
Speaker H
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 197.06055 20 6.948634 1.553762
                end 194.00780 20 8.411587 1.880888
Pair 2 F0 SD    beginning 24.68400 21 13.229052 2.886816
                end 19.05981 21 6.245378 1.362853
Pair 3 F0 Min   beginning 138.45991 22 32.500408 6.929110
                end 137.51386 22 33.081591 7.053019
Pair 4 F0 Max   beginning 287.31795 22 40.004453 8.528978
                end 273.86282 22 29.113741 6.207070
Pair 5 Jitter   beginning 2.20095 22 0.587641 0.125286
                end 2.41505 22 0.707500 0.150840
Pair 6 Shimmer  beginning 6.99855 22 1.291574 0.275364
                end 7.73232 22 1.708574 0.364269
Pair 7 HNR      beginning 19.15148 21 1.655982 0.361365
                end 18.83214 21 1.524352 0.326410

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker H from the beginnings and ends of the dialogue with speaker F.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 3.052750 10.513544 2.350900 1.299 19 0.210
Pair 2 F0 SD 5.624190 11.547481 2.519867 2.232 20 0.037
Pair 3 F0 Min 0.946045 53.930192 11.497956 0.082 21 0.935
Pair 4 F0 Max 13.455136 43.213473 9.213144 1.460 21 0.159
Pair 5 Jitter -0.214091 1.078386 0.229913 -0.931 21 0.362
Pair 6 Shimmer -0.733773 2.405927 0.512945 -1.431 21 0.167
Pair 7 HNR 0.319333 2.311162 0.504337 0.633 20 0.534

Results of the paired t-tests, comparing samples of speaker H from the beginning and the end. The changes in F0 SD are statistically significant.
Speaker F
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 200.52427 22 14.815574 3.158691
                end 195.57645 22 11.038573 2.353432
Pair 2 F0 SD    beginning 15.51300 22 3.901668 0.831838
                end 14.27959 22 2.294422 0.489172
Pair 3 F0 Min   beginning 155.35519 21 26.608971 5.806554
                end 160.54162 21 11.656008 2.543549
Pair 4 F0 Max   beginning 258.92655 22 19.105151 4.073232
                end 257.80345 22 14.800031 3.155377
Pair 5 Jitter   beginning 2.32586 22 0.623597 0.132951
                end 2.28414 22 0.699479 0.149129
Pair 6 Shimmer  beginning 7.77027 22 1.504202 0.320697
                end 7.31000 22 2.131832 0.454508
Pair 7 HNR      beginning 19.29195 22 2.195504 0.468083
                end 19.49045 22 2.408319 0.513455

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker F from the beginnings and ends of the dialogue with speaker H.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 4.947818 16.404760 3.497507 1.415 21 0.172
Pair 2 F0 SD 1.233409 4.264649 0.909226 1.357 21 0.189
Pair 3 F0 Min -5.186429 31.237738 6.816633 -0.761 20 0.456
Pair 4 F0 Max 1.123091 22.103912 4.712570 0.238 21 0.814
Pair 5 Jitter 0.041727 0.830614 0.177088 0.236 21 0.816
Pair 6 Shimmer 0.460273 2.489135 0.530685 0.867 21 0.396
Pair 7 HNR -0.198500 2.754264 0.587211 -0.338 21 0.739

Results of the paired t-tests, comparing samples of speaker F from the beginning and the end within the dialogue with speaker H. No significant changes were found.
Summary
Speaker | F0 Mean | F0 SD | F0 Min | F0 Max | Jitter | Shimmer | HNR
H | - | ∗ | - | - | - | - | -
F | - | - | - | - | - | - | -

Significant changes of voice correlates for the speakers H and F in their dialogue. Significant values are marked by ∗, tendencies are marked by (∗).
C.4 Dialogue J-F
Speaker J
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 229.58946 24 16.292234 3.325638
                end 219.89042 24 15.174251 3.097431
Pair 2 F0 SD    beginning 35.57646 24 8.541754 1.743578
                end 29.15771 24 10.220472 2.086245
Pair 3 F0 Min   beginning 121.84750 24 30.776830 6.282294
                end 147.47408 24 28.601457 5.838248
Pair 4 F0 Max   beginning 352.53571 24 30.629027 6.252124
                end 324.47083 24 42.039957 8.581370
Pair 5 Jitter   beginning 2.31348 23 0.513965 0.107169
                end 2.17283 23 0.292467 0.060984
Pair 6 Shimmer  beginning 9.29271 24 1.309628 0.267327
                end 9.51163 24 1.721053 0.351308
Pair 7 HNR      beginning 16.30183 24 1.567090 0.319881
                end 16.72579 24 2.049451 0.418343

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker J from the beginnings and ends of the dialogue with speaker F.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 9.699042 23.403858 4.777292 2.030 23 0.054
Pair 2 F0 SD 6.418750 15.408817 3.145312 2.041 23 0.053
Pair 3 F0 Min -25.626583 42.231507 8.620470 -2.973 23 0.007
Pair 4 F0 Max 28.064875 60.295057 12.307677 2.280 23 0.032
Pair 5 Jitter 0.140652 0.573483 0.119579 1.176 22 0.252
Pair 6 Shimmer -0.218917 2.267400 0.462831 -0.473 23 0.641
Pair 7 HNR -0.423958 2.833680 0.578423 -0.733 23 0.471

Results of the paired t-tests, comparing samples of speaker J from the beginning and the end. Changes in F0 Min and F0 Max were statistically significant; changes in F0 Mean and F0 SD showed a tendency toward significance.
Speaker F
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 212.21375 20 12.542785 2.804652
                end 210.07010 20 18.061822 4.038746
Pair 2 F0 SD    beginning 18.57710 20 4.973111 1.112021
                end 23.24375 20 10.579260 2.365595
Pair 3 F0 Min   beginning 163.30184 19 23.999646 5.505896
                end 141.98226 19 39.002431 8.947771
Pair 4 F0 Max   beginning 279.9396 20 25.73868 5.75534
                end 273.04720 20 30.686576 6.861727
Pair 5 Jitter   beginning 1.87755 20 0.529078 0.118305
                end 1.92160 20 0.579673 0.129619
Pair 6 Shimmer  beginning 6.45684 19 1.037244 0.237960
                end 7.15842 19 2.204843 0.505826
Pair 7 HNR      beginning 20.27668 19 2.183106 0.500839
                end 20.45516 19 2.247926 0.515710

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker F from the beginnings and ends of the dialogue with speaker J.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 2.143650 23.914759 5.347503 0.401 19 0.693
Pair 2 F0 SD -4.666650 13.009558 2.909026 -1.604 19 0.125
Pair 3 F0 Min 21.319579 45.457736 10.428720 2.044 18 0.056
Pair 4 F0 Max 6.892350 44.333690 9.913314 0.695 19 0.495
Pair 5 Jitter -0.044050 0.835138 0.186743 -0.236 19 0.816
Pair 6 Shimmer -0.701579 2.775553 0.636756 -1.102 18 0.285
Pair 7 HNR -0.178474 3.163230 0.725695 -0.246 18 0.809

Results of the paired t-tests, comparing samples of speaker F from the beginning and the end within the dialogue with speaker J. No significant changes were found; F0 Min showed a tendency toward significance.
Summary
Speaker | F0 Mean | F0 SD | F0 Min | F0 Max | Jitter | Shimmer | HNR
J | (∗) | (∗) | ∗ | ∗ | - | - | -
F | - | - | (∗) | - | - | - | -

Significant changes of voice correlates for the speakers J and F in their dialogue. Significant values are marked by ∗, tendencies are marked by (∗).
C.5 Dialogue K-F
Speaker K
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 192.37771 21 16.378680 3.574121
                end 196.43390 21 27.215643 5.938940
Pair 2 F0 SD    beginning 25.92719 21 8.759352 1.911447
                end 27.96395 21 9.911624 2.162894
Pair 3 F0 Min   beginning 135.80510 21 20.086390 4.383210
                end 125.86948 21 21.441889 4.679004
Pair 4 F0 Max   beginning 281.20576 21 47.503958 10.366213
                end 296.79795 21 48.854073 10.660833
Pair 5 Jitter   beginning 2.29376 21 0.518357 0.113115
                end 2.53286 21 0.538768 0.117569
Pair 6 Shimmer  beginning 7.14919 21 1.407683 0.307182
                end 7.48424 21 1.097045 0.239395
Pair 7 HNR      beginning 18.56224 21 1.911997 0.417232
                end 17.84895 21 1.584266 0.345715

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker K from the beginnings and ends of the dialogue with speaker F.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean -4.056190 35.836419 7.820148 -0.519 20 0.610
Pair 2 F0 SD -2.036762 14.710574 3.210110 -0.634 20 0.533
Pair 3 F0 Min 9.935619 31.118958 6.790713 1.463 20 0.159
Pair 4 F0 Max -15.592190 67.937001 14.825069 -1.052 20 0.305
Pair 5 Jitter -0.239095 0.730475 0.159403 -1.500 20 0.149
Pair 6 Shimmer -0.335048 1.755003 0.382973 -0.875 20 0.392
Pair 7 HNR 0.713286 2.170863 0.473721 1.506 20 0.148

Results of the paired t-tests, comparing samples of speaker K from the beginning and the end. No significant changes were found.
Speaker F
Parameter Position Mean N SD St. Error Mean
Pair 1 F0 Mean  beginning 198.34315 20 19.254552 4.305449
                end 193.27120 20 14.046104 3.140804
Pair 2 F0 SD    beginning 24.10635 20 6.962382 1.556836
                end 18.62115 20 6.889519 1.540543
Pair 3 F0 Min   beginning 124.67215 20 34.129144 7.631509
                end 140.08795 20 26.756725 5.982985
Pair 4 F0 Max   beginning 280.48785 20 35.513617 7.941086
                end 262.12050 20 19.617772 4.386667
Pair 5 Jitter   beginning 2.23680 20 0.409762 0.091626
                end 1.97180 20 0.494078 0.110479
Pair 6 Shimmer  beginning 7.29220 20 1.118385 0.250079
                end 6.88630 20 0.863303 0.193040
Pair 7 HNR      beginning 19.18380 20 1.560370 0.348909
                end 20.46240 20 1.818307 0.406586

Mean, standard deviation (SD) and standard error mean for the comparison of the samples of speaker F from the beginnings and ends of the dialogue with speaker K.

Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)

Pair 1 F0 Mean 5.071950 25.386650 5.676628 0.893 19 0.383
Pair 2 F0 SD 5.485200 9.063702 2.026705 2.706 19 0.014
Pair 3 F0 Min -15.415800 39.907564 8.923603 -1.728 19 0.100
Pair 4 F0 Max 18.367350 49.718813 11.117465 1.652 19 0.115
Pair 5 Jitter 0.265000 0.683095 0.152745 1.735 19 0.099
Pair 6 Shimmer 0.405900 1.329755 0.297342 1.365 19 0.188
Pair 7 HNR -1.278600 2.649123 0.592362 -2.158 19 0.044

Results of the paired t-tests, comparing samples of speaker F from the beginning and the end within the dialogue with speaker K. The changes in F0 SD and HNR are statistically significant.
Summary
Speaker | F0 Mean | F0 SD | F0 Min | F0 Max | Jitter | Shimmer | HNR
K | - | - | - | - | - | - | -
F | - | ∗ | - | - | - | - | ∗

Significant changes of voice correlates for the speakers K and F in their dialogue. Significant values are marked by ∗, tendencies are marked by (∗).
D Results of Analysis 3 - Position in dialogue
D.1 Dialogue A-F
F0 Mean
Position Speaker Mean N SD Std. Error Mean
beginning A 198.8632 24 11.60327 2.36851
beginning F 205.6947 23 13.93698 2.90606
end A 202.67775 24 13.600890 2.776270
end F 204.61132 22 13.096973 2.792284

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the mean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 3828384.652 1 3828384.652 22024.892 0.000
Position 45.321 1 45.321 0.261 0.611
Error 15817.694 91 173.821

Results of the ANOVA for the mean fundamental frequency (F0 Mean) for the speakers A and F at the beginning and at the end of the dialogue.
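The ANOVAs in this appendix test whether the position in the dialogue (beginning vs. end) explains variance in the pooled samples of both speakers. A one-way ANOVA of this kind can be sketched in Python (using scipy rather than the SPSS procedure of the thesis; the values below are hypothetical, not the thesis data):

```python
from scipy import stats
import numpy as np

# Hypothetical pooled F0 means (Hz) of both speakers, grouped by position
beginning = np.array([198.9, 205.7, 201.2, 199.5, 204.1, 202.3])
end       = np.array([202.7, 204.6, 203.9, 201.1, 205.2, 203.4])

# One-way ANOVA with Position (beginning vs. end) as the single factor
F, p = stats.f_oneway(beginning, end)
print(f"F = {F:.3f}, p = {p:.3f}")
```

With only two levels of the factor, this one-way ANOVA is equivalent to an unpaired t-test (F equals t squared), which is why the tables report a single F value per parameter.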
F0 SD
Position Speaker Mean N SD Std. Error Mean
beginning A 40.9575 24 13.74137 2.80495
beginning F 27.3065 23 12.30905 2.56661
end A 37.54388 24 14.190584 2.896641
end F 16.94927 22 6.143384 1.309774

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 89280.516 1 89280.516 403.055 0.000
Position 1007.422 1 1007.422 4.548 0.036
Error 20157.385 91 221.510

Results of the ANOVA for the standard deviation of the fundamental frequency (F0 SD) for the speakers A and F at the beginning and at the end of the dialogue.
F0 Min
Position Speaker Mean N SD Std. Error Mean
beginning A 97.5000 23 5.72999 1.19479
beginning F 130.2837 23 29.63934 6.18023
end A 98.86592 24 18.902999 3.858559
end F 150.67495 22 29.876391 6.369668

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the minimum of the fundamental frequency (F0 Min) in Hertz (Hz) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 1297736.842 1 1297736.842 1295.122 0.000
Position 2187.481 1 2187.481 2.183 0.143
Error 90181.691 90 1002.019

Results of the ANOVA for the minimal fundamental frequency (F0 Min) for the speakers A and F at the beginning and at the end of the dialogue.
F0 Max
Position Speaker Mean N SD Std. Error Mean
beginning A 305.7183 24 50.88190 10.38622
beginning F 297.3401 23 48.56891 10.12732
end A 306.25752 23 55.679642 11.610008
end F 266.68143 21 22.423106 4.893123

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the maximum of the fundamental frequency (F0 Max) in Hertz (Hz) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 7883534.879 1 7883534.879 3377.766 0.000
Position 4614.282 1 4614.282 1.977 0.163
Error 207721.482 89 2333.949

Results of the ANOVA for the maximal fundamental frequency (F0 Max) for the speakers A and F at the beginning and at the end of the dialogue.
Jitter
Position Speaker Mean N SD Std. Error Mean
beginning A 2.5173 24 0.59054 0.12054
beginning F 1.9739 23 0.45043 0.09392
end A 2.31452 23 0.378722 0.078969
end F 1.94086 22 0.440796 0.093978

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitter in percent (%) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 441.683 1 441.683 1605.531 0.000
Position 0.329 1 0.329 1.194 0.277
Error 24.759 90 0.275

Results of the ANOVA for jitter for the speakers A and F at the beginning and at the end of the dialogue.
Shimmer
Position Speaker Mean N SD Std. Error Mean
beginning A 9.5035 23 1.51940 0.31682
beginning F 8.4913 24 1.45664 0.29734
end A 8.70875 24 1.415823 0.289004
end F 8.20800 22 1.314742 0.280304

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for shimmer in percent (%) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 7101.120 1 7101.120 3281.770 0.000
Position 6.751 1 6.751 3.120 0.081
Error 196.907 91 2.164

Results of the ANOVA for shimmer for the speakers A and F at the beginning and at the end of the dialogue.
HNR
Position Speaker Mean N SD Std. Error Mean
beginning A 15.7147 24 1.72543 0.35220
beginning F 18.0405 23 2.00603 0.41829
end A 16.78892 24 1.438183 0.293568
end F 18.39323 22 1.868531 0.398372

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the harmonics-to-noise ratio (HNR) of the speakers A and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 27524.410 1 27524.410 6751.830 0.000
Position 11.500 1 11.500 2.821 0.096
Error 370.969 91 4.077

Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers A and F at the beginning and at the end of the dialogue.
D.2 Dialogue D-F
F0 Mean
Position Speaker Mean N SD Std. Error Mean
beginning D 229.59724 25 12.807842 3.805483
beginning F 217.70765 20 20.719282 4.285673
end D 232.93950 22 15.574690 2.730641
end F 210.66168 22 20.719282 4.417366

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the mean of the fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers D and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 4243734.096 1 4243734.096 10705.279 0.000
Position 13.651 1 13.651 0.034 0.853
Error 33298.867 84 396.415

Results of the ANOVA for the mean fundamental frequency (F0 Mean) for the speakers D and F at the beginning and at the end of the dialogue.
F0 SD
Position Speaker Mean N SD Std. Error Mean
beginning D 20.42756 25 5.950503 1.190101
beginning F 39.25525 20 11.712071 2.618899
end D 28.94427 22 12.411122 2.646060
end F 22.28627 22 8.167158 1.741244

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers D and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 64157.271 1 64157.271 433.591 0.000
Position 249.983 1 249.983 1.689 0.197
Error 12429.256 84 147.967

Results of the ANOVA for the standard deviation of the fundamental frequency (F0 SD) for the speakers D and F at the beginning and at the end of the dialogue.
115
D Results of Analysis 3 - Position in dialogue
F0 Min
Position Speaker Mean N SD Std. Error Mean
beginning D 181.33683 24 18.383121 3.752439
beginning F 97.02540 20 6.125833 1.369778
end D 154.76977 22 50.631920 10.794762
end F 136.24964 22 37.642780 8.025468

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the minimum of the fundamental frequency (F0 Min) in Hertz (Hz) of the speakers D and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 1725662.578 1 1725662.578 860.341 0.000
Position 731.843 1 731.843 0.365 0.547
Error 166480.491 83 2005.789

Results of the ANOVA for the minimal fundamental frequency (F0 Min) for the speakers D and F at the beginning and at the end of the dialogue.
F0 Max
Position Speaker Mean N SD Std. Error Mean
beginning D 308.76296 25 33.107201 6.621440
beginning F 321.69335 20 60.458571 13.518947
end D 330.85964 22 31.889633 6.798893
end F 295.33745 22 58.697028 12.514248

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the maximum of the fundamental frequency (F0 Max) in Hertz (Hz) of the speakers D and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 8397015.871 1 8397015.871 3529.055 0.000
Position 25.107 1 25.107 0.011 0.918
Error 199869.191 84 2379.395

Results of the ANOVA for the maximal fundamental frequency (F0 Max) for the speakers D and F at the beginning and at the end of the dialogue.
116
D.2 Dialogue D-F
Jitter
Position Speaker Mean N SD Std. Error Mean
beginning D 1.90984 25 0.414307 0.082861
beginning F 1.84015 20 0.447757 0.100121
end D 1.85718 22 0.377235 0.080427
end F 1.90209 22 0.463344 0.098785

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitter in percent (%) of the speakers D and F at the beginning and the end of the dialogue.

Source | Type III Sum of Squares | df | Mean Square | F | Sig.

Intercept 305.859 1 305.859 1711.968 0.000
Position 0.004 1 0.004 0.022 0.883
Error 15.007 84 0.179

Results of the ANOVA for jitter for the speakers D and F at the beginning and at the end of the dialogue.
Shimmer
Position    Speaker    Mean       N     SD          Std. Error Mean
beginning   D          7.08792    25    1.194246    0.238849
beginning   F          7.18625    20    1.548080    0.346161
end         D          6.75695    22    0.938097    0.200003
end         F          6.72559    22    1.303650    0.277939

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for shimmer in percent (%) of the speakers D and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   4124.894                   1     4124.894       2636.424    0.000
Position    2.980                      1     2.980          1.905       0.171
Error       131.425                    84    1.565

Results of the ANOVA for shimmer for the speakers D and F at the beginning and at the end of the dialogue.
HNR
Position    Speaker    Mean        N     SD          Std. Error Mean
beginning   D          18.44952    25    1.683333    0.336667
beginning   F          20.02490    20    1.841146    0.411693
end         D          18.39091    22    1.756843    0.374560
end         F          20.40582    22    1.846650    0.393707

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the harmonics-to-noise ratio (HNR) in decibel (dB) of the speakers D and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   31987.091                  1     31987.091      8019.834    0.000
Position    0.990                      1     0.990          0.248       0.620
Error       335.034                    84    3.988

Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers D and F at the beginning and at the end of the dialogue.
D.3 Dialogue H-F
F0 Mean
Position    Speaker    Mean         N     SD           Std. Error Mean
beginning   H          196.24725    24     7.490674    1.529027
beginning   F          199.75152    23    14.941827    3.115586
end         H          193.90176    21     8.212989    1.792221
end         F          195.57645    22    11.038573    2.353432

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the mean of the fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F            Sig.
Intercept   3463309.973                1     3463309.973    29634.075    0.000
Position    230.452                    1     230.452        1.972        0.164
Error       10284.488                  88    116.869

Results of the ANOVA for the mean fundamental frequency (F0 Mean) for the speakers H and F at the beginning and at the end of the dialogue.
F0 SD
Position    Speaker    Mean        N     SD           Std. Error Mean
beginning   H          23.35976    25    12.573196    2.514639
beginning   F          15.53235    23     3.813092    0.795085
end         H          19.05981    21     6.245378    1.362853
end         F          14.27959    22     2.294422    0.489172

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F          Sig.
Intercept   29760.685                  1     29760.685      442.629    0.000
Position    203.453                    1     203.453        3.026      0.085
Error       5984.020                   89    67.236

Results of the ANOVA for the standard deviation of the fundamental frequency (F0 SD) for the speakers H and F at the beginning and at the end of the dialogue.
F0 Min
Position    Speaker    Mean         N     SD           Std. Error Mean
beginning   H          139.00796    25    32.293084    6.458617
beginning   F          154.93026    23    25.446447    5.305951
end         H          137.51386    22    33.081591    7.053019
end         F          160.54162    21    11.656008    2.543549

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the minimum of the fundamental frequency (F0 Min) in Hertz (Hz) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   1979162.971                1     1979162.971    2389.789    0.000
Position    102.187                    1     102.187        0.123       0.726
Error       73707.546                  89    828.175

Results of the ANOVA for the minimal fundamental frequency (F0 Min) for the speakers H and F at the beginning and at the end of the dialogue.
F0 Max
Position    Speaker    Mean         N     SD           Std. Error Mean
beginning   H          285.37520    25    38.816999    7.763400
beginning   F          259.05248    23    18.675661    3.894145
end         H          273.86282    22    29.113741    6.207070
end         F          257.80345    22    14.800031    3.155377

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the maximum of the fundamental frequency (F0 Max) in Hertz (Hz) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   6659341.869                1     6659341.869    7746.204    0.000
Position    1102.196                   1     1102.196       1.282       0.261
Error       77372.189                  90    859.691

Results of the ANOVA for the maximal fundamental frequency (F0 Max) for the speakers H and F at the beginning and at the end of the dialogue.
Jitter
Position    Speaker    Mean       N     SD          Std. Error Mean
beginning   H          2.24668    25    0.574052    0.114810
beginning   F          2.30130    23    0.620540    0.129392
end         H          2.41505    22    0.707500    0.150840
end         F          2.28414    22    0.699479    0.149129

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitter in percent (%) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   490.512                    1     490.512        1180.659    0.000
Position    0.135                      1     0.135          0.325       0.570
Error       37.391                     90    0.415

Results of the ANOVA for jitter for the speakers H and F at the beginning and at the end of the dialogue.
Shimmer
Position    Speaker    Mean       N     SD          Std. Error Mean
beginning   H          7.10152    25    1.295600    0.259120
beginning   F          7.78578    23    1.471499    0.306829
end         H          7.73232    22    1.708574    0.364269
end         F          7.31000    22    2.131832    0.454508

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for shimmer in percent (%) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   5131.221                   1     5131.221       1830.863    0.000
Position    0.193                      1     0.193          0.069       0.793
Error       252.236                    90    2.803

Results of the ANOVA for shimmer for the speakers H and F at the beginning and at the end of the dialogue.
HNR
Position    Speaker    Mean        N     SD          Std. Error Mean
beginning   H          19.06325    24    1.634153    0.333570
beginning   F          19.29700    23    2.145163    0.447297
end         H          18.90482    22    1.526171    0.325381
end         F          19.49045    22    2.408319    0.513455

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the harmonics-to-noise ratio (HNR) in decibel (dB) of the speakers H and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   33466.642                  1     33466.642      8817.808    0.000
Position    0.009                      1     0.009          0.002       0.961
Error       337.786                    89    3.795

Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers H and F at the beginning and at the end of the dialogue.
D.4 Dialogue J-F
F0 Mean
Position    Speaker    Mean      N     SD       Std. Error Mean
beginning   J          229.51    26    17.16    3.36
beginning   F          211.08    23    13.02    2.71
end         J          219.89    24    15.17    3.10
end         F          210.07    20    18.06    4.04

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the mean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F            Sig.
Intercept   4120925.946                1     4120925.946    13708.057    0.000
Position    1073.366                   1     1073.366       3.570        0.062
Error       25252.141                  84    300.621

Results of the ANOVA for the mean fundamental frequency (F0 Mean) for the speakers J and F at the beginning and at the end of the dialogue.
F0 SD
Position    Speaker    Mean     N     SD       Std. Error Mean
beginning   J          35.10    26     8.42    1.65
beginning   F          17.86    23     5.00    1.04
end         J          29.16    24    10.22    2.09
end         F          23.24    20    10.58    2.37

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F          Sig.
Intercept   64117.605                  1     64117.605      530.021    0.000
Position    61.035                     1     61.035         0.505      0.479
Error       10161.631                  84    120.972

Results of the ANOVA for the standard deviation of the fundamental frequency (F0 SD) for the speakers J and F at the beginning and at the end of the dialogue.
F0 Min
Position    Speaker    Mean      N     SD       Std. Error Mean
beginning   J          122.28    26    30.36    5.95
beginning   F          161.41    22    26.09    5.56
end         J          147.47    24    28.60    5.84
end         F          143.44    20    38.52    8.61

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the minimal fundamental frequency (F0 Min) in Hertz (Hz) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   1719477.594                1     1719477.594    1487.231    0.000
Position    1518.309                   1     1518.309       1.313       0.255
Error       97117.487                  84    1156.161

Results of the ANOVA for the minimal fundamental frequency (F0 Min) for the speakers J and F at the beginning and at the end of the dialogue.
F0 Max
Position    Speaker    Mean      N     SD       Std. Error Mean
beginning   J          349.95    26    30.90    6.06
beginning   F          276.54    23    26.07    5.44
end         J          324.47    24    42.04    8.58
end         F          273.05    20    30.69    6.86

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the maximal fundamental frequency (F0 Max) in Hertz (Hz) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   8328125.156                1     8328125.156    4027.269    0.000
Position    8902.044                   1     8902.044       4.305       0.041
Error       173706.416                 84    2067.934

Results of the ANOVA for the maximal fundamental frequency (F0 Max) for the speakers J and F at the beginning and at the end of the dialogue.
Jitter
Position    Speaker    Mean    N     SD      Std. Error Mean
beginning   J          2.30    26    0.51    0.10
beginning   F          1.84    23    0.50    0.10
end         J          2.17    23    0.29    0.06
end         F          1.92    20    0.58    0.13

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitter in percent (%) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   364.782                    1     364.782        1380.755    0.000
Position    0.021                      1     0.021          0.080       0.778
Error       21.928                     83    0.264

Results of the ANOVA for jitter for the speakers J and F at the beginning and at the end of the dialogue.
Shimmer
Position    Speaker    Mean    N     SD      Std. Error Mean
beginning   J          9.29    26    1.35    0.26
beginning   F          6.30    22    1.04    0.22
end         J          9.51    24    1.72    0.35
end         F          7.14    20    2.15    0.48

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for shimmer in percent (%) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   5825.449                   1     5825.449       1304.304    0.000
Position    1.944                      1     1.944          0.435       0.511
Error       370.705                    83    4.466

Results of the ANOVA for shimmer for the speakers J and F at the beginning and at the end of the dialogue.
HNR
Position    Speaker    Mean     N     SD      Std. Error Mean
beginning   J          16.46    26    1.90    0.37
beginning   F          20.58    23    2.25    0.47
end         J          16.73    24    2.05    0.42
end         F          20.46    19    2.25    0.52

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the harmonics-to-noise ratio (HNR) in decibel (dB) of the speakers J and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   28323.508                  1     28323.508      3352.753    0.000
Position    1.186                      1     1.186          0.140       0.709
Error       701.170                    83    8.448

Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers J and F at the beginning and at the end of the dialogue.
D.5 Dialogue K-F
F0 Mean
Position    Speaker    Mean      N     SD       Std. Error Mean
beginning   K          192.83    22    16.13    3.44
beginning   F          198.34    20    19.25    4.31
end         K          196.43    21    27.22    5.94
end         F          195.55    22    17.05    3.63

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the mean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   3255604.933                1     3255604.933    8007.169    0.000
Position    5.908                      1     5.908          0.015       0.904
Error       33746.660                  83    406.586

Results of the ANOVA for the mean fundamental frequency (F0 Mean) for the speakers K and F at the beginning and at the end of the dialogue.
F0 SD
Position    Speaker    Mean     N     SD      Std. Error Mean
beginning   K          25.95    22    8.55    1.82
beginning   F          24.11    20    6.96    1.56
end         K          27.96    21    9.91    2.16
end         F          19.92    22    9.57    2.04

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F          Sig.
Intercept   50843.983                  1     50843.983      596.503    0.000
Position    31.941                     1     31.941         0.375      0.542
Error       7074.648                   83    85.237

Results of the ANOVA for the standard deviation of the fundamental frequency (F0 SD) for the speakers K and F at the beginning and at the end of the dialogue.
F0 Min
Position    Speaker    Mean      N     SD       Std. Error Mean
beginning   K          136.09    22    19.65    4.19
beginning   F          124.67    20    34.13    7.63
end         K          125.87    21    21.44    4.68
end         F          142.12    22    26.30    5.61

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the minimal fundamental frequency (F0 Min) in Hertz (Hz) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   1490237.973                1     1490237.973    2126.529    0.000
Position    264.206                    1     264.206        0.377       0.541
Error       58165.103                  83    700.784

Results of the ANOVA for the minimal fundamental frequency (F0 Min) for the speakers K and F at the beginning and at the end of the dialogue.
F0 Max
Position    Speaker    Mean      N     SD       Std. Error Mean
beginning   K          280.25    22    46.58     9.93
beginning   F          280.49    20    35.51     7.94
end         K          296.80    21    48.85    10.66
end         F          268.95    22    38.98     8.31

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the maximal fundamental frequency (F0 Max) in Hertz (Hz) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   6732564.089                1     6732564.089    3548.234    0.000
Position    101.859                    1     101.859        0.054       0.817
Error       157487.593                 83    1897.441

Results of the ANOVA for the maximal fundamental frequency (F0 Max) for the speakers K and F at the beginning and at the end of the dialogue.
Jitter
Position    Speaker    Mean    N     SD      Std. Error Mean
beginning   K          2.28    22    0.51    0.11
beginning   F          2.24    20    0.41    0.09
end         K          2.53    21    0.54    0.12
end         F          2.02    22    0.53    0.11

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitter in percent (%) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   435.616                    1     435.616        1557.211    0.000
Position    0.002                      1     0.002          0.007       0.934
Error       23.218                     83    0.280

Results of the ANOVA for jitter for the speakers K and F at the beginning and at the end of the dialogue.
Shimmer
Position    Speaker    Mean    N     SD      Std. Error Mean
beginning   K          7.16    22    1.38    0.29
beginning   F          7.29    20    1.12    0.25
end         K          7.48    21    1.10    0.24
end         F          7.04    22    1.03    0.22

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for shimmer in percent (%) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   4456.157                   1     4456.157       3295.051    0.000
Position    0.021                      1     0.021          0.016       0.901
Error       112.247                    83    1.352

Results of the ANOVA for shimmer for the speakers K and F at the beginning and at the end of the dialogue.
HNR
Position    Speaker    Mean     N     SD      Std. Error Mean
beginning   K          18.49    22    1.90    0.40
beginning   F          19.18    20    1.56    0.35
end         K          17.85    21    1.58    0.35
end         F          20.30    22    1.81    0.39

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for the harmonics-to-noise ratio (HNR) in decibel (dB) of the speakers K and F at the beginning and the end of the dialogue.
Source      Type III Sum of Squares    df    Mean Square    F           Sig.
Intercept   30550.769                  1     30550.769      8160.997    0.000
Position    1.683                      1     1.683          0.450       0.504
Error       310.711                    83    3.744

Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers K and F at the beginning and at the end of the dialogue.