Accommodation of voice parameters in dialogue

Institut fur Maschinelle SprachverarbeitungUniversitat StuttgartPfaffenwaldring 5bD-70569 Stuttgart

Accommodation of voice parameters in dialogue

Melanie HoffmannMasterarbeit

Prufer: Prof. Dr. Grzegorz DogilBetreuerin: Dr. Natalie Lewandowski

Beginn der Arbeit: 28. Februar 2014Ende der Arbeit: 31. Juli 2014

Eigenstandigkeitserklarung

Hiermit versichere ich,

• dass ich die vorliegende Arbeit selbststandig verfasst habe,

• dass ich keine anderen als die angegebenen Quellen benutzt und allewortlich oder sinngemaß aus anderen Werken ubernommenen Aussagenals solche gekennzeichnet habe,

• dass die eingereichte Arbeit weder vollstandig noch in wesentlichen TeilenGegenstand eines anderen Prufungsverfahrens gewesen ist,

• dass ich die Arbeit weder vollstandig noch in Teilen bereits veroffentlichthabe und

• dass das elektronische Exemplar mit den anderen Exemplaren uberein-stimmt.

Stuttgart, den 31. Juli 2014

Melanie Hoffmann

3

Acknowledgements

Ich mochte gerne meiner Betreuerin Natalie Lewandowski danken - dafur, dassihre Tur immer offen stand, wenn ich Fragen hatte, fur ihre Geduld und Zeitund fur ihre sehr hilfreichen Kommentare und Tipps. Eine bessere Berteuerinkann man sich kaum wunschen!

A huge thank you goes to Charlie P., who read through my work. I’m sure hemade it more joyful to read.

Ich mochte außerdem meinen Eltern danken. Danke, dass ihr immer an michglaubt und fur eure Unterstutzung! Papa - danke, fur die vielen hifreichenGesprache. Mama - danke fur das Umsorgen und fur das Versprechen unserergemeinsamen Reise! Ich freue mich sehr darauf.

Lieben Dank auch an Jan W. fur das viele Mut zusprechen und Gluck wunschen,nicht nur in Bezug auf diese Arbeit. Du warst eine große Unterstutzung.

4

Contents

Abstract 13

1 Introduction 15

2 Accommodation in dialogue 172.1 Communication Accommodation Theory . . . . . . . . . . . . . 172.2 Phonetic convergence . . . . . . . . . . . . . . . . . . . . . . . . 22

3 The perception of voice 273.1 Listeners’ judgements from voice perception . . . . . . . . . . . 28

3.1.1 Physical characteristics of the speaker . . . . . . . . . . . 283.1.2 Psychological characteristics of the speaker . . . . . . . . 313.1.3 Social characteristics of the speaker . . . . . . . . . . . . 35

3.2 Voice and exemplar theory . . . . . . . . . . . . . . . . . . . . . 37

4 Method 434.1 The Praat Voice Report . . . . . . . . . . . . . . . . . . . . . . 434.2 Corpus analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5 Results 515.1 Analysis 1 - Individual speakers . . . . . . . . . . . . . . . . . . 525.2 Analysis 2 - Speaker F . . . . . . . . . . . . . . . . . . . . . . . 565.3 Analysis 3 - Position in dialogue . . . . . . . . . . . . . . . . . . 58

6 Discussion 656.1 Discussion of method . . . . . . . . . . . . . . . . . . . . . . . . 656.2 Discussion of results . . . . . . . . . . . . . . . . . . . . . . . . 69

6.2.1 Discussion of Analysis 1 - Individual speakers . . . . . . 696.2.2 Discussion of Analysis 2 - Speaker F . . . . . . . . . . . 726.2.3 Discussion of Analysis 3 - Position in dialogue . . . . . . 73

6.3 Changes in fundamental frequency . . . . . . . . . . . . . . . . 746.4 Temporal aspects . . . . . . . . . . . . . . . . . . . . . . . . . . 756.5 Engagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7 Conclusion and Outlook 77

Bibliography 79

5

Contents

Appendices 95

A Outliers - Individual speakers 97

B Outliers - speaker F 99

C Results of Analysis 1 - Individual speakers 101C.1 Dialogue A-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101C.2 Dialogue C-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103C.3 Dialogue H-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105C.4 Dialogue J-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107C.5 Dialogue K-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

D Results of Analysis 3 - Position in dialogue 111D.1 Dialogue A-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111D.2 Dialogue D-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115D.3 Dialogue H-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118D.4 Dialogue J-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122D.5 Dialogue K-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

6

List of Tables

3.1 Possible judgements from the perception of voice. . . . . . . . . 293.2 Relations between acoustic parameters and emotions. . . . . . . 31

4.1 Sum of outliers - Individual speakers. . . . . . . . . . . . . . . . 494.2 Sum of outliers - Speaker F. . . . . . . . . . . . . . . . . . . . . 49

5.1 Paired samples statistics, Analysis 1- Speaker D. . . . . . . . . . 525.2 Paired t-tests, Analysis 1 - Speaker D. . . . . . . . . . . . . . . 535.3 Paired samples statistics, Analysis 1 - Speaker F. . . . . . . . . 535.4 Results of the paired t-tests, Analysis 1 - Speaker F. . . . . . . . 545.5 Summary of Analysis 1 - Dialogue D-F. . . . . . . . . . . . . . . 545.6 Summary of Analysis 1. . . . . . . . . . . . . . . . . . . . . . . 555.7 Paired samples statistics, Analysis 2 - Speaker F. . . . . . . . . 575.8 Paired t-tests, Analysis 2 - Speaker F. . . . . . . . . . . . . . . . 575.9 Summary of Analysis 2 - Speaker F. . . . . . . . . . . . . . . . . 575.10 Descriptive statistics, Analysis 3, Dialogue C-F, F0 Mean. . . . 585.11 Results of ANOVA of Analysis 3, Dialogue C-F, F0 Mean. . . . 595.12 Descriptive statistics, Analysis 3, Dialogue C-F, F0 SD. . . . . . 595.13 Results of ANOVA of Analysis 3, Dialogue C-F, F0 SD. . . . . . 595.14 Descriptive statistics, Analysis 3, Dialogue C-F, F0 Min . . . . . 605.15 Results of ANOVA of Analysis 3, Dialogue C-F, F0 Min. . . . . 605.16 Descriptive statistics, Analysis 3, Dialogue C-F, F0 Max. . . . . 615.17 Results of ANOVA of Analysis 3, Dialogue C-F, F0 Max. . . . . 615.18 Descriptive statistics, Analysis 3, Dialogue C-F, jitter. . . . . . . 615.19 Results of ANOVA of Analysis 3, Dialogue C-F, jitter. . . . . . 625.20 Descriptive statistics, Analysis 3, Dialogue C-F, shimmer. . . . . 625.21 Results of ANOVA of Analysis 3, Dialogue C-F, shimmer. . . . 635.22 Descriptive statistics, Analysis 3, Dialogue C-F, HNR. . . . . . . 635.23 Results of ANOVA of Analysis 3, Dialogue C-F, HNR. . . . . . 635.24 Summary of Analysis 3 - Position in dialogue . . . . . . . . . . . 64

6.1 Mean and standard deviation for jitter, shimmer and HNR. . . . 68

7

List of Figures

2.1 Interactive Alignment Model. . . . . . . . . . . . . . . . . . . . 242.2 Hybrid model of convergence. . . . . . . . . . . . . . . . . . . . 25

3.1 Perception-production loop in exemplar theory. . . . . . . . . . 41

4.1 Perturbation: jitter and shimmer. . . . . . . . . . . . . . . . . . 454.2 Dialogues of speaker F. . . . . . . . . . . . . . . . . . . . . . . . 474.3 Boxplot, outliers F0 Min. . . . . . . . . . . . . . . . . . . . . . . 48

6.1 Error of measurement for F0 Min in Praat. . . . . . . . . . . . . 666.2 Error of measurement for F0 Max in Praat. . . . . . . . . . . . 66

9

List of Abbreviations

ANOVA Analysis of varianceCAT Communication Accommodation TheoryCNN Cable News NetworkF0 Mean Mean fundamental frequencyF0 Min Minimum of the fundamental frequencyF0 Max Maximum of the fundamental frequencyGeCo German conversations corpusHNR Harmonics-to-noise ratioIAM Interactive Alignment ModelIQ Intelligence quotientMDVP Multi-Dimensional Voice ProgramNHR Noise-to-harmonics ratioRP Received pronunciationSAT Speech Accommodation TheorySD Standard deviationStd. Error Mean Standard Error MeanVOT Voice onset time

11

Abstract

Communication, according to Communication Accommodation Theory, is de-

pendent on social and situational factors. Due to the emotional states, the

amount of attention paid, the impression of the conversational partner and

the situation, speakers can become more similar or more dissimilar towards

their conversational partner in behaviour and speech, or they can even main-

tain their own style.

The aim of the present thesis was to investigate phonetic convergence, de-

fined as the increasing acoustic similarity of speech, in parameters of voice. The

intention was to identify those voice parameters that are sensitive to changes

due to accommodation towards the conversational partner.

Samples from six different dialogues of the GeCo corpus [SL13, SLD14], con-

sisting of German spontaneous speech of female speakers, were extracted from

the first and the last five minutes. Each sample was analysed for parameters of

voice (mean, minimum, maximum and standard deviation of the fundamental

frequency, jitter, shimmer and harmonics-to-noise ratio) using the Praat Voice

Report [BW14]. Evaluations of the results were made with paired t-tests and

ANOVAs. Additionally, ratings of the speakers for their conversational partner

about social attractiveness and competence were included in the evaluation of

the outcome.

Results revealed that the individual speakers changed on different voice pa-

rameters. Thus it can be assumed that accommodation is not a fully automatic

process, but is also dependent on the situation, on the conversational partner

and on speaker’s individual characteristics. Furthermore, analyses indicate

that parameters of the fundamental frequency are sensitive towards accom-

modation. Tendencies for convergence, divergence and maintenance of the

results were partly confirmed by speakers’ ratings about social attractiveness

and competence.

13

1 Introduction

Whenever we perceive the voice of another person, whether while reading, chat-

ting, screaming or even whispering, we draw information about the speaker.

We can make statements, about whether the person is male or female, if he

or she is young or old, in which mood or emotional states he or she is, if he

or she is ill and where he or she is from. The impressions we have are not

necessarily correct, but nevertheless lead to the drawing of an “auditory face”

[BFB04], that allows to recognize individuals, emotional states and aspects of

personality and heritage.

We can adjust our voice, according to the perceived acoustic parameters of

our conversational partner’s voice and the information that we draw from it.

The Communication Accommodation Theory (Chapter 2.1), referring

to behaviour which includes speech, states that convergence (becoming more

similar to the conversational partner), divergence (becoming more dissimilar to

the conversational partner) and maintenance (persisting in one’s own original

style) are socially motivated. Thereby we can define and express the social

distance to our conversational partner.

Phonetic convergence (Chapter 2.2) then refers to the acoustic charac-

teristics of the increasing similarity of speech. Speakers hereby adopt the

acoustic-phonetic features of the conversational partner. When dealing with

phonetic convergence the question arises, to what extent is convergence con-

trollable and conscious to the speaker? The Interactive Alignment Model

proposes that the link of perception and production is directly coupled and

that convergence is thus mostly automatic. This has been criticized due to

the fact that divergence and maintenance would not be possible since speakers

would only be able to converge to their conversational partner. A hybrid model

then assumes that convergence is driven automatically, but can be influenced

through various factors, including the speaker, the conversational partner and

the situation.

15

1 Introduction

Chapter 3, The perception of voice, deals with different speaker charac-

teristics that listeners can draw from the perception of voices. Their judge-

ments do not have to be necessarily correct, but nevertheless they are made,

as about physical, psychological and also social characteristics of the speaker.

These details and pieces of information about the speaker are stored together

with acoustic and lexical information in the mind of the listener. In order to

converge towards the conversational partner listeners have to remember and

reuse this information. This process can be well explained by Exemplar The-

ory, which proposes that every stimulus, such as different voice parameters,

can be stored in mind as a detailed trace and can then serve as the basis of

recognition and production.

The methods used in the present thesis are presented in Chapter 4, Method.

There, the Praat Voice Report is presented including the voice parameters

investigated: the mean, standard deviation, minimum and maximum of the

fundamental frequency as well as jitter, shimmer and the harmonics-to-noise

ratio. The corpus used is the GeCo corpus, which consists of dialogues of

about 25 minutes of spontaneous speech of female German native speakers.

In Chapter 5, which describes the Results, three different analyses were

conducted with the help of SPPS, using paired t-tests and ANOVAs. The first

analysis investigates the speakers’ individual differences within the dialogues,

the second investigates differences for one speaker participating in all six dia-

logues and the third sheds light on differences in the voice parameters when

comparing beginning and end points of the dialogues.

In the Discussion (Chapter 6) the results of the individual analyses are

discussed. For the first analyses, the analysis of the individual speakers, the

speakers’ ratings about their conversational partners’ social attractiveness and

competence were included. Chapter 7, Conclusion and outlook, summa-

rizes the findings and provides an outlook to future investigations concerning

phonetic convergence.

16

2 Accommodation in dialogue

In the interaction of dialogue partners processes of accommodation can occur,

including adjustments in the direction of the conversational partner. Commu-

nication Accommodation Theory (CAT), described in Chapter 2.1, gives ex-

planations and speakers’ motivations for variation in behaviour and in speech.

Phonetic convergence, see Chapter 2.2, refers to listeners’ perception of an

acoustic signal that can be broken down to a set of features, which can be

reused in production. In addition, the degree of automaticity of convergence

is examined with the help of the Interactive Alignment Model and a hybrid

model.

2.1 Communication Accommodation Theory

Communication is context-sensitive: people tend to speak differently depend-

ing on conversational partners, e.g. colleagues, one’s parents or children. Con-

nections and relationships between humans are very complex and “suited in

an array of dynamic human features and personal attributes, such as experi-

ences, developmental process, and social and personal identity orientations”

[PG08, p. 25]. The assumption, according to Communication Accommodation

Theory (CAT), is that interpersonal and intergroup relationships are mediated

and maintained through communication [GG98].

CAT was proposed in the 1970s named Speech Accommodation Theory

(SAT) [Gil73]. The theory thereby focussed on motivations of speakers that

cause individuals to change speech behaviour [Gil73, GTB73]. The expansion

of the scope, including relational, contextual and identity processes in inter-

action, then lead to a redefinition of the theory to Communication Accom-

modation Theory, which combines the areas of social psychology, sociology,

sociolinguistics and communication [SGLP01]. CAT proposes that speakers

17


use verbal as well as non-verbal communication to achieve a desired social dis-

tance between themselves and the conversational partners [SGLP01, PG04].

In other words, parameters like use of language, quality of voice, gestures, pos-

ture, body movements, physical proximity, eye contact and facial expressions

can be used to emphasize or reduce the social distance to the conversational

partner and can express social status differences, ethnic and group boundaries

as well as role or norm-specific behaviours [SGLP01, p. 34]. Therefore the

process of accommodation is assumed to be complex and context-sensitive.

Accommodation could be proved for diverse dimensions in verbal and non-

verbal behaviour. Individuals’ behaviour during interaction becomes similar;

for example foot shaking and face touching [CB99], facial expressions, smiling

and gestures [GO06, p. 295]. Also information density is found to become

more similar during dialogue [AJL87]. For verbal behaviour, accommodation

becomes, for example, obvious by the length, duration and frequency of pauses

[GH82, JF70], the duration of utterances [GH82, JF70], for accent [BG77] and

backchannels [SL12].

The goals of CAT are explaining speakers’ linguistic and behavioural choices,

in which ways speakers adjust their speech towards their conversational partner

and also in what ways speech is perceived, evaluated and reacted to [GG13].

In order to create, reduce or maintain social distance speakers use different

strategies, namely convergence, divergence and maintenance.

1. Convergence

Convergence describes a speaker’s adjustment of his or her speaking style

and behaviour to become more similar or synchronous to a conversa-

tional partner [SGLP01]. Therefore individuals adopt each other’s com-

municative behaviour and thus reduce the social distance. Convergence

“is typically associated with affiliation, social approval, compliance, and

communication effectiveness” [PG08, p. 19].

2. Divergence

During the process of divergence, speakers accentuate their individual

distinctiveness [BG77] and emphasize the differences between self and

other [GOG05, SGLP01]. Hereby they display antipathy and social dis-

approval for the conversational partner [SGLP01].

3. Maintenance

The strategy of maintenance of “attempted non-convergence and non-

18


divergence” [SGLP01, p. 35] describes the behaviour in which a person

persists in his or her original style. The communication behaviour of the

conversational partner is thereby not regarded by the speaker [GOG05].

Reasons for this might be the substantiation of the speaker’s identity or

autonomy without emphasizing it or another possibility might be a lack

of sensitivity [GO06, p. 297]. Maintenance is often regarded similar to

divergence [GO06].

Like divergence, the process of overaccommodation can have a negative ef-

fect for communication [PG08, p. 19]. Overaccommodation is defined as “a

category if miscommunication in which a participant perceives a speaker to

exceed the sociolinguistic behaviours deemed necessary for synchronized in-

teraction” [SGLP01, p. 38]. An example of overaccommodation is patronizing

speech (e.g. nurses speech to elderly clients [EN93] or medical students talking

to patients with disabilities [DBSA11]). Speech then consists of simplified use

of vocabulary and grammar and slow enunciation [PG08, p. 19].

Conversely underaccommodation “specifies communication environments where

speakers do not afford their listeners adequate rights or space in conversational

interaction” [CJ97, p. 243]. Thus a speaker is perceived as not interested in

the conversation or not willing to exert effort for it [SGLP01].

Another strategy similar to divergence is speech complementarity [SGLP01,

GCC91]. Thereby speech is modified in a way which “accentuates valued soci-

olinguistic differences between interlocutors occupying different roles” [SGLP01,

p. 35]. An example for that is a study by Hogg in which men and women

changed speech behaviour during dialogue with each other [Hog85]. Men were

likely to use more masculine sounding voices and women were likely to use

a more female and soft sounding voice than they did in dialogues in which

the conversational partners were of the same sex. This phenomenon can be

explained by traditional sex-role ideologies [GCC91]. A women for example

might try to gain a man’s approval and wants to seem attractive to him and

thus uses a soft voice, but still she might converge to his dialect and/or other

parameters. Thus convergence can be accompanied by speech complementar-

ity.

Accommodation can occur in several dimensions as unimodal or multimodal

accommodation [SGLP01]. Unimodal accommodation describes a change of

only one layer, whereas multimodal accommodation occurs on several layers.

19


Additionally, accommodation can occur to different degrees, partial (interac-

tion partners converge slightly to each other) or full (behaviour of interaction

partners matches exactly) [SGLP01]. Interactions can also be symmetrical

or asymmetrical [SGLP01, p. 37]. Symmetrical accommodation describes an

interaction in which interaction partners behave equally (e.g. two previously

unknown people that come to be workmates [GP75, p. 177]), whereas in

asymmetrical interaction partners do not (e.g. job candidate and interviewer

[GP75, p. 176]). Additionally, the direction of accommodation can be described

[SGLP01]. Unidirectional accommodation describes the process in which only

one interaction partner accommodates in his or her behaviour, mutual accom-

modation means that both interaction partners accommodate.

Another important distinction is whether convergence or divergence are di-

rected upwards or downwards [GP75, SGLP01, GO06]. Upward convergence

occurs when one speaker adapts the prestige speech patterns of his conver-

sational partner and thus becomes more similar to him or her. Gregory and

Webster, for example, found that the pitch of voice of Larry King, the talk show

host on the CNN (Cabel News Network) Larry King Live talk show, was found

to be an indicator for the social status of his guests [GW96]. While Larry King

would converge to high-status guests, lower-status guests tend to converge to

him. Contrary, downwards convergence implies that one conversational part-

ner converges to the other who possesses less prestigious patterns. Azuma

found occurrence of downwards convergence in speeches of Japan’s emperor

Hirohito after the Pacific War [Azu97]1. Upward divergence then “can be in-

terpreted as indicating the sender’s desire to appear superior to the receiver in

social status and competence” [GP75, p. 178]. Hence the speaker wants to be

recognized as having a higher social status. Also downwards movements exist

- downward divergence then describes the “emphasis of one’s low-prestige mi-

nority heritage” [GO06, p. 295] or “down-to-earthness [and] toughness” [GP75,

p. 178].

Next to form, degree and direction of accommodation, speakers’ motivation

plays a role in CAT. Giles and Powesland distinguish between two kinds of

factors that might affect the speech behaviour of individuals: endogenous fac-

tors and exogenous factors [GP75]. Endogenous factors deal with the speaker’s

“physiological and emotional states at the time of interaction” [GP75, p. 119],

1According to Shepard et al. this result could also reflect overaccommodation on side ofthe conversational partners of Hirohito, who showed upward convergence, while Hirihitoshowed downward convergence [SGLP01, p. 37]

20


so factors that are internal to the speaker. Anxiety, for example, can thus

influence a speaker’s speech rate and pronunciation and can cause vocal dis-

turbances as well. Exogenous features, on the other hand, “are external to the

sender but present in the immediate social situation” [GP75, p. 118], such as

aspects of topic and context 2.

Beside the internal state of an individual and the context and topic of the

conversation attention has also be paid to the conversational partner. De-

pending on whom individuals are talking to they adjust their speech and be-

haviour. When talking to an unknown person, one of the first clues to the

characteristics of the conversational partner is his or her physical appearance.

Thus one hypothesis is, that sex/gender [NNS02, Par06, Lew12, Bab12], race

[Bab12] and social status [GW96] might play a crucial role in accommodation

[GP75]. Up to now there is no clear evidence for these “macro social factors”

[ACG+11, p. 192] being significant. Furthermore, the experiment by Abrego-

Collier and colleagues showed that neither gender nor the sexual orientation

of the speaker were found to be significant, but the personal opinion about

the speaker, which is built situationally and influences the direction of accom-

modation [ACG+11]. These include speakers’ motivations like proposed in the

Similarity Attraction Hypothesis [Byr71], which states that individuals try to

be more similar to people whom they are attracted to. Other proposed moti-

vations are the need of the speaker to gain approval from the conversational

partner [SG82] and the speaker’s concern for arranging the conversation un-

problematically and smoothly [GGJ+95] and/or to accomplish mutual goals

(as joint project) [Cla96]. It is also possible that speaker’s intelligibility might

increase during the interaction [Tri60] and his or her desire to reduce the social

distance [SGLP01] and to achieve interpersonal liking [CB99, p. 901].

In addition, intervisibility influences accommodation/convergence. Differ-

ences in speech behaviour occurred when individuals were able to see the

person they are listening to. Babel found more convergence in vowel spec-

tra when the participants of her experiment could see a picture of the speaker

who produced the signal [Bab12] and Schweitzer and Lewandowski observed a

higher articulation rate, when individuals were able to see their conversational

partner they are talking to [SL13].

In the following chapter phonetic convergence is described, which is “the

2Due to the fact that topic can influence the individual’s emotional state the distinctionbetween endogenous and exogenous factors can be indistinct.

21


process in which a talker acquires acoustic characteristics of the individual

they are interacting with” [Bab09, p. 3].

2.2 Phonetic convergence

Phonetic convergence is defined as the increase of segmental and suprasegmen-

tal similarities in speech [Par06] (also called phonetic imitation or phonetic

accommodation [Bab09, Bab12]). Thereby individuals take in the acoustic

characteristics of their conversational partner [Bab09].

When investigating how conversational partners accommodate/converge,

one should also consider the link between perception and production [SF97].

A person has to perceive the speech of the conversational partner in order to

reuse its features in his or her own. It is an open question as to which features

a person relies on. Several studies have investigated individual features:

• VOT [Nie07, Nie10, SF97]

• speech rate [SL12, Web70]

• amplitude:

amplitude envelopes [Lew12]

amplitude contour [Gre90]

• intensity [GH82, Nat75, LH11]

• fundamental frequency [GW96, GDW97, BB11, LH11]

• vowel quality [Bab09, Par10, PGSK12]

• jitter and shimmer [LH11]

• harmonics [LH11]

It might also be possible that speakers perceive (and reuse) not only single fea-

tures, but also different combinations of features. Speaker-specific differences

can also occur. In order to understand how individuals are able to reuse the

perceived features, one has to deal with models of the link between perception

and production. These are discussed in Chapter 3 (The perception of voice).

When dealing with the perception and production of acoustic parameters an

important question that arises is, to what extent are processes of convergence

22


controllable by the speaker, to what extent are they automatic and what influ-

ence do social factors have? The Communication Accommodation Theory (see

Chapter 2.1) regards changes in behaviour and speech socially motivated and

thus probably subconscious [Bab09, p. 20], whereas the Interactive Alignment

Model by Pickering and Garrod assumes a direct coupling between language

perception and language production processes [PG04].

The model is based on the assumption of Dijksterhuis and Bargh that imi-

tation is purely based on perception and that “no motivation is required, nor

a conscious decision” [DB01, p. 32]. Therefore imitation is a natural and au-

tomatic process with the caveat that other processes can intervene and inhibit

the process of accommodation. Pickering and Garrod adopt this assumption

for speech behaviour (in dialogues) and assume that “production and compre-

hension become tightly coupled in a way that leads to the automatic alignment

of linguistic representations at many levels” [PG04, p. 170] (e.g. syntactic, se-

mantic and phonological levels). Due to the interconnection of these levels

the alignment of one level automatically leads to the alignment of other lev-

els. Agreement on these levels lead to a mutual understanding between the

conversational partners. Aligned situation models, which contain information

about “space, time, causality, intentionality and currently relevant individu-

als” [GP04, p. 8] are the results of the interaction of the conversational partners

[PG04].

Figure 2.1 displays the proposed channels of alignment (by horizontal, dashed

arrows). The “channels are direct and automatic” [PG04, p.177] according to

priming. Priming describes the process by which the “activation of a represen-

tation in one interlocutor leads to the activation of the matching representation

in the other interlocutor directly” [PG04, p. 177]. A representation used for

the purposes of comprehension can then be reused for production. Pickering

and Garrod call this parity [PG04, p. 177]. If parity exists the neuronal infras-

tructure for speaking and listening should be the same [MPG12]. Evidence for

this has been found by Menenti et al. [MPG12] and by Garnier and colleagues

[GLS13]. Proof for automatic and unconscious alignment was also found by

Lewandowski [Lew12]. In her experiments with dialogues of native and non-

native speakers, the native speakers were explicitly instructed not to converge

to the nonnative speakers, but the results indicate that they nevertheless did

so.

In summary, the IAM proposes that the link of perception and production

23


Figure 2.1: Interaction Alignment Model proposed by Pickering and Garrod[MPG12, p. 2]. Different schematic levels of comprehension and pro-duction are linked with each other. The dashed lines represent thechannels of alignment.

is direct and that alignment is an automatic process. The model has been

criticized due to the fact that no processes or steps exist that could counteract

automatic alignment. According to CAT three different strategies exist: con-

vergence, divergence and maintenance (in order to represent social distance).

If alignment is an automatic process, maintenance and divergence would con-

sequently not be possible because speakers would only be able to converge.

Indeed evidence has been found that linguistic knowledge has influence on

convergence. Nielsen found that participants in her experiment imitated an ex-

tended voice onset time (VOT) after listening to words with extended VOT on

the phoneme /p/, but they did not imitate reduced VOT [Nie07, Nie10]. She

suggests that imitation of the reduced VOT would have introduced phono-

logical ambiguity to the corresponding voiced plosives, while there were no

such ambiguities in imitating extended VOT. Babel also found that speakers

did not imitate all vowels heard from a model talker [Bab09]. These find-

ings indicate that phonetic imitation is not an automatic process, but a se-

lective one that can be modelled through linguistic features. In addition, rat-

ings of mutual attractiveness and/or liking have been found being influential

[ACG+11, Nat75, PG08, SL13].

Hence Krauss and Pardo emphasized the need for a hybrid model “in which

24


alignment or imitation derives from both the kinds of automatic processes they

[Pickering and Garrod] describe and processes that are more direct or reflec-

tive” [KP04, p. 203] so socially motivated and conscious changes in speech can

also be considered in combination with the unconscious and automatic aspects

of alignment. Lewandowski proposes such a hybrid model [Lew12], which co-

vers different kinds of factors that can influence convergence (see Figure 2.2).

Figure 2.2: Hybrid model of convergence, including automatic and subconsciousfactors [Lew12, p. 205].

In the proposed hybrid model automatic processes and those which are said

to be conscious or rather subconscious are merged. In addition, social as-

pects, the situational context and the speaker’s personality and abilities are

integrated. The model shows that the convergence mechanism automatically

yields convergence, but can be influenced or alleviated through the evalua-

tion of the dialogue partner, the situational context and the speaker himself,

including his personality, psychological features, linguistic prerequisites and

phonetic talent as it has been shown that more phonetically-talented speakers

converge more to their partners than less talented ones (in native-nonnative

dialogues) [Lew12]. Important are also the linguistic prerequisites that speak-

ers have concerning different languages and dialects [Lew12, p. 205]. Linguistic

structures are stored in the speaker as well. Individual differences (personality

and psychological features) of the single speakers form the frame for the de-

gree of convergence. The evaluation of the dialogue partner (e.g. social status

attractiveness, friendliness and sympathy), the situational context and social

goals like the need for social approval are the parameters on which a speaker

25


evaluates whether to converge, diverge or maintain.

Another important impact on convergence is attention, which “may adjust

the grain of perceptual resolution” [Par06, p. 2389]. If listeners are distracted

and/or do not listen carefully to their conversational partner, speech may not

be perceived in a detailed way. Also the process of the memory plays an

important role as perceived speech modalities and contents are stored there.

Experiments by Gordon and colleagues proved that indeed “attention plays a

role in the perception of phonetic segments and that the relative importance

of acoustic cues depends on the amount of attention that is devoted to the

speech stimulus” [GER93, p. 33].

26

3 The perception of voice

An utterance contains not only linguistic information “but also a great deal

of information for the listener about the characteristics of the speaker him-

self” [Lav68, p. 43]. Thus utterances can convey information from a speaker

to a listener on several layers as an abstract form, containing the structures

of a language, and a discrete form as the produced sounds. Many theories

therefore distinguish the concepts of language, the abstract system of gram-

mar, and speech, the vocal-motor-performance [KS11, p. 303] 1. The layer of

language contains several levels of linguistic information (such as e.g. phono-

logical, morphological, syntactic and semantic levels) which then represent the

content of an utterance [LP05, p. 203]. In other words, the speaker structures

and arranges the intended information, so that the listener can extract the in-

formation. The layer of speech, on the other hand, conveys information about

the speaker him- or herself (next to other parameters, e.g. facial expressions,

gestures, posture, etc.). Abercrombie refers to these conveyed features as in-

dexical properties which “[. . . ] may fulfil other functions which may sometimes

even be more important than linguistic communication, and which can never

be completely ignored” [Abe67, p. 5].

The two concepts of language and speech are carried out simultaneously -

this means that the speaker generates an acoustic signal, the listener receives

this signal and can then extract the information from both sources. Taking

the broad definition of voice2 into account the term voice includes the details

of the vocal fold motions and “the acoustic results of the coordinated action

of the respiratory system, jaw, lips and soft palate, both with respect to their

average values and the amount and pattern of variability in values over time”

[KS11, p. 6]. In this sense voice shares many similarities with the layer of

speech just described.

1Ferdinand de Saussure distinguishes langue et parole [Sau01]. John Laver distinguishesbetween the paralinguistic, extralinguistic and linguistic layers [Lav03].

2The narrow definition of voice is distinguished from speech and synonymously the termlaryngeal source, relating to the vocal fold vibrations exclusively [KS11, p. 5].

27


3.1 Listeners’ judgements from voice

perception

According to the perception of the conversational partner’s verbal an non-

verbal behaviour people are “obliged to make a continuous stream of judge-

ments [. . . ] about a wide spectrum of information” [Lav76/97, p. 31]. Lis-

teners are not only able to draw conclusions from the content of what their

conversational partner is conveying, but also get an impression of his or her

personal characteristics - such as his or her physical and psychological char-

acteristics and social attributes. These conclusions about the conversational

partner “shape our own behaviour into an appropriate relationship with him”

[Lav76/97, p. 31].

Table 3.1 shows some judgements that listeners can make when listening

to voices [KS11, p. 2]. Through the perception of voice listeners get impres-

sions about physiological characteristics of the speaker, such as age and height,

psychological characteristics, such as arousal and stress, as well as social char-

acteristics as the social status. These are not necessarily accurate [KS11, p. 1]

but nevertheless they affect subsequent interaction.

In the Chapters 3.1.1-3.1.3 below, characteristics of the speaker that listeners

can perceive are presented.

3.1.1 Physical characteristics of the speaker

Every vocal tract shows individual differences which leads to acoustically dis-

tinctive speech productions. These differences occur due to anatomical differ-

ences (such as size and shape of the vocal tract) from which listeners can draw

conclusions about the speaker’s age, sex, body size and body height. Krauss

and colleagues found evidence that listeners can identify a speaker (from two

photos shown) from his or her voice better than chance (76.5 %) and that

they can estimate age and height only a little less accurate as from a photo

[KFM02]. A study by Ryalls and colleagues showed that older people exhibit

shorter average positive VOT values for unvoiced plosives [RST04]. They ex-

plain these findings by decreased lung volumes that older speakers often have

and subsequently a lower speaking rate. Studies also showed the development

28

3.1 Listeners’ judgements from voice perception

Physical characteristics of the speakerAgeAppearance (height, weight, attractiveness)Dental/oral/nasal statusHealth statusVocal fatigueIntoxicationRace, ethnicitySexSmoker/non-Smoker

Psychological characteristics of the speakerArousal (relaxed, hurried)CompetenceEmotional status/moodIntelligencePersonalityPsychiatric statusStressTruthfulness

Social characteristics of the speakerEducationOccupationRegional originRole in conversational settingSocial statusSexual orientation

Table 3.1: Judgements listeners can make from voice perception [KS11, p. 2], mo-dified.

of the fundamental frequency for men and women with age are caused by

different ratios of hormones [TE95].

In discriminating races, the distinction between racial profiling, which is

“based on visual cues that result in the confirmation of or in speculation con-

cerning the racial background of an individual or individuals” [Bau00, p. 363],

and linguistic profiling, which is “based upon auditory cues that may be used

to identify an individual or individuals as belonging to a linguistic subgroup

within a given speech community, including a racial subgroup” [Bau00, p. 396],

are made.

Concerning linguistic profiling, evidence has been found that listeners in-

deed can extract information from acoustic cues. Newman and Wu asked

New Yorkers to identify race and national heritage from other New Yorkers

[NW11]. Participants then heard speech samples from Chinese, Korean, Euro-

29


pean, Latino and African Americans and categorized these into black, white,

Hispanic or Asian. Results indicated that judgements thereby were better

than chance. A phonetic analysis showed then that the different speakers dif-

fer in VOT, breathiness of voices, levels of productions of /E/ and /r/ as well

as in rhythm. Ryalls and colleagues also observed that participants with dif-

ferent ethnic backgrounds (Afro-American and Caucasian-American) differed

in their durations of positive and negative VOT [RZB97]. In addition, their

study showed that also sex plays a role on the durations of positive and neg-

ative VOT. Other typical differences for men and women is the fundamental

frequency. Typical values are 120 Hz for men and 210 Hz for women [TE95]3.

Along with age, sex, race and ethnicity, the health status of the speaker can

be heard, as for example vocal diseases of the organic types like polyps (lesion

on the anterior third of the vocal fold [WJM+04, p. 125]) or vocal fold nodules

(symmetric, small lesions occurring on both side of the vocal fold [WJM+04,

p. 125]). They can be caused by vocal overuse like singing, shouting and long

and loud usage of voice. Petrovic-Lazic and colleagues found evidence that

patients with polyps had different values for jitter, shimmer, variation of the

fundamental frequency and harmonics-to-noise ratio (and others) than speak-

ers without polyps and that these values improved after surgery [PLB+09].

Also the dental, oral and nasal status can be heard from voice because the

place and manner of articulation plays a role in sound production (of conso-

nants). Through the modification of the articulators (e.g. lips, teeth, parts

of the tongue, glottis) and also different manners of articulation (e.g. produc-

tion of nasals, plosives, fricatives) different sounds can be produced and it is

possible to deduce information about the status of the speech organs. During

a cold or an allergic reaction, for example, mucous membranes of the upper

respiratory system are swollen, leading to a nasal voice [Lav68, p. 47].

Vocal fatigue, defined as “negative vocal adaptation that occurs as a con-

sequence of prolonged voice use” [WM03, p. 22] such as singing or reading

out loud, also affects voice. Evidence shows that mean fundamental frequency

increases after long periods of reading out text [SSL94]. Reported effects on

jitter, shimmer and harmonics-to-noise ratio differ between studies and exper-

imental settings.

3Statements about the fundamental frequency and the standard deviation differ in litera-ture ([TE95]).

30


In addition, smoking affects the quality of voice. Several studies showed

that smoking leads to a lowered mean fundamental frequency of speakers

[SH82, MD87] and also differences in harmonics-to-noise ratio and shimmer

[Bra94]. Intoxication can also be perceived by listeners. Alcohol affects hu-

man cognitive, motor and sensory processes and thus leads also to changes

in speech and voice [KS11, p. 357]. Schiel found that listeners can detect al-

coholic intoxication better in female voices than in male voices and better in

read speech than in spontaneous speech [Schi11]. Hollien and colleagues found

that a raise in fundamental frequency, a slow speaking rate and nonfluencies

in speech can be observed for intoxicated participants [HDJ+01].

3.1.2 Psychological characteristics of the speaker

Studying the effects of emotion on voice is difficult because questions arise as

to what exactly emotions are, how many of them exist and what type of fine-

grained distinctions can be made (e.g. anger can be expressed very tempered

and controlled or uncontrolled like in a rage attack). A tendency for affecting

acoustic parameters has been shown in studies for the emotions of sadness,

fear, anger, happiness and boredom. A summary [KS11, p. 321] can be seen

in Table 3.2.

Sadness Fear Anger Joy/ BoredomHappiness

F0 slightly very much very much much lower ormean lower higher higher higher normal

F0 (sightly) wider or much much morerange more narrower wider wider monotone

monotone or normal

F0 downward normal or abrupt smooth, lessvariability/ inflections abrupt changes/ upward variabilitycontour changes/ upward changes

upward inflectionsinflections

Intensity quieter normal louder louder quieter

Speaking slightly much slightly faster or slowerrate lower faster faster slower

Spectral less high- more or more more lessslope frequency less

energy

Table 3.2: Relations between acoustic parameters and emotions, adopted from[KS11, p. 321].

31


The emotions mentioned in Table 3.2, such as sadness and joy/happiness,

can be divided into more active and more passive emotions which can be

distinguished through the level of arousal. Fear, anger, joy and happiness

can be grouped as activation emotions. They exhibit a higher fundamental

frequency (F0), more fundamental frequency variability, faster speech rate,

increased intensity and increases in high-frequency energy [KS11, p. 325]. It

was concluded that there is a direct relation between arousal and physiological

effect [Sch86, WS72]. Emotional arousal cause an increase in aspiration rate as

well as changes in articulation and phonation [Sch03, p. 229]. This leads to an

increase in subglottal pressure, F0 and intensity. Similarly speech durations

between breaths are shortened and the typical speaking rhythm of a speaker

may be altered [KS11, p. 325]. Also, stress as an emotional load, exhibits

a raise in F0 and its standard variation [WJE+02] as well as in amplitude

[SMA+82]. More relaxed emotions, like sadness or quiet happiness, induce a

decreased motor control and less articulatory precision [WS72, p. 1239], which

could lead to increased jitter and shimmer and changes in the contour of the

fundamental frequency [KS11, p. 325].

It can be assumed that acoustic measures are useful identifying arousal,

but do not distinguish between the different emotions. Evidence for this has

been found by Laukka and colleagues [LJB05]. In their experiment, model

talkers spoke a sentence with different emotions (anger, fear, disgust, happiness

and sadness). Afterwards several vocal cues were measured and additionally

listeners had to rate the samples heard on different dimensions (activation,

valence, potency and emotional intensity). The results of their experiment

indicate that cues of activation and emotional intensity largely overlap (e.g.

for high mean and maximum F0, large variability of F0, mean and standard

deviation of intensity). No relation was found between the ratings of the

listener (whether the sentence was positive or negative) when comparing to

the acoustical cues. Laukka et al. conclude that it is possible that they did

not capture all cues that listeners use for their judgements.

Goudbeek and Scherer found that valence and potency might nevertheless

be expressed in voice, although arousal is dominant for many acoustical param-

eters [GS10]. When level of arousal was low, valency was reflected in spectral

slope and variability of intensity and potency was expressed through shimmer.

A high level of arousal was reflected in spectral slope and intensity as, it was in

the low levels of arousal, and additionally in intensity level and spectral noise.

32


Potency in high levels of arousal was expressed through variability of inten-

sity, spectral shape and the level of fundamental frequency. These findings

indicate that valence and potency can be measured, although they are depen-

dent on arousal. Nevertheless individuals also seem to vary in their expression

of emotion [WS72].

Furthermore, there is evidence that verbal and emotional messages interact

in production. In the experiment of Nygaard and Queen participants were

exposed to spoken words with a happy, sad or neutral meaning (e.g. comedy

and cancer) [NQ08]. The tone of voice of the model talker thereby was congru-

ent or incongruent with the word’s meaning. Participants were able to repeat

the heard words more quickly when they were spoken in a tone of voice that

matched its meaning.

Additionally, there is evidence that the perception of emotions in voice is cul-

ture and language-specific [SBW01]. Scherer et al. let participants from nine

different countries in Europe, North America and Asia listen to speech samples

produced by German actors, spoken in different emotions (anger, sadness, joy,

fear and neutral tone). After listening to a sample participants assigned up

to two emotional labels for the sample heard. Overall, participants from this

experiment were able to infer the emotions with a degree of accuracy better

than chance. In addition, accuracy decreased with increasing language dissim-

ilarity from German. From these results, Scherer et al. concluded that there

are culture- and language-specific paralinguistic patterns which influence the

decoding process in the perception of the listeners.

Another judgement listeners make while listening to voices is about the

personality of a person. Personality itself is a broad concept [KS11, p. 343].

In the following the definition of Kreiman and Sidtis will be used, describing

emotion as a transient state and personality as “the more enduring (but by

no means stable or permanent) aspects of the manner in which individuals

respond to stimuli” [KS11, p. 342].

Evidence that listeners make judgements about the speaker’s personality

was found by Pear [Pea31]. He discovered the existence and importance of vo-

cal stereotyping. In his study judgements about personality were made from

voices of nine different speakers over the radio. Listeners were very accurate at

guessing the sex of the speakers (except for the eleven-year old child), their age

and partly accurate at guessing the profession of the speakers (especially for

the actor and the judge). Errors in guessing the profession of the speakers were

33


nevertheless more or less consistent. More than half of the listeners thought

that the first speaker, a police detective, was working on a farm and most of

them believed that the eighth speaker, an electrical engineer, worked in a man-

ual trade. Although the assumptions of the listeners were not always correct,

they drew conclusions from the voices. Pear thus considers stereotyping as an

important aspect of the perception of personality from voice.

Ko and colleagues tried to determine what information about the speaker,

especially about the speaker’s competence and his or her warmth, can be

inferred by listeners [KJS09]. In their experiment male and female speakers

had to read out resumes with stereotypically masculine or feminine content.

Ratings about competence (associated with assertive and decisive) were solely

affected by vocal femininity (not by sex or type of resume) - thereby voices

rated low in femininity were perceived as more competent. Warm voices,

associative with being supportive and caring, on the other hand, correlated

with high feminine voices. In addition, Ko et al. tested whether femininity

correlated with babyishness and found that vocal femininity had on overlap

with vocal cues of babyishness (associated with weakness and incompetence).

In a study by Berry and et al. similar results were achieved [BHL+94].

Listeners rated voice samples of five-year old children counting to ten according

to competence, leadership, dominance, honesty, warmth, attractiveness and

babyishness. Attractive voices suggested the personality to be competent and

warm and for boys’ voices additionally leading. Babyish voices of boys and

girls were associated with less competence, dominance and leading qualities,

but honesty and additionally warmth for the voices of boys.

Competence is also related to speaking rate - faster rates of speech were

associated with higher competence [SBS75] and social attractiveness [SBP83],

although it needs to be remarked that listeners may perceive speech rate rela-

tive to their own habitual speech rate [SBP83]. Thus judgements about com-

petence and social attractiveness might be dependent on both, speaker and

listener.

In an experiment of Fay and Middleton listeners judged speakers’ intelli-

gence according to read speech [FM40]. To draw a conclusions between the

judgements and the intelligence of the speakers the intelligence quotient (IQ)

of each speaker was measured. Results indicate that listeners were fairly re-

liable in their judgements but that they could judge the intelligence of some

speakers more reliable than those of others and that they rated more intelligent

34


speakers to be more intelligent in most of the cases. Another result was that

listeners seem to develop stereotypes of superior and inferior intelligence.

The studies described above show that listeners can consistently rate person-

ality traits from voices, although these might not always be accurate, and that

listeners draw stereotypes out of the quality of voices. Also McAleer et al.

found that personality judgements between judges of their experiment were

consistent when listening to the word ’hello’ produced by different speakers

[MATB14].

Additionally, there is evidence that psychiatric diseases often influence voice

parameters. Laukka et al. found that speakers who were afraid of speak-

ing in public due to social phobias exhibited a change in acoustic parameters

after treatment (mean and maximum fundamental frequency, high-frequency

components in the spectrum and pauses) [LLA+08]. Speakers with depression

tended to exhibit an increase in speech rate and a decrease in pausing with

proceeding treatment and mood change as well as a decrease in minimum fun-

damental frequency for women [ES96]. Listeners in the experiment of Todt

and Howell were able to significantly distinguish schizophrenic voices from

non-schizophrenic voices, describing them as more inefficient, despondent, and

moody [TH80]. Studies about the relations of psychiatric disease and voice

characteristics is a difficult field of study as individual differences, inconsistent

longitudinal changes and subjective diagnostic criteria complicate the studies

[KS11, p. 357].

3.1.3 Social characteristics of the speaker

Analyses showed that voices can be associated with social groups. In an ex-

periment of Moreau and colleagues Senegalese and European listeners heard

recordings of Senegalese Wolof speakers, a language mainly spoken in North

Senegal, and had to make statements about their social and caste status

[MTHH14]. Senegalese listeners were able to classify the social status better

than chance as well as European listeners, who did not have prior knowledge

about the Wolof language. The results of this experiment indicate that listen-

ers can, largely independent of knowledge about language structures and thus

relying on acoustic parameters, make statements about the social status of a

speaker.

35


Esling, who recorded male speakers in Edinburgh, found that vocal settings

(modal, creaky, whispery, breathy, harsh voice and combinations of them),

defined by Laver [Lav68], correlated with the socio-economic status of the

speaker [Esl78]. Creaky voices were associated with a higher social status

whereas harsh voices were associated with lower social status.

In the experiment by Gregory and Webster, who analysed speech with long-

term averaged spectra, found that accommodation was dependent on the social

status [GW96]. Speakers with a lower social status converged to speakers with

a higher social status. Dominance also played a role in the experiment of

Gregory and Gallagher [GG02]. They analysed the fundamental frequency of

US presidential candidates of eight elections and discovered that the candidate,

who did not converge in his fundamental frequency, was likely to win the

election. Next to dominance, the role in the conversational setting influences

speech. Pardo found that instruction givers, who had to explain a path in a

map task to another participant of the experiment, converged to the receivers

[Par06]. Still an open question is whether the sex of the speakers plays a

roles in the perception of speaker role as inconclusive results were obtained

regarding this matter [NNS02, PJK10, Lew12].

Linville proved in a perception experiment that female listeners were capable

of making accurate judgements regarding men’s sexual orientation (straight or

gay) [Lin98]. The acoustic analysis found evidence that the phoneme /s/ was

produced differently for straight and gay men. Although speakers tend to per-

ceive the sexual orientation of their conversational partner (or a model talker)

it does not seem to have any influence on accommodation. Abrego-Collier et

al. let participants of their experiment listen to a first-person narrative about

going on a date with different outcomes [ACG+11]. In the negative version

the speaker abandons his date and goes home alone, in the positive version

he goes to the date and they leave together. Additionally the narrative differs

in the sexual orientation of the speaker (“straight” and “gay” condition). In

the experiment VOT of plosives was extended in the recordings. The results

of the experiment indicate that participants with a positive opinion about the

speaker showed an increase in VOT after the experiment. Neither the outcome

of the date nor the sexual orientation are suggested to have had influence on

the participants.

Regional origins, as for example expressed by dialects or accents, can also be

perceived by the listener. Nasalisation characterizes most speakers of Received

Pronunciation (RP) in England, several accents from the United States of

36

3.2 Voice and exemplar theory

America and also Australia. Velarisation, on the other hand, functions as are

regional marker for speakers from Birmingham, England and parts of New

York [Lav68, p. 50].


As the previous chapters (Chapters 4.1.1-4.1.3) show, listeners are capable of

inferring several pieces of information about a speaker from voice perception.

The judgements they make are not necessarily accurate, but nevertheless they

lead to the formation of opinion about the model talker or conversational

partner, which guides further interaction.

In order to evaluate the conversational partner and to accommodate/converge

to his or her speech, listeners have to perceive (parameters of) their conversa-

tional partner’s speech. That means that listeners are able to adopt some of

the acoustic-phonetic features of their conversational partner. Therefore, there

has to be a link between the perception of the characteristics of the speech of

others and one’s own production. Many different models exist that propose

how perception and production are coupled. The present chapter deals with

the perception and storage of voice and voice details. These processes can be

well explained by exemplar theory.

In abstractionist theories the process of normalisation, mapping from speaker-

specific to a speaker-neutral abstraction, is assumed. Thereby voice informa-

tion would be discarded during speech perception and ideal, modality-free units

would be stored in the mental lexicon. According to the vocabulary previously

used, this would then mean that only language (including structure and con-

tent), excluding speech (information about the speaker himself, including voice

details), from the perceived acoustic signal would be kept in memory. Indeed

some experiments show tendencies that talker variability plays a role in mem-

ory, which contradicts the process of normalisation. Exemplar theory suggests

that individuals store detailed instances in memory and that they are able to

compare a new stimulus with these stored instances for perception.

Van Lancker et al. found that listeners were able to recognize voices, even

when played backwards (and thus without any language-specific and phonetic

information) [VLKE85]. Martin et al. had participants try to recall a word

list with monosyllabic English words in the right order [MMPS89]. The word

37


list was either produced by one speaker or by ten different ones (each word

spoken by different speaker). The result of this experiment was that the recall

was better for the participants that listened to the word list produced by one

speaker, but only for items that occurred early in the word list.

In a further experiment Martin et al. let participants run through an addi-

tional preload memory task. Participants were shown digits on a screen and

had to listen to the words list from the previous experiment with one of the

different conditions afterwards. The result of this experiment was that recall of

the digits was better if the word list that was heard afterwards was produced

by one speaker. Martin et al. suggested that the processing of the words

produced by different speakers requires more resources for working memory as

words produced by one speaker.

Goldinger et al. were able to prove some of the results of Martin et al.

[GPL91]. Participants in their experiments also had to recall of a word list

with ten items in the right order. The word lists were also produced by one

speaker or by different speakers (as in the experiment by Martin et al.). The

words were then presented to the participants at varying speeds. At relatively

fast presentation rates, the results are comparable to those of Martin et al.

The recall of the participants was better for the word lists produced by one

speaker than those produced by different speakers. In contrast, recall was bet-

ter for world lists produced by multiple speakers when they were played at slow

presentation rates. Goldinger et al. concluded that voice information (along

with lexical information) is retained in long-term memory when participants

are given sufficient time during rehearsal and that it facilitates the retrieval of

words.

These results of the study could also be proven by Lightfoot, who made

experiments with the familiarity of speakers’ voices [Lig89]. Participants of

her experiments were trained to recognize the voices of different speakers.

Different fictional names were associated with the voices. Word lists produced

by different speakers were then better recalled than word lists produced by

single speakers, even at relatively fast presentation rates. Nygaard et al. also

proved that familiarity with speaker’s voices facilitates the recognition of novel

words produced by familiar voices [NSP94]. They concluded that speaker-

specific information was encoded and retained in long-term memory.

Palmeri et al. also investigated the relationship between word recognition

and the memory of voices [PGP93]. In their experiment, old and new words

38


were presented to the participants, who decided if the word they heard was

new (played for the first time) or old (repetition of an already heard word).

The lag between the words was then manipulated. Up to 64 intervening words

were set between the repetitions of words. The words in the presented word

lists were also uttered by different speakers - participants heard 2, 6, 12 or 20

different voices in one trial (half male and half female). As in the previously

mentioned experiments, repetitions of words uttered by the same voice were

better recognized than those uttered by different voices (independent of sex).

The effect of increasing the number of different voices up to 20 had no effect

on the participants either. Thus it can be supposed that speakers do not

strategically encode voices, because otherwise the increase from 2 to 20 voices

should have impaired their ability to do so. Palmeri et al. thus propose

automatic voice encoding and that detailed voice information is part of the

representations of spoken words that are retained in long-term memory.

Goldinger then investigated how long voice-specific details remain in mem-

ory [Gol96]. To this end, he tested whether participants could still manage to

distinguish old and new words after delays of five minutes, one day and one

week. He also investigated if the advantage of words uttered by the same voice

still remains over time to prove whether same voice repetitions were still better

recognized than different voice repetitions. The results indicate that voice ef-

fects retained up to one week, but that they decrease over time (7.5 % after five

minutes, 4.1 % after one day, 1.6 % after one week). These findings indicate

that episodic traces do not only affect memory, but also influence later percep-

tion. Goldinger defines these episodic traces as “complex perceptual-cognitive

objects, jointly specified by perceptual forms and linguistic functions” [Gol96,

p. 1179]. Thus he supposes that words are recognized against a background of

detailed traces and that the mental lexicon can be viewed as episodic, which

means that memories decay.

Goldinger then tested if these traces affect not only later perception but

also speech production [Gol98]. Participants were asked to produce words

that were shown to them on a computer screen. The words had different fre-

quencies (high frequent, medium high frequent, medium low frequent and low

frequent), following the assumption that low frequent words are affected more

easily, because they are not represented by so many exemplars as high frequent

words [Pie01, Hin86]. Afterwards the participants listened to different speakers

who produced the same words several times (2, 6 or 12 repetitions). Partici-

39


pants then heard the words again and repeated them with different delays in

an immediate or delayed shadowing task. In a following AXB-perception test

the produced baseline words (A) were compared against the shadowed word

productions (B). Listeners therefore had to judge which of these productions

sounded more similar to the stimulus words in the shadowing task (X). Re-

sults indicate that participants sounded more like the samples that they were

exposed to and that indeed low frequency words, that are heard with a high

repetition rate, invoked strong imitation in the immediate shadowing condi-

tion. Thus Goldinger supposes different degrees of imitation and that variables

such as word frequency, number of exposures and response timing also play a

role.

It is worth noting that a distinction should be made between imitation

and convergence. “Imitation is a fully conscious and controlled action in a

controlled setting, whereas convergence happens rather naturally and without

full awareness or control” [Lew12, p. 79]. Nevertheless, one can assume that

similar tendencies exist for both.

Furthermore Goldinger explored the relationship between speech imitation

and attention [Gol13]. He had participants record baseline productions of

words. Afterwards they heard a speaker uttering one of these words and they

had to click on the related picture from a collection displayed on a screen. The

experiment had different conditions - in the first condition, the competitive

objects on the screen were dissimilar in visual and phonological form, in the

second, objects were phonologically similar (e.g. beetle, beater, beaker, beach-

ball, etc.), while in the third objects were visually similar (e.g. all objects

are rounded, examples are cookie, coin, pizza, etc.). In a following AXB-

perception test listeners then rated whether the produced baseline word or the

re-recorded words were more similar to the target word. Results indicate that

imitation increased when competitors (visual, phonological) were present and

that this increase is even stronger for phonologically similar objects. Goldinger

supposes that attention to the speech signal was modulated by the challenge

of the search task. If the participants needed to carefully monitor speech to

locate the appropriate targets, they created episodic traces very rich in detail

that supported better imitation.

The above described findings support the idea that the mental lexicon is

composed of detailed traces of past experiences. Thus, not abstract units, but

rather individual exemplars are stored, which serve then as the basis of recog-

40


Figure 3.1: Perception-production-loop in exemplar theory [Schw12, p. 52].When a stimulus is perceived, it is associated with similar productionsalready stored (can be regarded as category). In production a set ofstored exemplars is used to generate the production target.

nition and also for production (see Figure 3.1). Thereby each exemplar has

an associated strength. Exemplars from recent experiences and that occurred

more frequent are more vivid in memory than exemplars that are perceived

temporally remote and less frequent [Pie01].

For production Goldinger proposes that the mean of an activated set is se-

lected for production, which creates an “generic echo” [Gol97, p. 42]. Speaker-

specific information is encoded in episodic traces (next to lexical information,

e.g. words), including details about physical, psychological and social char-

acteristics (see Chapters 3.1.1-3.1.3). They are perceived and stored in the

memory of the listener and can be reused for production. The degree to which

this process is automatic, conscious and controllable is still an open question

(see Chapter 2.2). Also a “perfect” phonetic imitation of the conversational

partner’s speech is impossible due to the fact that speakers have limitations

on their abilities to do so, mainly due to the configuration of the articulatory

tract but also due to the fact that even the two productions of a single talker

are not acoustically identical [KP04, FBSW03].

According to CAT (see Chapter 2.1) accommodation then is a highly speaker-

and context-dependent process. Cooperative and social motivations, such as

approval of the conversational partner [SG82] or to arrange the conversation to

be unproblematic and smoothly [GGJ+95], lead to a more similar (speech) be-

haviour of the conversational partners. The usage-based account of exemplar

theory provides an explanation for the perception and production for imitative

phenomena. The perception of familiar voices and more frequent words lead

to more activated traces in memory and these sets of traces can then be used

for production [Gol97].

41


The following chapter describes parameters of voice which were investigated

in this thesis. Thereby the Praat Voice Report was used for samples extracted

from the GeCo corpus of spontaneous German conversations. The leading

hypothesis for the study was that conversational partners exhibit changes in

parameters of voice while having a dialogue. The speakers thus were expected

to react to the conversational partner and to situational aspects. These reac-

tions can be mirrored in the quality of voice. The study was supposed to show

on which parameters of voice (parameters of the fundamental frequency, jitter,

shimmer, harmonics-to-noise ratio) changes occur.

42

4 Method

In Chapter 4.1 the Praat Voice Report and the voice parameters of fundamental

frequency, perturbation and harmonicity are described. Further, in Chapter

4.2, the GeCo corpus and the analysis of the corpus are presented.

4.1 The Praat Voice Report

Voice analyses in this thesis was done with Praat [BW14]. The first step was

to extract samples from each speaker of the dialogues in which speaker F was

involved. Therefore a minimum of twenty samples was manually extracted

within each, the first and the last five minutes of the dialogue. The duration

of the samples depended on the duration of the utterance - they lasted from

1.401 to 6.997 seconds. Samples mostly had a length of between 3 and 4

seconds.

Afterwards, the Praat Voice Report was produced for every single sample. In

order to compare the individual samples with each other the pitch settings were

uniformly set: the pitch range was selected between 100 Hz and 450 Hz which is

a typical range for voices of women. Additionally, the analysis method was set

to cross-correlation to avoid errors in measurement [May13, p. 142]. The Praat

Voice Report offers 26 different (calculated) voice parameters in the areas of

pitch, pulses, voicing, jitter, shimmer and harmonicity (of the voiced parts).

Seven were included in the voice analysis of this thesis. They are described

below. Values of pulses and voicing were discarded in the analysis because the

analysis was taken on spontaneous speech and thus includes phonation breaks,

i.e. when voiceless sounds were produced or the speaker took a short pause

while speaking.

43

4 Method

Fundamental frequency (F0)

When sounds in human speech are voiced, the vocal folds in the larynx vi-

brate. These vibrations can be described as a complex quasi-periodic wave

[Joh03]. The fundamental frequency (F0), measured in Hertz (Hz), is then

the number of repetitions of this complex wave per second. Furthermore, the

fundamental frequency is the first and lowest frequency in the signal. It is

called quasi-periodic because in the strict mathematical sense the waves are

not perfectly periodic due to the not perfect identical periods. Nevertheless

the waves are periodic enough for the perception of a clear sound and the

identification of the fundamental frequency [May13, p.144]. Acoustically the

fundamental frequency is correlated with the perception of pitch in the voice.

The mean fundamental frequency (F0 Mean) was in the analysis.

The standard deviation of the fundamental frequency (F0 SD) is the second

value taken in the analysis of the fundamental frequency. It shows how much

the fundamental frequency deviates from the mean fundamental frequency (F0

Mean). The minimum (F0 Min) and the maximum (F0 Max), also included

in the present analyses, describe the range of voice of a speaker. A small

range stands for little variation, i.e. monotonous speech, and a large range

represents more variation, i.e. lively speech. The standard deviation is linked

to the minimum and the maximum.

Perturbation

Due to the quasi-periodicity of the sound waves, it is common for several

deviations in the duration, frequency and amplitude to occur with respect

to their neighbouring periods (perturbations). Deviations in the fundamental

frequency are called jitter and deviations in the amplitude are called shimmer,

which are declared in percent (%). To a certain extent, jitter and shimmer

are normal in the human voice. They can also be affected through influences

on the vocal folds (e.g. smoking affects shimmer values [Bra94]). Very high

values, which can be observed in pathological voices, lead to the impression of

a breathy, rough or hoarse voices [FHE07].

Different algorithms exist in Praat and other programs to calculate values

for jitter and shimmer. In general, the values for shimmer and jitter are better

the smaller they are [May13, p. 145]. The Multi-Dimensional Voice Program

44

4.1 The Praat Voice Report

Figure 4.1: Shimmer: perturbation of amplitude, jitter: perturbation of frequency[May13, p. 144] (modified).

(MDVP) measures jitter and shimmer with equal algorithms as Praat, except

for differences in the measurement on the previous step, the detection of pe-

riods in the acoustic signal. Praat is wave-form matching, while MDVP is

peak-picking [Boe09]. This difference leads to dissimilar results. MDVP pro-

poses a threshold for jitter (Jitt) of 1.04 % for the classification of healthy

and pathological voices. This threshold is not valid for calculations done with

Praat. The value for healthy voices for jitter (Jitter (local)) here probably has

to be lower than 1.0 % [May13, p. 145] 1.

The MDVP threshold value for shimmer (Shim) 3.81 % can also be cau-

tiously taken for the Praat value Shimmer (local) [May13, p. 156]. Nawka et

al. propose a value under 2,5 % [NFG06, p. 18]. Other factors, that may play

a major role for the values of jitter and shimmer, are the recording hardware

(e.g. microphone) and the sampling rate of the signal, which can also have an

influence on the calculation of the acoustic parameters [May13, p. 140].

These values are often valid for clear and well articulated speech, e.g. the

articulation of a vowel held for a few seconds. The participants of the experi-

ment produced spontaneous speech and thus the values for jitter and shimmer

are not expected to match the thresholds for pathological voices, but to be

higher instead.

The Praat Voice Report offers five different values for jitter (Jitter (local),

Jitter (local, absolute), Jitter (rap), Jitter (ppq5), Jitter (ddp)) and six different

values for shimmer (Shimmer (local), Shimmer (local, dB), Shimmer (apq3),

Shimmer (apq5), Shimmer (apq11), Shimmer (dda)). The difference between

1Mayer proposes values for healthy voices between 0.5 and 1 % [May13, p. 145],Nawka et al. propose values between 0.1 and 1 % [NFG06, p. 18].

45

4 Method

these values stems from their calculation with different algorithms. It is unclear

which value describes the phenomena best. For this thesis the values of Jitter

(local) and Shimmer (local) were chosen. Jitter(local) is the average absolute

difference between consecutive periods, divided by the average period [Boe11],

Shimmer(local) is the average absolute difference between the amplitude of the

consecutive periods, divided by the average amplitude [Boe03].

Harmonicity

Harmonicity is defined as the proportion of harmonic and non-harmonic (noisy)

parts of a signal [May13, p. 145]. Based on the assumption that an acoustic

signal can be divided into harmonic and non-harmonic parts, the noise-to-

harmonics ratio (NHR) (proportion of non-harmonics to harmonics) and the

harmonics-to-noise ratio (HNR) (proportion of harmonics to non-harmonics)

can be calculated. Bigger parts of non-harmonics are often perceived as aspi-

rated speech and hoarseness [YSO84, YWB82].

Next to NHR and HNR the Praat Voice Report gives the value of mean

autocorrelation, which measures the similarity between neighbouring periods

and returns the probability of their agreement [May13, p. 145]. In this thesis

the value of HNR (in decibel dB) is used, i.e. the degree of periodicity. The

threshold value for the classification of healthy and pathological voices is under

20 dB [MS08, p. 26]. A disadvantage of the HNR measure is the dependency

on minimal perturbations of frequency and amplitude (jitter and shimmer)

[Mur99, FMK98] 2.

4.2 Corpus analysis

The German conversations corpus (GeCo) was recorded by Schweitzer and

Lewandowski from the Institute for Natural Language Processing of the Uni-

versity of Stuttgart in 2013 [SL13, SLD14]. Previously unacquainted women

were asked to have a conversation with each other about topics of their choos-

ing. All of them were native speakers of German. The dialogues were then

recorded for about 25 minutes in a sound-attenuated room under two different

2Yumoto et al. suggested that jitter contributes to the magnitude of the noise componentsin the harmonics-to-noise-ratio (HNR) [YSO84].

46

4.2 Corpus analysis

conditions: a unimodal and a multimodal condition. In the unimodal condi-

tion the participants were not able to see each other but they were able to

listen and to talk to their conversational partner via head-set microphones.

In the second round, the multimodal condition, the participants were able to

see each other through a transparent screen. The analyses for this thesis were

done on the multi-modal condition of the dialogues, due to the fact that that

multimodality is known to increase convergence even in the speech modality

[DR10, SL13, Bab12]. Each of the the eight participants then had a total of

six dialogues with other participants. Thus a total of 24 dialogues was pro-

duced. For the analyses of this thesis the dialogues with speaker F and her

conversational partners were chosen (six dialogues, see Figure 4.2).

F J

KA

C

D H

Figure 4.2: Dialogues of speaker F with six different conversational partners.

After the recordings the participants filled in a questionnaire in which they

rated their conversational partners [SL13]. They had to make statements about

their impressions of the partner’s social attractiveness (likeable, kind, social,

relaxed) and her competence (intelligent, competent, successful, self-confident)

on a 5-point Likert scale. These values were then transformed to values from

-2 and 2 and added afterwords to a composite overall likeability score with a

range from -8 to 8. The values for social attractiveness and competence are

included in the discussion of results (Chapter 6.2.1).

For the analysis, samples of each speaker from each of the chosen dialogues

(every dialogue with speaker F) were manually extracted. Therefore a min-

imum of twenty samples were taken from each the first five minutes of the

dialogue as well as the last five minutes. This was done in order to compare

the voice quality from the beginning and the end of the dialogue. Samples

47

4 Method

which contained laughter, breathing or very glottalized speech were discarded.

Afterwards a Praat Voice Report was generated from each sample. The data

was then entered in IBM SPSS Statistics 21 [Inc12].

In order to detect outliers boxplots were generated (see Figure 4.3). Outliers

are defined as values that fall more than 1.5 to 3 box lengths from the upper

or lower hinge of the box. They are marked by a circle (◦). Extreme outliers

then are values that are more than 3 box lengths away from either hinge of

the box, marked by a star (∗) [JL13, p. 242]. Table 4.1 shows the sum of

the outliers and extreme outliers found for the beginnings and ends of the

individual speakers.

Figure 4.3: The boxplot represents the values of the minimal fundamental fre-quency from the beginning of the dialogue of speaker F with speakerJ. It includes the sample 2 with the error in measurement (see Figure6.1) which is classified as an extreme outlier.

Overall, extreme outliers were discarded in the analyses. A different proce-

dure in sorting out the outliers was made in the sample collections for each

speaker in the individual dialogues (about 20 samples for beginning/end for

each speaker) and the sample collection of speaker F (about 130 samples for

beginning/end). This means that some samples that are discarded for speaker

F in the individual dialogues are included in the sample collection of speaker F

for all dialogues. The reason for discarding outliers separately was that speaker

F might have shown an abnormal value in one of the parameters in a single

48

4.2 Corpus analysis

dialogue, but that this value matches the values that speaker F produced in

all dialogues. Thus more accurate analyses might be achieved. An overview of

the outliers can be seen in Table 4.1 and Table 4.2. Detailed tables of outliers

are shown in Appendix A (Outliers - Individual speakers) and Appendix B

(Outliers - Speaker F).

Voice parameter F0 Mean F0 SD F0 Min F0 Max Jitter Shimmer HNR

Outliers (◦) 15 8 17 20 21 17 11Extreme (∗) 5 1 11 1 3 1 2outliers

Total 20 9 28 21 24 18 13

Table 4.1: Sum of outliers (identified by boxplots) from the sample collections fromthe beginnings and ends of each individual speaker. A detailed table canbee seen in Appendix A.

Voice parameter F0 Mean F0 SD F0 Min F0 Max Jitter Shimmer HNR

Outliers (◦) 6 5 0 11 8 2 3Extreme (∗) 0 0 0 4 1 0 0outliers

Total 6 5 0 15 9 2 3

Table 4.2: Sum of outliers (identified by boxplots) from the sample collections fromthe beginnings and ends of speaker F. Samples were taken from all sixdialogues in which speaker F was involved. A detailed table can beeseen in Appendix B.

49

5 Results

Three statistical analyses were done with IBM SPSS Statistics 21 [Inc12]. Dif-

ferences were calculated with the help of paired t-tests and ANOVAs (analysis

of variance). Thereby values below or equal 0.05 are considered to be signifi-

cant. The goals of the analyses were the following:

• Analysis 1: Individual speakers

Samples from the beginning and the end for each speaker in each dialogue

were compared with each other with the help of paired t-tests. Significant

parameters were supposed to have changed over the course of the dialogue

(according to accommodation to the conversational partner). In this

analysis the individual speakers in the different dialogues were in the

focus. The goal was to find out which voice parameters changed in each

dialogue for each speaker.

• Analysis 2: Speaker F

All samples that speaker F uttered in the beginnings of the six dialogues

were compared with all samples from the ends with the help of paired t-

tests. The goal of this analysis was to identify those parameters speaker

F changed in all dialogues she had with different conversational partners.

• Analysis 3: Position in dialogue

Samples from the conversational partners in the individual dialogues were

compared between the beginning and the end of the conversation with the

help of ANOVAs. It is supposed that some parameters have changed from

the beginning and the end of the dialogue in order of accommodation.

In the following chapters (Chapters 5.1-5.3) the results of the analyses are

described.

51

5 Results

5.1 Analysis 1 - Individual speakers

In this section, samples from the beginnings and ends of the dialogues are

compared with each other for every individual speaker within the dialogues.

Paired t-tests were used to show which parameters changed for the individual

speakers during the dialogues. In the following the speakers F and D in their

conversation are analysed. The results of the other five dialogues are shown in

the summarizing Table 5.6.

Speaker D

Table 5.1 shows the descriptive statistics for the mean values of speaker D in

the dialogue with F. In Table 5.2 the results of the paired t-tests for speaker D

at the beginning and the end of the dialogue with speaker F are presented.

Parameter Position Mean N SD St. Error Mean

Pair 1 F0 Mean beginning 231.44841 22 19.494032 4.156142end 232.93950 22 12.807842 2.730641

Pair 2 F0 SD beginning 20.93773 22 6.026298 1.284811end 28.94427 22 12.411122 2.646060

Pair 3 F0 Min beginning 182.06523 22 18.481850 3.940344end 154.76977 22 50.631920 10.794762

Pair 4 F0 Max beginning 313.69659 22 31.624643 6.742397end 330.85964 22 31.889633 6.798893

Pair 5 Jitter beginning 1.83905 22 0.375192 0.079991end 1.85718 22 0.377235 0.080427

Pair 6 Shimmer beginning 7.05686 22 1.272993 0.271403end 6.75695 22 0.938097 0.200003

Pair 7 HNR beginning 18.75591 22 1.531032 0.326417end 18.39091 22 1.756843 0.374560

Table 5.1: Mean, standard deviation (SD) and standard error mean for the com-parison of the samples of speaker D from the beginnings and ends of thedialogue with speaker F.

Results show that speaker D changed on the variability of F0 (see Table 5.2)

as significant changes were found for F0 SD (0.021), F0 Min (0.029) and F0

Max (0.049). The mean values (see Table 5.1) show that F0 Min decreased

between beginning and end (from 182.07 Hz to 154.77 Hz) and F0 Max in-

creased (from 313.70 Hz to 330.86 Hz). As a consequence also the value of F0

SD increased from (20.94 Hz to 28.94 Hz).

52


Mean SD Std. t df Sign.Error (2-tailed)Mean

Pair 1 F0 Mean -1.491091 18.110810 3.861238 -0.386 21 0.703Pair 2 F0 SD -8.006545 15,100058 3.219343 -2.487 21 0.021Pair 3 F0 Min 27.295455 54.519820 11.623665 2.348 21 0.029Pair 4 F0 Max -17.163045 38.508415 8.210022 -2.090 21 0.049Pair 5 Jitter -0.018136 0.422668 0.090113 -0.201 21 0.842Pair 6 Shimmer 0.299909 1.602436 0.341641 0.878 21 0.390Pair 7 HNR 0.365000 2,283732 0.486893 0.750 21 0.462

Table 5.2: Results of the paired t-tests, comparing samples of speaker D from thebeginning and the end. Significant changes were found for F0 SD, F0Min and F0 Max.

Speaker F

The results of speaker F, comparing samples from the beginning and the ends of

the conversation with speaker D, are shown in Table 5.4. Similar as for speaker

D the values of F0 SD (0.000) and F0 Min (0.001) were shown to be highly

significant. Mean values from Table 5.3 show that F0 Min increased from the

beginning to the end (from 97.03 Hz to 133.15 Hz) and (as a consequent effect)

F0 SD decreased (from 39.26 Hz to 22.53 Hz).









Table 5.3: Mean, standard deviation (SD) and standard error mean for the com-parison of the samples of speaker F from the beginnings and ends of thedialogue with speaker D.

53

5 Results


Pair 1 F0 Mean 6.944250 22.846937 5.108730 1.359 19 0.190Pair 2 F0 SD 16.724850 13.373057 2.990306 5.593 19 0.000Pair 3 F0 Min -36.124700 41.604984 9.303157 -3.883 19 0.001Pair 4 F0 Max 26.091250 69.397917 15.517846 1.681 19 0.109Pair 5 Jitter -0.107100 0.600917 0.134369 -0.797 19 0.435Pair 6 Shimmer 0.368650 2.094771 0.468405 0.787 19 0.441Pair 7 HNR -0.240500 2.701613 0.604099 -0.398 19 0.695

Table 5.4: Results of the paired t-tests, comparing samples of speaker F from thebeginning and the end within the dialogue with speaker D. Significantchanges of F0 SD and F0 Min were found.

Table 5.5 shows a summary of the results of the paired t-tests compar-

ing samples from the beginning of speaker D and speaker F. Both exhibited

changes in the values of F0, namely F0 SD and F0 Min. Speaker D also showed

significant changes in F0 Max. While F0 variation increased for speaker D, it

decreased for speaker F.

Speaker F0 Mean F0 SD F0 Min F0 Max Jitter Shimmer HNR

D ∗ ∗ ∗F ∗ ∗

Table 5.5: Significant changes of voice correlates for the speakers D and F in theirdialogue. Significant values are marked by ∗.

54


Summary of the results

Table 5.6 shows a summary of the results of the paired t-tests for all dialogues.

Samples from the beginning and the end for each speaker within the single

dialogues were compared. Detailed results of the analyses can be found in

Appendix C.


A (∗) ∗F ∗ ∗ ∗CF (∗) ∗D ∗ ∗ ∗F ∗ ∗H ∗F

J (∗) (∗) ∗ ∗F (∗)

KF ∗ ∗

Table 5.6: Significant changes of voice correlates for each speaker within the singledialogues. Significant values are marked by ∗, tendencies are marked by(∗).

In the dialogue between the speakers A and F HNR showed to be signifi-

cant (0.026) for speaker A. From the beginning to the end of the dialogue it

increased from 15.71 dB to 16.79 dB. Shimmer exhibited a tendency towards

being significant (0.066). It decreased from 9.50 % to 8.71 %. Speaker F had

significant changes in the values of F0, namely F0 SD (0.004), F0 Min (0.020)

and F0 Max (0.007). As F0 Min increased (from 129.02 to 150.67 Hz) and F0

Max decreased (from 299.24 Hz to 266.68 Hz), F0 SD decreased (from 27.24 Hz

to 16.95 Hz). The conversational partners did not exhibit changes on the same

parameters. Detailed results of the paired t-tests can be seen in Appendix

C.1.

Speaker C in the dialogue with speaker F showed no significant changes at

all. Speaker F changed on values of F0. F0 Min changed significantly (0.004)

as it decreased (from 131.65 Hz to 101.66 Hz) and F0 Mean showed a tendency

towards being significant (0.061) as it also decreased (from 203.54 Hz to 194.65

Hz). In Appendix C.2 detailed results are presented.

55

5 Results

In the dialogue of speaker H and F only speaker F changed significantly on

the parameter of F0 SD (0.037) as it decreased (from 15.51 Hz to 14.28 Hz).

Detailed results of the paired t-tests are shown in Appendix C.3.

Speaker J in the dialogue with F exhibited changes on the values of F0. F0

Mean (0.054) and F0 SD (0.053) showed a tendency towards being significant

and F0 Min (0.007) and F0 Max (0.032) proved to be significant. As F0 Min

increased (from 121.85 Hz to 147.47 Hz) and F0 Max decreased (from 352.53 Hz

to 324.47 Hz), F0 SD decreased (35.58 Hz to 29.16 Hz). F0 Mean also decreased

from (229.59 Hz to 219.89 Hz). Speaker F, in the same dialogue, exhibited a

tendency for a significant change on the parameter of F0 Min (0.056). It

decreased (from 163.30 Hz to 141.98 Hz). Appendix C.4 shows detailed results

for the values of speaker J and F.

In the dialogue between the speakers K and F only speaker F significantly

changed on F0 SD (0.014) and HNR (0.044). F0 SD decreased (from 24.11 Hz

to 18.62 Hz) and HNR increased (19.18 db to 20.46 Hz). Detailed results are

presented in Appendix C.5

The results of the analyses of the single speakers within dialogues, summa-

rized in Table 5.6, show that there is on overall tendency for F0 towards being

significant, especially F0 SD and F0 Min (as they both have been significant

five times each and showed a tendency towards being significant once). Also

there have been significant changes in F0 Max (three times) and HNR (two

times). Shimmer and F0 Mean showed a few tendencies towards being sig-

nificant (F0 Mean two times, Shimmer one time), whereas jitter was never

significant. Conversational partners did not changed on the same voice pa-

rameters (e.g. in the dialogue of the speakers A and F). In addition, speaker

F, involved in all six dialogues, mainly changed on F0, but did not change in

every dialogue.

5.2 Analysis 2 - Speaker F

The second analysis was done with the help of paired t-tests. Samples from

the beginning and end from speaker F from all dialogues were taken (resulting

in 137 samples for the beginning and 129 for the end; less discarded extreme

outliers). The descriptive statistics in Table 5.7 shows the mean values for

each voice parameter for the beginnings and ends for all six dialogues in which

56

5.2 Analysis 2 - Speaker F

speaker F was involved. The results of the paired t-tests can be seen in Ta-

ble 5.8 and a summary of the results in Table 5.9. Calculating the difference

between voice parameters from the beginning and the end showed on which

voice parameters speaker F changed.









Table 5.7: Mean, standard deviation (SD) and standard error mean for the com-parison of the samples of speaker F from the beginnings and the endsof the dialogues. All values from all six dialogues were taken.


Pair 1 F0 7.230597 23.929435 2.106869 3.432 128 0.001Pair 2 F0 SD 6.781465 13.449280 1.184143 5.727 128 0.000Pair 3 F0 Min -7.132411 47.969186 4.223451 -1.689 128 0.094Pair 4 F0 Max 26.441448 49.541430 4.431120 5.967 124 0.000Pair 5 Jitter 0.026789 0.735625 0.065021 0.412 127 0.681Pair 6 Shimmer 0.200124 2.199984 0.193698 1.033 128 0.303Pair 7 HNR -0.426535 3.063233 0.269703 -1.582 128 0.116

Table 5.8: Results of the paired t-tests, comparing parameters from samples fromthe beginning and end of speaker F of all dialogues. F0 Mean, F0 SDand F0 Max have shown to be significant.


F ∗ ∗ ∗

Table 5.9: Significant changes of voice correlates for speaker F. Significant valuesare marked by ∗.

The overall analyses for speaker F showed that she changed significantly on

F0, namely F0 Mean (0.001), F0 SD (0.000) and F0 Max (0.000) (all values

57

5 Results

decreased). The parameters of F0 Min, Jitter, Shimmer and HNR did not

prove to be significant.

5.3 Analysis 3 - Position in dialogue

The following analysis shows the influence of the position (beginning and end)

of the dialogue. Therefore one-way ANOVAs (analysis of variance) were con-

ducted. For every single voice parameter an analysis was done, in which the

voice parameter was the dependent variable and the position (beginning or end

of the dialogue) the fixed factor. Detailed results are shown for the dialogue

of the speakers C and F. The other results are then presented in a summary.

Detailed values of the ANOVAs are shown in Appendix D.

Dialogue C-F

Table 5.10 shows the mean values (Mean), the standard deviation (SD) and the

Standard Error Mean (St. Error Mean) for the mean fundamental frequency

measurements of the dialogue of the speakers C and F.

F0 Mean

Position Speaker Mean N SD Std. Error Mean

beginning C 216.67610 29 16.455194 3.055653beginning F 203.86271 21 15.634375 3.411700

ending C 218.16552 29 15.574690 2.892147ending F 194.73200 20 9.603569 2.147423

Table 5.10: Mean, Standard deviation (SD) and Standard Error Mean for the meanfundamental frequency (F0 Mean) in Hertz (Hz) of the speakers C andF at the beginning and at the end of the dialogue.

The mean values of F0 Mean for both speakers, speaker C and speaker F, did

not vary much from the beginning and end in their dialogue. Speaker C showed

an F0 Mean value of 216.77 Hz at the beginning and a value of 218.17 Hz at

the end, which is a minimal change of 1.4 Hz. Speaker F had a F0 Mean

value of 203.86 Hz at the beginning and a value of 194.73 Hz at the end of the

dialogue. The difference was 9.13 Hz.

58


Source Type III df Mean F Sig.Sum of Square

Intercept 4321380.850 1 4321380.850 14043.360 0.000Position 189.073 1 189.073 0.614 0.435Error 29540.834 96 307.717

Table 5.11: Results of the ANOVA for F0 Mean for the speakers C and F at thebeginning and at the end of the dialogue.

The result of the ANOVA in Table 5.11 show that the values for F0 Mean

did not changed significantly from the beginning and the end of the dialogue

(0.435).

F0 SD

Table 5.12 shows mean values (Mean), the standard deviation (SD) and the

Standard Error Mean (St. Error Mean) for the standard deviation of F0 SD.



ending C 32.05234 29 10.207541 1.895493ending F 21.98543 21 8.624827 1.882092

Table 5.12: Mean, Standard deviation (SD) and Standard Error Mean for the stan-dard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) ofthe speakers C and F at the beginning and at then end of the dialogue.

Speaker C showed a mean value for F0 SD of 33.43 Hz at the beginning of

the dialogue and a value of 32.05 Hz at the end of the dialogue. The difference

of 1.38 Hz was minimal. Speaker F showed a value of 21.18 Hz at the beginning

and a value of 21.99 Hz at the end of the dialogue. Also here the difference of

0.81 Hz was minimal.



Table 5.13: Results of the ANOVA for F0 SD for the speakers C and F at thebeginning and at the end of the dialogue.

59

5 Results

Table 5.13 shows the results of the ANOVA for the dialogue of C and F.

No significant changes were found for F0 SD for the beginning and end of the

dialogue (0.895).

F0 Min

Table 5.14 shows the mean values (Mean), the standard deviation (SD) and

the Standard Error Mean (Std. Error Mean) for F0 Min of the speakers C and

F at the beginning and end of their dialogue.



ending C 113.03931 26 27.499529 5.393101ending F 101.65800 18 15.192292 3.580858

Table 5.14: Mean, Standard deviation (SD) and Standard Error Mean for the min-imal fundamental frequency (F0 Min) in Hertz (Hz) of the speakers Cand F at the beginning and end of the dialogue.

At the beginning of the dialogue speaker C exhibited a mean value of

125.65 Hz and at the end a value of 113.04 Hz. The value decreased about

12.61 Hz. Speaker F showed a value of 129.93 Hz at the beginning and a value

of 101.66 Hz at the end of the dialogue. The values lowered about 28.27 Hz.



Table 5.15: Results of the ANOVA for F0 Min for the speakers C and F at thebeginning and at the end of the dialogue.

Table 5.14 shows the results of the ANOVA. The parameter of F0 Min has

in the dialogue of the speakers C and F has shown to be highly significant

(0.000).

F0 Max

Table 5.16 shows the mean value (Mean), the standard deviation (SD) and the

Standard Error Mean (Std. Error Mean) for F0 Max.

60




ending C 319.56734 29 6.145114 33.092452ending F 266.61476 21 6.128564 28.084610

Table 5.16: Mean, Standard deviation (SD) and Standard Error Mean for the max-imal fundamental frequency (F0 Max) in Hertz (Hz) of the speakers Cand F at the beginning and end of the dialogue.

Speaker C’s mean values for F0 Max decreased minimally about 2.21 Hz

from 321.76 Hz at the beginning to 319.57 Hz at the end of the dialogue.

Speaker F showed a value from 285.37 Hz at the beginning of the dialogue and

a value of 266.61 Hz at the end. The decrease was by 18.76 Hz.



Table 5.17: Results of the ANOVA for F0 Max for the speakers C and F at thebeginning and at the end of the dialogue.

Table 5.17 shows the results of the ANOVA for F0 Max for the speakers C

and F at the different positions of the dialogue. The parameter did not show

to be significant (0.316).

Jitter

Table 5.18 shows values for jitter at the beginning and end of the dialogue

of the speakers C and F. Values are shown for the mean (Mean), standard

deviation (SD) and standard error mean (Std. Error Mean).



ending C 2.46824 29 0.813099 0.150989ending F 2.21362 21 0.450859 0.098386

Table 5.18: Mean, Standard deviation (SD) and Standard Error Mean for the Jit-ter in percent of the speakers C and F at the beginning and end thedialogue.

61

5 Results

Speaker C and F both exhibited mean values about 2.2 % for jitter. Speaker

C showed a value of 2.27 % at the beginning of the dialogue and a small increase

of 0.2 % to the end, leading to a value of 2.47 %. Speaker F also shows a small

increase for the mean value of 0.2 % from 2.01 % at the beginning to 2.21 %

at the end of the dialogue.



Table 5.19: Results of the ANOVA for jitter for the speakers C and F at the be-ginning and at the end of the dialogue.

Table 5.19 shows the results of the ANOVA made for jitter in the dialogue

of the speakers C and F at the beginning and end. The result shows that there

is a tendency for the changes in jitter to be significant (0.058).

Shimmer

Table 5.20 shows the mean values (Mean), the standard deviation (SD) and

the standard error mean (Std. Error Mean) for shimmer in the dialogue of C

and F at the beginning and end of their dialogue.



ending C 8.37276 29 2.160529 0.401200ending F 7.31910 21 1.224337 0.267172

Table 5.20: Mean, Standard deviation (SD) and Standard Error Mean for the shim-mer in percent of the speakers C and F at the beginning and end thedialogue.

Both speakers have shimmer values about 7 and 8 %. Speaker C had in

increase of 0.69 % from 7.68 % at the beginning of the dialogue to 8.37 %

at the end. Speaker F ’s shimmer value minimally decreased from 7.69 % to

7.32 % about 0.37 %.

62




Table 5.21: Results of the ANOVA for shimmer for the speakers C and F at thebeginning and at the end of the dialogue.

Table 5.21 shows the results of the ANOVA done for the dialogue of the

speakers C and F at the beginning and end of their dialogue. Shimmer has

not proven to be significant (0.393).

HNR

Table 5.22 shows the mean values (Mean), standard deviation (SD) and stan-

dard error mean (St. Error Mean) of HNR for the speakers C and F at the

beginning and end of their dialogue.



ending C 18.95059 29 3.150715 0.585073ending F 20.03333 21 1.729479 0.377403

Table 5.22: Mean, standard deviation and standard error mean for the harmonics-to-noise ratio (HNR) in decibel (dB) of the speakers C and F at thebeginning and end the dialogue.

Speaker C shows a small decrease of 0.16 dB from 19.11 dB at the beginning

to 18.95 dB at the end of the dialogue. Speaker F also exhibited a small

decrease of 0.18 dB from 20.21 dB at the beginning and 20.03 dB at the end

of the dialogue.



Table 5.23: Results of the ANOVA for harmonics-to noise ratio (HNR) for thespeakers C and F at the beginning and at the end of the dialogue.

63

5 Results

Table 5.23 shows the results of ANOVA for HNR for the beginning and end

of the dialogue of the speakers C and F. HNR has not proven to be significant

(0.705).

Summary of the results

A summary of the ANOVAs for all dialogues can be seen in Table 5.24. Detailed

results can be seen in Appendix D.

Dialogue partners F0 Mean F0 SD F0 Min F0 Max Jitter Shimmer HNR

A & F ∗C & F ∗ (∗)D & FH & FJ & F (∗) ∗K & F

Table 5.24: Results done with ANOVAs for all dialogues at the beginning and endfor the voice parameters of F0 Mean, F0 SD, F0 Min, F0 Max, Jitter,Shimmer and HNR.

In the dialogue between speakers A and F F0 SD has shown to be significant

(0.036). At the beginning of the dialogue speaker A had a value of 40.96 Hz and

a minimal decrease of 3.41 Hz to 37.54 Hz at the end of the dialogue. Speaker

F showed a strong decrease about 10.36 Hz from 27.31 Hz at the beginning to

16.95 Hz at the end of the dialogue. Detailed results of the analyses can be

seen in Appendix D.1

No significant changes of parameters were found for the dialogues of speaker

F with the speakers D, H and K. Appendices D.2, D.3 and D.5 show detailed

results of the analyses.

In the dialogue between the speakers J and F F0 Mean showed a tendency

for being significant (0.062). Speaker J showed a value of 229.51 Hz at the

beginning of the dialogue and a decrease of 9.62 Hz to 219.89 Hz at the end of

the dialogue. Speaker F showed a small decrease of 1.01 Hz from 211.08 Hz at

the beginning to 210.07 Hz at the end of the dialogue. Also the parameter of

F0 Max proved significant (0.041). At the beginning of the dialogue speaker

F showed a value of 349.95 Hz and a strong decrease of 25.48 Hz to 324.47 Hz

at the end. Speaker F exhibited a small decrease of 3.49 Hz from 276.54 Hz

at the beginning to 273.05 Hz at the end. Detailed results can be found in

Appendix D.4.

64

6 Discussion

In the present chapter the method used (Chapter 6.1) and the results (Chapter

6.2) are discussed. Afterwards, F0 is considered, as most significant changes of

the analyses occurred on features of F0 (Chapter 6.3). As other voice param-

eters were found to be significant in other studies, the influence of temporal

aspects are discussed (Chapter 6.4). At the end the influence of engagement

is examined as engagement rather than rapport has been presumed to have an

influence on accommodation/convergence (Chapter 6.5).

6.1 Discussion of method

The method of the Praat Voice Report analyses various parameters, including

those of F0, perturbation and harmonicity. For F0, the mean, the standard

deviation, the minimum and the maximum were taken for the analysis, which

gives a global view of the average and variation of the fundamental frequency.

Jitter and shimmer were taken for the analysis of perturbation in voice, so

that quality and irregularities of the frequency and the amplitude can be deter-

mined. From the different possible algorithms Jitter(local) and Shimmer(local)

were chosen. For measures of harmonicity the measure of harmonics-to-noise

ratio (HNR) was taken, which mirrors the degree of acoustic periodicity.

With the help of boxplots, generated with the help of SPSS [Inc12], ex-

treme outliers were discarded for the analyses. Notably, most extreme outliers

occurred for the minimum of the fundamental frequency for the individual

speakers (see Table 4.1, for a detailed table see Appendix A). An explanation

for that could be errors in measurements, as happened, for instance, for sam-

ple 2 from speaker F (see Figure 6.1). In this sample the Praat Voice Report

gave an F0 Min of 88.68 Hz. Due to the glottalization of some vowels in her

utterance the pitch was calculated incorrectly, which led to false values which

would have negatively influenced the analysis.

65

6 Discussion

Figure 6.1: Error of measurement that caused an extreme outlier in F0 Min.Sample 2 from speaker F from the beginning of the dialogue with J.(Utterance: “Ja, nee, ich hab hospitiert, also hinten drin beobachtetund vorbereitet und so, aber man redet ja immer trotzdem ziemlichviel.”)

Figure 6.2: Error of measurement in Praat that caused a wrong calculation of F0Max (and also F0 Min). Utterance: “Ja, genau, da wurd ich, wo ichumgekippt bin.”

The values of the maximum fundamental frequency should also be regarded

cautiously as errors in measurements can occur. In the boxplots made for

speaker F (for the beginnings and ends of all the dialogues she was involved

in) most outliers occurred for the maximum of the fundamental frequency

(see Table 4.2, for a detailed table see Appendix B). An example of a false

calculation can be seen in Figure 6.2, in which a false maximum fundamental

66

6.1 Discussion of method

frequency was calculated (379.18 Hz) due to glottalization.

Therefore values for the minimum and maximum fundamental frequency

should be read with caution. Overall, it can be assumed that errors in mea-

surement were not included in the analyses due to discarding extreme out-

liers.

Additionally, analyses of voices are usually done for held vowels [May13,

p. 169], mostly for the analysis of pathological voices. The present analyses

were done for samples of spontaneous speech, thus it can be expected that

values for jitter, shimmer and harmonics-to-noise ratio are not beneath the

threshold values for healthy and pathological voices set for these parameters.

Nevertheless the measured values have been constant in the study, as can be

seen in Table 6.1.

The values of jitter vary between 1.857 % and 2.533 % (see Table 6.1, which is

higher than the threshold for pathological voices beneath 1 % [May13, p. 145],

and the values for shimmer lay between 6.757 % and 9.512 %, which is also

much higher than the thresholds of 2.5 % [NFG06, p. 18] and 3.81 % [May13,

p. 156]. The values for the harmonics-to-noise ratio vary between 15.715 dB

and 19.772 dB and lie beneath the threshold for pathological voices of 20 dB

[MS08, p. 26].

As measured values for jitter, shimmer and HNR have been constant across

the analyses (see Table 6.1), although they cannot be compared to those mea-

sured in held vowels, it can be assumed that the Praat Voice Report is an ap-

propriate measure for spontaneous speech, even if not all parameters are useful.

Voicing parameters, i.e. pauses, occur more often in spontaneous speech than

in held vowels, such as in the usage of plosives in the speech stream or short

pauses, e.g. for breath taking.

A parameter that would also be interesting to include in the analysis of voice

would be vocal intensity, thus the analysis of amplitudes (as it is partly done

with the parameter of shimmer). As vocal intensity is perceived as loudness,

the question arises if speakers might converge in the height or form of ampli-

tudes. In the experiment of De Looze and colleagues [LORC11] voice intensity

was found to become more similar between conversational partners (among

other parameters). Analyses have also been done for individual words with

amplitude envelopes [Lew12] as a measure of spectral similarity.

67

6 Discussion

Parameter Speaker Position Mean Standard deviation N

Jitter F beginning 2.077 0.503 136end 2.047 0.545 129

A beginning 2.517 0.590 24end 2.315 0.378 23

C beginning 2.269 0.420 28end 2.468 0.813 29

D beginning 1.910 0.414 25end 1.857 0.377 22

H beginning 2.247 0.574 25end 2.415 0.708 22

J beginning 2.302 0.509 26end 2.173 0.292 23

K beginning 2.280 0.510 22end 2.533 0.539 21

Shimmer F beginning 7.482 1.497 137end 7.293 1.617 129

A beginning 9.504 1.457 24end 8.079 1.416 24

C beginning 7.676 1.395 29end 8.372 2.161 29

D beginning 7.088 1.194 25end 6.757 0.938 22

H beginning 7.102 1.296 25end 7.732 1.709 22

J beginning 9.291 1.350 26end 9.512 1.721 24

K beginning 7.164 1.376 22end 7.484 1.097 21

HNR F beginning 19.354 2.129 137end 19.772 2.185 129

A beginning 15.715 1.725 24end 16.789 1.438 24

C beginning 19.113 2.105 28end 18.951 3.151 29

D beginning 18.450 1.683 25end 18.391 1.757 22

H beginning 19.063 1.634 25end 18.905 1.526 22

J beginning 16.459 1.905 26end 16.726 2.049 24

K beginning 18.487 1.899 22end 17.849 1.584 21

Table 6.1: Mean and standard deviation values for jitter, shimmer and harmonics-to-noise ratio of speaker F and her conversational partners, measuredfrom samples from the beginnings and end of the six dialogues. Jitterand shimmer are measured in percent (%), the harmonics-to-noise ratio(HNR) in decibel (dB).

68

6.2 Discussion of results


In the following the results of the three analyses are discussed (Chapter 5.1-

5.3). For the first analysis (Individual dialogues) the speakers’ ratings about

their conversational partner are included. Afterwards the results of Analysis

2 (Speaker F) and Analysis 3 (Position in dialogue) are treated.

6.2.1 Discussion of Analysis 1 - Individual speakers

Analysis 1 (Individual speakers) revealed that speakers mainly changed on F0

and that speakers behaved differently, e.g. speaker F did not exhibit changes

in every dialogue and speakers seem to change or maintain in parameters,

independent of whether their conversational partner does. In the following

the results of Analysis 1 are discussed. Ratings of the speakers about their

conversational partner’s social attractiveness and competence (each on a scale

from -8 to 8) are included.

Dialogue A-F

In the dialogue between the speakers A and F, both showed significant changes

for different parameters: speaker F changed on the values of F0 SD, F0 Min

and F0 Max (which led to little variation and thus monotonous speech), while

speaker A changed on HNR (sounding more hoarsely) and exhibited a tendency

for shimmer being significant (leading to a less breathy or rough voice). As the

significant values of speaker F departed from those of speaker A (see Appendix

C.1), divergence can be assumed for speaker F. Speaker A’s values, on the

other hand, approached those of speaker F. Thus for her convergence can be

assumed.

The ratings from speaker F about speaker A do not agree with the above

described parameter values as the ratings are quite high (both 6). On the other

hand, speaker A’s values of HNR and shimmer approached those of speaker

F, which leads to the assumption that she converged towards speaker F. The

high rating for social attractiveness (8) confirms this result.

69

6 Discussion

Dialogue C-F

In the dialogue between the speakers C and F only speaker F changed: signif-

icant results were found for F0 Min and a tendency towards being significant

for F0 Mean. For the minimum of F0 both speakers lowered their values (the

lowering led to more variability in the voice), speaker F more than speaker C

(see Appendix C.2). This can be interpreted differently: as both speakers show

the same behaviour as they decrease their values, they are acting synchronous

and this can be interpreted as convergence. Another explanation might be

that both speakers diverged due to the fact that the distance between the

values from the beginning and those from the end increased. The values for

F0 Mean of speaker F can be interpreted as divergence as the lowering of the

value (perceived as a deeper voice) leads to an increase of the distance between

the conversational partners.

As both speakers rated each other with quite positive values for social attrac-

tiveness and competence (F about C: 5 and 5, C about F: 4 and 5) convergence

for F0 SD by acting synchronously can be assumed. Another explanation might

be that both largely maintained as speaker F only significantly changed on one

parameter.

Dialogue D-F

Significant parameters in the dialogue of the speakers D and F were F0 SD as

well as F0 Min for both speakers and additionally F0 Max for speaker D. The

values of F0 Min can be interpreted as convergence as speaker D is decreasing

in her values towards speaker F (and thus exhibits more variance in her speech)

and speaker F is increasing in her values (less variance) towards speaker D (see

Chapter 5.1).

The values for F0 Max, on the other hand, can be interpreted as divergence

as they drift apart at the end of the dialogue. The values for speaker D

increased (which leads to more variance), while those of speaker F decreased

(leading to less variance).

F0 SD, as a value mirroring F0 Min and F0 Max and thus the variation of F0,

can be interpreted differently. The values for speaker D at the beginning are

smaller, compared to those of speaker F, and increased in the end, whereas it

is the opposite for speaker F. An interpretation of the results for F0 SD is thus

70


difficult: one explanation might be that both tried to converge and thereby

increased or decreased too much, so that both speakers did not succeeded to

find an exact balance. The fact that the difference for F0 SD at the beginning

strongly decreased till the end supports this explanation. At the end they

are not similar to the value of the conversational partner, but at least they

approached it. Another possibility is that both speakers diverged and that they

both showed asynchronous behaviour as speaker D’s value increased while that

of speaker F decreased.

Yet another explanation might be that their convergence or divergence de-

pended on the context, such as they had different opinions about certain topics

and that voice parameters varied according to their opinions and on possibly

subsequent changes of emotion and arousal. Feeling more excited about a

topic or being more involved in the conversation might have caused the value

changes at the end for speaker F. The ratings of speaker C about speaker F’s

social attractiveness and competence were high (6 and 7), those of speaker

F about speaker C lower, but still positive (4 and 6). This might indicate

that convergence should be more likely than divergence and that the values

for F0 SD and F0 Min could be rather be interpreted as convergence.

Overall, convergence can be observed for F0 Min and divergence for F0

Max. As Pardo points out “convergence does not result in exact matching

of in all paramters at all times for all interlocutors” [Par12, p. 763]. Thus it

might be possible that not all parameters (also including syntactical, lexical

and sublexical aspects) have to be similar to a certain point in time.

Dialogue H-F

In the dialogue of speakers H and F only speaker H significantly changed on

F0 SD. As she lowers her value towards that of speaker F (and thus exhibits

less variation, leading to a more monotonous speech) convergence can be as-

sumed (see Appendix C.3). Another explanation might be maintenance as she

only changed on one parameter. For the values of speaker F maintenance is

reasonable.

The ratings of speaker H about F were neutral (social attractiveness and

competence, both 0). Thus maintenance can be assumed. The ratings of

speaker F about speaker H are positive (5 and 4). Nevertheless she might have

maintained as well.

71

6 Discussion

Dialogue J-F

In the dialogue of speakers J and F speaker J was the only one who changed

on voice parameters, namely she increased in F0 Min and decreased in F0 Max

(leading to less variation and thus more monotonous speech). Additionally, F0

Mean and F0 SD showed tendencies towards being significant (both lowered,

leading to a deeper voice and monotonous speech). All values can be inter-

preted as convergence as they all approach those of speaker F (see Appendix

C.4). Speaker F, on the other hand, exhibited no significant changes at all, but

only tendency for F0 Min in the direction of the value of speaker J (speaker

F lowered in F0 Min, leading to more variation in speech). Thus it can be

assumed that both speakers converged.

The ratings of speaker J about her conversational partner agree with the F0

values observed as they are (slightly) positive (social attractiveness: 2, com-

petence: 3). The ratings from speaker F about speaker H are both positive,

as the value for social attractiveness is quite positive (4) and that of com-

petence very high (8). Thus the changes on F0 Min could be interpreted as

convergence.

Dialogue K-F

In the dialogue between the speakers K and F speaker F was the only one who

exhibited significant changes, namely on the parameters of F0 SD and HNR.

The lowering of the value of F0 SD (leading to less variation in speech) and

increasing the value of HNR (sounding more hoarsely) can be interpreted as

divergence as they drift apart from the values of speaker K. Maintenance can

be assumed for speaker K (see Appendix C.5).

These findings contradict the rating of speaker K about speaker F as it was

quite positive (social attractiveness: 4, competence: 3). The ratings of speaker

F about speaker K were also positive (5 and 4), nevertheless divergence can

be assumed.

6.2.2 Discussion of Analysis 2 - Speaker F

The second analysis dealt with changes of voice parameters for speaker F in all

six dialogues she had with different conversational partners. Results indicate

72


that she significantly changed on F0, mainly F0 Mean, F0 SD and F0 Max.

Compared to the first analysis, in which the voice parameters were analysed

within the individual dialogues, differences can be observed: although speaker

F tended to change on values of F0 in the first analysis, no significant changes

for F0 Mean were found (only one tendency, see Table 5.6). In Analysis 2

(Speaker F), conducted over six dialogues, significant changes for F0 Mean were

found (see Table 5.9). These different results might be caused by discarding

outliers for every single sample collection separately, namely outliers of the

single dialogues (see Table 4.1 and Appendix A) and the sample collection for

speaker F for all dialogues (see Table 4.2 and Appendix B). Thus the sample

collection of all single dialogues for speaker F are not identical to the sample

collection of all dialogues of speaker F. This distinction was made in order

to avoid having sift out samples that are abnormal for speaker F in a single

dialogue, but normal for her among all dialogues.

Overall, the results confirm those of Analysis 1 (Individual dialogues, see

Chapter 5.1) as changes there also mainly occurred for F0. Analysis 1 also

revealed that speakers behaved differently - speaker F changed parameters in

some dialogues (e.g. dialogue D-F) and maintained on others (e.g. dialogue

H-F). This implies that speaker F reacted to different conversational partners

and that she consequently changed parameters of the fundamental frequency.

No significant changes for jitter, shimmer and HNR have been found.

6.2.3 Discussion of Analysis 3 - Position in dialogue

The third analysis (Position in dialogue) concerned changes of voice param-

eters between the beginnings and ends of the individual dialogues. Only few

significant results could be achieved, all within F0 (namely F0 SD, F0 Min,

F0 Max and a tendency towards being significant for F0 Mean) as well there

was a tendency for the changes of jitter towards to be significant (see Table

5.24). These results are similar to those of the first (see Chapter 5.1, Individual

speakers) and the second analysis (see Chapter 5.2, Speaker F) in which also

changes of F0 were (mainly) significant.

73

6 Discussion

6.3 Changes in fundamental frequency

Most changes of voice parameters were related to F0 in all three analyses

(Chapters 5.1-5.3). Gregory and colleagues found that information about for

instance social status [GW96] and dominance [GG02] can be conveyed through

fundamental frequency (or rather frequency bands below 500 Hz). In the

experiment by Gregory et al. participants heard their conversational partner

unaltered or filtered (low-pass filtered at 550 Hz or high-pass filtered at 1000

Hz high-pass filtered) [GDW97]. Convergence was found for the unaltered and

the low-pass filtered conditions, but not for the high-pass filtered condition

in which the regions of F0 had been cut out. Additionally ratings about

the conversation were more negative for the filtered conditions, slightly more

for the high-pass filtered condition. Gregory and colleagues concluded that

F0 plays a significant role as it transfers social information and allows for

accommodation.

In a similar experiment, Babel and Bulatov high-filtered recordings of words

uttered by a male speaker at 300 Hz (and thus cut out regions of F0) [BB11].

Participants then heard the unaltered or the filtered version and were told

to shadow the heard words. Analysis of F0 revealed that in the unaltered

condition repetitions were more similar to the target word as in the filtered

condition. In an additional AXB-perception test, judgements of listeners con-

firmed this result as they rated repetitions in the unaltered condition to be

more similar to the target word as repetitions of the filtered condition. As Ba-

bel and Bulatov point out, the measures of accommodation did not correlate.

They conclude that F0 is a parameter that can be accommodated, but that it

is not the only one, as signals are complex and multiple acoustic features are

available, e.g. VOT [Nie07, Nie10] and vowel quality [Bab09, Bab12].

The present thesis confirmed that speakers change most, perhaps due to

convergence or divergence, on values of F0. Almost no significant changes

were found for other parameters of voice. Thus it can be concluded that

across the dialogues (and thus a great time span) and across the examined

voice parameters convergence of voice mostly occurs for F0.

74

6.4 Temporal aspects

6.4 Temporal aspects

An explanation that changes in voice parameters mainly occurred mainly for

values of F0 and not for other parameters might be due to to the extraction

of samples from the first and the last five minutes of the dialogue. It might be

that accommodation effects occur very early in the dialogue as also Schweitzer

and Lewandowski supposed [SL13]. Nevertheless they could not find an effect

of time in their analyses of articulation rate.

De Looze and colleagues suppose that “mimicry [similar to convergence]

is not a linear phenomenon but rather dynamic” [LORC11, p. 1297]. This

then would imply that in a spontaneous dialogue phases of convergence and

non-convergence could occur.

In the experiment of Levitan and Hirschberg different parameters showed

to be significant when comparing mean values of parameters extracted from

different time spans [LH11]. In their experiment, participants, working in pairs,

played three computer games. They were not able to see each other. Results

indicate that during the first game participants changed significantly on mean

intensity, shimmer and NHR. Regarding the whole session, including all three

computer games, participants changed on jitter and F0 Mean. Regarding the

turn-level even more changes could be found (F0 Mean, F0 Max, shimmer,

NHR). Thus it can be concluded that accommodation/convergence is sensitive

to temporal aspects (and means of measurement), as also the results of Levitan

and Hirschberg differ from the findings of the present thesis. It can be assumed

that some voice parameters tend to change early in conversation and that

others require more time for the participants to converge. Additionally, it

could be possible that some voice parameters change on the global level, i.e.

over the whole conversation, and others on local levels, i.e. from turn to turn

[LH11].

As the dialogues analysed in the present thesis lasted for about 25 min-

utes F0 could possibly be a global parameter (as also changes of F0 Mean

were significant for the whole session in Levitan and Hirschberg’s experiment

[LH11]). Future research could bring more evidence for voice parameters and

their compass in speech.

75

6 Discussion

6.5 Engagement

De Looze and colleagues revealed that becoming more similar on prosodic

parameters (median F0, F0 SD, pauses (number and duration), voice inten-

sity) is correlated with involvement in the interaction rather than agreement1

[LORC11]. This finding agrees with the coordination-engagement hypothesis

of Niederhoffer and Pennebaker which states that “the more that two peo-

ple in a conversation are actively engaged with one another - in a positive

or even negative way - the more verbal and nonverbal coordination [is ex-

pected]” [NP02, p. 358]. The hypothesis assumes that engagement, whether

positive or negative, might influence accommodation/convergence, rather than

rapport [NP02, p. 358]. Also, attention plays a major role as speakers do not

engage if they are not listening to their conversational partner or/and if they

are distracted. The coordination-engagement hypothesis agrees with CAT as

speakers can still converge to achieve social goals or to arrange the conversa-

tion smoothly and or diverge/maintain to emphasize the differences between

themselves and others.

In the study of Schweitzer and Lewandowski speakers’ local articulation

rates were influenced by the preceeding articulation rate of the conversational

partner and social factors, i.e. mutual liking [SL13]. Results of other studies

also revealed that ratings of mutual attractiveness and/ or liking were influ-

ential [PG08, Nat75, ACG+11]. Thus it can be assumed that social factors

have an influence on speech. As the present ratings for social attractiveness

and competence agreed partially with observed changes in voice parameters, it

could be assumed that both, engagement of the speakers, possibly stimulated

by discussed topics, as well as the evaluation of the dialogue partner influenced

speakers and thus convergence. Additional dimensions, including phonetic tal-

ent, as it showed to be influential [Lew12], should also be regarded. In addition,

future research could reveal which factors are influential, how distinct these

are and how they are related.

1Speakers’ agreement was annotated for disagreement, neutral speech and agreement. In-volvement was annotated for a group of speakers (not for individual speakers) on a scalefrom 0 to 10.

76

7 Conclusion and Outlook

In the present thesis voice parameters were investigated at the beginning and

ends of dialogues. Three different analyses were conducted in order to identify

which voice parameters are most prone to accommodation. For this reason,

values for the fundamental frequency (mean, standard deviation, minimum,

maximum), jitter, shimmer and harmonics-to-noise ratio were considered.

Results showed that most significant changes of voice parameters occur for

the fundamental frequency. Thus it can be concluded that accommodation of

voice parameters, at least for the global time span of the dialogues, is expressed

through the average and the variance of the fundamental frequency.

Additionally, Analysis 1, concerning the individual speakers, showed that

voice changes for different parameter combinations, which indicates that ac-

commodation of voice is probably not a fully automatic process as proposed by

Pickering and Garrod [GP04, PG04, PG06, MPG12]. Instead social variables,

for instance to gain the approval of the conversational partner [SG82] or to

arrange the conversation unproblematic and smoothly [GGJ+95], as proposed

in the Communication Accommodation Theory, might influence the (speech)

behaviour of the conversational partners as well as engagement, as proposed

in the coordination-engagement hypothesis, and phonetic talent.

In future work, more detailed analyses could reveal new details about con-

vergence of voice parameters. Therefore voice parameters could be evaluated

on different temporal levels, e.g. over the whole conversation or turns, as some

parameters might change on a global level and others on local levels. Addi-

tionally, detailed analyses could bring new insights on the influences of context

and topic as well as engagement, attention and emotions on voice parameters.

Therefore a detailed questionnaire, revealing speakers’ opinions, impressions

and attitudes towards the discussed topics and the conversational partner and

also to the whole situation during the dialogue, could show what factors are

most influential on convergence of voice and to what extent. This might bring

77

7 Conclusion and Outlook

new insights into the causes and motivations for speakers to change along

different voice parameters.

A possible next step is the evaluation of engagement of conversational part-

ners (in the GeCo corpus). Therefore listeners could rate the degree of engage-

ment and possibly this factor proves influential on accommodation/convergence.

78

Bibliography

[Abe67] Abercrombie, D. (1967). Elements of general phonetics. Volume 203.

Edinburgh: Edinburgh University Press.

[ACG+11] Abrego-Collier, C., Grove, J., Sonderegger, M., Yu A.C.L.

(2011). Effects of speaker evaluation on phonetic convergence. In

Proceedings of the 17th International Congress of Phonetic Sci-

ences (ICPhS XVIII), Hong Kong, China, (192-195). Retrieved

2014, July 9 from http://www.icphs2011.hk/resources/OnlineProceedings/

RegularSession/Abrego-Collier/Abrego-Collier.pdf

[AJL87] Aronsson, K., Jonnson, L., Linell, P. (1987). The courtroom hear-

ing as a middle ground: Speech accommodation by lawyers and de-

fendants. Journal of Language and Social Psychology, 6(2), 99-115.

doi:10.1177/0261927X8700600202

[Azu97] Azuma, S. (1997). Speech accommodation and Japanese

emperor Hiroheto. Discourse and Society, 8(2), 189-202.

doi:10.1177/0957926597008002003

[Bab09] Babel, M.E. (2009). Phonetic and Social Selectivity in Speech Ac-

commodation (Doctoral Dissertation), University of California, Berkeley.

Retrieved 2014, July 9 from http://linguistics.berkeley.edu/dissertations/

Babel dissertation 2009.pdf

[Bab12] Babel, M. (2012). Evidence for phonetic and social selectivity in

spontaneous phonetic imitation. Journal of Phonetics, 40(1), 179-188.

doi:10.1016/j.wocn.2011.09.001

[Bau00] Baugh, J. (2000). Racial identification by speech. American Speech,

75(4), 362-364. doi:10.1215/00031283-75-4-362

[BB11] Babel, M., Bulatov D. (2011). The role of fundamental frequency

in phonetic accommodation. Language and Speech, 55(2), 231-248.

doi:10.1177/0023830911417695

79

http://www.icphs2011.hk/resources/OnlineProceedings/RegularSession/Abrego-Collier/Abrego-Collier.pdf

http://www.icphs2011.hk/resources/OnlineProceedings/RegularSession/Abrego-Collier/Abrego-Collier.pdf

http://dx.doi.org/10.1177/0261927X8700600202

http://dx.doi.org/10.1177/0957926597008002003

http://linguistics.berkeley.edu/dissertations/Babel_dissertation_2009.pdf

http://linguistics.berkeley.edu/dissertations/Babel_dissertation_2009.pdf

http://dx.doi.org/10.1016/j.wocn.2011.09.001

http://dx.doi.org/10.1215/00031283-75-4-362

http://dx.doi.org/10.1177/0023830911417695

Bibliography

[BFB04] Belin, P., Fecteau, S., Bedard, C. (2004). Thinking the voice: neural

correlates of voice perception. Trends in Cognitive Sciences, 8(3), 129-135.

doi:10.1016/j.tics.2004.01.008

[BG77] Bourhis, R.Y., Giles, H. (1977). The language of intergroup distinc-

tiveness. In H. Giles (Ed.), Language, Ethnicity and Intergroup Relations

(119135). London: Academic Press.

[BHL+94] Berry, D.S., Hansen, J.S., Landry-Pester, J.C., Meier, J.A. (1994).

Vocal determinants of first impressions of young children. Journal of Non-

verbal Behavior, 18(3), 187-197. doi:10.1007/BF02170025

[Boe03] Boersma, P. (2003, May 21). Voice 3. Shimmer. Retrieved

2014, June 14 from http://www.fon.hum.uva.nl/praat/manual/Voice 3

Shimmer.html

[Boe09] Boersma, P. (2009). Should jitter be measured by peak picking or

by waveform matching? Folia Phoniatrica et Logopaedica, 61(5), 305-308.

doi:10.1159/000245159

[Boe11] Boersma, P. (2011, March 2). Voice 2. Jitter. Retrieved 2014, June 14

from http://www.fon.hum.uva.nl/praat/manual/Voice 2 Jitter.html

[Bra94] Braun, A. (1994). The effect of cigarette smoking on vocal parameters.

In ESCA Workshop on Automatic Speaker Recognition, Identification and

Verification, Martigny, Switzerland (161-164). Retrieved 2014, July 9 from

www.isca-speech.org/archive open/archive papers/asriv94/sr94 161.pdf

[BW14] Boersma, P., Weenink, D. (2014). Praat: doing phonetics by computer

(Version 5.3.63) [Computer program]. Retrieved 2014, May 18. Available

from http://www.praat.org/

[Byr71] Byrne, D. (1971). The attraction paradigm. New York: Academic

Press.

[CB99] Chartrand, T.L., Bargh, J.A. (1999). The chameleon effect: the

perception-behavior link and social interaction. Journal of Personality

and Social Psychology, 76(6), 893-910. Retrieved 2014, June 9 from http:

//www.yale.edu/acmelab/articles/chartrand bargh 1999.pdf

[CJ97] Coupland, N., Jaworski, A. (1997). Relevance, accommodation and

conversation: Modeling the social dimension of communication. Multilin-

gua, 16(2-3), 233-258. doi:10.1515/mult.1997.16.2-3.233

80

http://dx.doi.org/10.1016/j.tics.2004.01.008

http://dx.doi.org/10.1007/BF02170025

http://www.fon.hum.uva.nl/praat/manual/Voice_3__Shimmer.html

http://www.fon.hum.uva.nl/praat/manual/Voice_3__Shimmer.html

http://dx.doi.org/10.1159/000245159

http://www.fon.hum.uva.nl/praat/manual/Voice_2__Jitter.html

www.isca-speech.org/archive_open/archive_papers/asriv94/sr94_161.pdf

http://www.praat.org/

http://www.yale.edu/acmelab/articles/chartrand_bargh_1999.pdf

http://www.yale.edu/acmelab/articles/chartrand_bargh_1999.pdf

http://dx.doi.org/10.1515/mult.1997.16.2-3.233

Bibliography

[Cla96] Clark, H.H. (1996). Using language. Cambridge: Cambridge Univer-

sity Press.

[DB01] Dijksterhuis, A., Bargh J.A. (2001). The perception-behavior express-

way: Automatic effects of social perception an social behavior. In M.P.

Zanna (Ed.), Advances in Experimental Social Psychology, Volume 32,

(pp. 1-40). Retrieved 2014, January 1 from http://www.yale.edu/acmelab/

articles/Dijksterhuis Bargh 2001.pdf

[DBSA11] Duggan, A.P., Bradshaw Y.S., Swergold N., Altman W. (2011):

When rapport building extends beyond affiliation: communication overac-

commodation toward patients with disabilities. The Permanente Journal,

15(2), 23-30. Retrieved 2014, June 9 from http://www.ncbi.nlm.nih.gov/

pmc/articles/PMC3140744/pdf/i1552-5775-15-2-23.pdf

[DR10] Dias, J.W. and Rosenblum, L.D. (2010). Visual influences on interac-

tive speech alignment. Perception, 40(12), 1457-1466. doi:10.1068/p7071

[EN93] Edwards, H., Noller, P. (1993). Perceptions of overaccommodation

used by nurses in communication with the elderly. Journal of Language

and Social Psychology, 12(3), 207-223. doi:10.1177/0261927X93123003

[ES96] Ellgring, H., Scherer, K.R. (1996). Vocal indicators of mood

change in depression. Journal of Nonverbal Behavior, 20(2), 83-110.

doi:10.1007/BF02253071

[Esl78] Esling, J. (1978). The identification of features of voice quality in social

groups. Journal of the International Phonetic Association, 8(1-2), 18-23.

doi:10.1017/S0025100300001699

[FBSW03] Fowler, C.A., Brown, J.M., Sabadini, L., Weihing, J. (2003). Rapid

access to speech gestures in perception: Evidence from choice and simple

response time tasks. Journal of Memory and Language, 49(3), 396 - 413.

doi:10.1016/S0749-596X(03)00072-X

[FHE07] Farrus, M., Hernando, J., Ejarque, P. (2007). Jitter and shimmer

measurements for speaker recognition. In Proceedings of the 8th Annual

Conference of the International Speech Communication Association (In-

terspeech 2007), Antwerp, Belgium (778-781). Received 2014, July 18 from

http://nlp.lsi.upc.edu/papers/far jit 07.pdf

81

http://www.yale.edu/acmelab/articles/Dijksterhuis_Bargh_2001.pdf

http://www.yale.edu/acmelab/articles/Dijksterhuis_Bargh_2001.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3140744/pdf/i1552-5775-15-2-23.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3140744/pdf/i1552-5775-15-2-23.pdf

http://dx.doi.org/10.1068/p7071

http://dx.doi.org/10.1177/0261927X93123003

http://dx.doi.org/10.1007/BF02253071

http://dx.doi.org/10.1017/S0025100300001699

http://dx.doi.org/10.1016/S0749-596X(03)00072-X

http://nlp.lsi.upc.edu/papers/far_jit_07.pdf

Bibliography

[FM40] Fay, P.J., Middleton, W.C. (1940). Judgement of intelligence from the

voices transmitted over a public address system. Sociometry, 3(2), 186-191.

doi:10.2307/2785442

[FMK98] Frohlich, M., Michaelis, D., Kruse E. (1998). Objektive Beschrei-

bung der Stimmguter unter Verwendung des Heiserkeits-Diagramms. HNO,

46(7), 684-489. doi:10.1007/s001060050295

[GA04] Goldinger, S.D., Azuma, T. (2004). Episodic memory reflected in

printed word naming. Psychonomic: Bulletin and Review, 11(4), 716-722.

doi:10.3758/BF03196625

[GCC91] Giles, H., Coupland, J., Coupland, N. (1991). Accommodation the-

ory: Communication, context and consequence. In H. Giles, J. Coupland,

N. Coupland (Eds.), Contexts of accommodation. Developments in applied

sociolinguistics (1-68). Westport: Greenwood Publishing Group.

[GDW97] Gregory, S.W., Dagan, K., Webster, S. (1997). Evaluating the re-

lation of vocal accommodation in conversation partner’s fundamental fre-

quencies to perceptions of communication quality. Journal of Nonverbal

Behavior, 27(1), 23-43. doi:10.1023/A:1024995717773

[GER93] Gordon, P.C., Eberhardt, J.L., Rueckl, J.G. (1993). Attentional

modulation of phonetic significance of acoustic cues. Cognitive Psychology,

25(1), 1-42. doi:10.1006/cogp.1993.1001

[GG98] Gallois, C., Giles, H. (1998). Accommodating mutual influence in in-

tergroup encounters. In M.T. Palmer, G.A. Barnett (Eds.), Progress in

communication sciences, Volume 14, (135-162).

[GG02] Gregory, S.W., Gallagher, T.J. (2002). Spectral analysis of candidates’

nonverbal vocal communication: Predicting U.S. presidential election out-

comes. Social Psychology Quarterly, 65(3), 298-308. Retrieved 2014, July 9

from http://www.jstor.org/stable/3090125

[GG13] Giles, H., Gasiorek , J. (2013). Parameters of nonaccommodation:

Refining and elaborating Communication Accommodation Theory. In J.P.

Forgas, O. Vincze, J. Laszlo (Eds.), Social cognition and communication.

The Sydney Symposium of Social Pschology. (155-172). New York: Psy-

chology Press.

82

http://dx.doi.org/10.2307/2785442

http://dx.doi.org/10.1007/s001060050295

http://dx.doi.org/10.3758/BF03196625

http://dx.doi.org/10.1023/A:1024995717773

http://dx.doi.org/10.1006/cogp.1993.1001

http://www.jstor.org/stable/3090125

Bibliography

[GGJ+95] Gallois, C., Giles, H., Jones, E., Cargile, A.C., Ota, H. (1995).

Accommodating intercultural encounters: Elaborations and extensions. In

R. Wisemann (Ed.), Intercultutal communication theory (115-1147). Thou-

sand Oaks: Sage publications.

[GH82] Gregory, S.W, Hoyt, B.R. (1982). Conversation partner mutual adap-

tation as demonstrated by fourier series analysis. Journal of Psycholinguis-

tic Research, 11(1), 35-46. doi:10.1007/BF01067500

[Gil73] Giles, H. (1973). Accent mobility: A model and some data. Anthro-

pological Linguistics, 15(2), 87-109. Retrieved 2014, July 9 from http:

//www.jstor.org/stable/30029508

[GLS13] Garnier, M., Lamalle, L., Sato, M. (2013). Neural correlates in pho-

netic convergence and speech imitation. Frontiers in Psychology, 4(600),

15 pages. doi:10.3389/fpsyg.2013.00600

[GO06] Giles, H., Ogay, T. (2006). Communication accommodation theory. In

B. B. Whaley and W. Samter (Eds.), Explaining Communication: Con-

temporary theories and exemplars (293-310). Mawah: Lawrence Erlbaum

Assosiates.

[Gol96] Goldinger, S.D. (1996). Words and voices: Episodic traces in spoken

identification and recognition memory. Journal of Experimental Psychol-

ogy: Learning, Memory and Cognition, 22(5), 1166-1183. Retrieved 2014,

July 9 from http://www.public.asu.edu/∼sgolding/docs/pubs/Goldinger

JEPLMC 96.pdf

[Gol97] Goldinger, S.D (1997). Perception and production in an episodic lex-

icon. In K. Johnson, J.W. Mullennix (Eds.), Talker variability in speech

processing (33-66). San Diego: Academic Press.

[Gol98] Goldinger, S.D. (1998). Echoes of echoes? An episodic the-

ory of lexical access. Psychological Review, 105(2), 251-279. Retrieved

2014, July 9 from http://www.cog.brown.edu/courses/cg195/pdf files/

Goldinger%201998.pdf

[Gol13] Goldinger, S.D. (2013). The cognitive basis of spontaneous imitation:

Evidence from the visual world. In Proceedings of Meetings in Acoustics,

Volume 19, Montreal, Canada (6 pages). doi:10.1121/1.4800039

83

http://dx.doi.org/10.1007/BF01067500



http://dx.doi.org/10.3389/fpsyg.2013.00600

http://www.public.asu.edu/~sgolding/docs/pubs/Goldinger_JEPLMC_96.pdf

http://www.public.asu.edu/~sgolding/docs/pubs/Goldinger_JEPLMC_96.pdf

http://www.cog.brown.edu/courses/cg195/pdf_files/Goldinger%201998.pdf

http://www.cog.brown.edu/courses/cg195/pdf_files/Goldinger%201998.pdf

http://dx.doi.org/10.1121/1.4800039

Bibliography

[GOG05] Gallois, C., Ogay, T., Giles, H. (2005): Communication Accommo-

dation Theory, In W.B. Gudykunst (Ed.), Theorizing about intercultural

communication (121-148). Thousand Oaks: Sage publications.

[GP04] Garrod, S., Pickerin M.J. (2004). Why is conversation so easy? Trends

in Cognitive Sciences, 8(1), 8-11. doi:10.1016/j.tics.2003.10.016

[GP75] Giles, H., Powesland, F.P. (1975). Speech style and social variation,

Volume 7 of European monographs in social psychology. London: Academic

Press.

[GPL91] Goldinger, S., Pisoni, D., Logan J. (1991). On the nature of talker

variability effects on recall of spoken word lists. Journal of Experimen-

tal Psychology: Learning, Memory and Cognition, 17(1), 152-162. Re-

trieved 2014, July 9 from http://www.public.asu.edu/∼sgolding/docs/

pubs/Goldinger etal JEPLMC 91.pdf

[Gre90] Gregory, S. W. (1990). Analysis of fundamental frequency reveals co-

variation in interview partners’ speech. Journal of Nonverbal Behavior,

14(4), 237-251. doi:10.1007/BF00989318

[GS10] Goudbeek, M., Scherer, K. (2010). Beyond arousal: Valence and po-

tency/control cues in the vocal expression of emotion. Journal of the Acous-

tical Society of America, 128(3), 1322-1336. doi:10.1121/1.3466853

[GTB73] Giles, H, Taylor, D.M., Bourhis, R. (1973). Towards a theory of in-

terpersonal accommodation through language: Some Canadian data. Lan-

guage in Society, 2(2). 177-192. doi:10.1017/S0047404500000701

[GW96] Gregory, S.W., Webster, S. (1996). A nonverbal signal in voices of in-

terview partners effectively predicts communication accommodation and

social status perception. Journal of Personality and Social Psychology,

70(6), 1231-1240. Retrieved 2014, July 9 from http://www.columbia.edu/

∼rmk7/HC/HC Readings/Gregory.pdf

[HDJ+01] Hollien, H., DeJong, G., Martin, C.A., Schwartz, R., Lilje-

gren, K. (2001). Effects of ethanol intoxication on speech suprasegmen-

tals.Journal of the Acoustical Society of America, 110(6), 3198-32056.

doi:10.1121/1.1413751

[Hin86] Hintzman, D. (1986). Schema abstraction in multiple-trace

memory model. Psychological Review, 93(4), 411-428. Retrieved

84

http://dx.doi.org/10.1016/j.tics.2003.10.016

http://www.public.asu.edu/~sgolding/docs/pubs/Goldinger_etal_JEPLMC_91.pdf

http://www.public.asu.edu/~sgolding/docs/pubs/Goldinger_etal_JEPLMC_91.pdf

http://dx.doi.org/10.1007/BF00989318

http://dx.doi.org/10.1121/1.3466853

http://dx.doi.org/10.1017/S0047404500000701

http://www.columbia.edu/~rmk7/HC/HC_Readings/Gregory.pdf

http://www.columbia.edu/~rmk7/HC/HC_Readings/Gregory.pdf

http://dx.doi.org/10.1121/1.1413751

Bibliography

2014, July 9 from http://www.sfs.uni-tuebingen.de/∼gjaeger/lehre/ss08/

exemplarBased/hintzman86.pdf

[Hog85] Hogg, M. (1985). Masculine and feminine speech in dyads and groups:

A study of speech style and gender salience. Journal of Language and So-

ciety, 4(2), 99-112. doi: 10.1177/0261927X8500400202

[Inc12] IBM Corporation (2012): IBM SPSS Statistics for Windows (Version

21.0) [Computer program]. New York: IBM Corporation.

[JF70] Jaffe, J., Feldstein, S. (1970). Rhythms of Dialogue. New York: Aca-

demic Press.

[JL13] Janssen, J., Laatz, W. (2013). Statistische Datenanalyse mit SPSS.

Eine anwendungsorientierte Einfuhrung in das Basissystem und das Modul

Extakte Tests. Berlin, Heidelberg: Springer Gabler Verlag.

[Joh03] Johnson, K. (2003). Acoustic and auditory phonetics. 2nd edition. Ox-

ford: Blackwell Publishing.

[KFM02] Krauss, R.M., Freyberg R., Morsella E (2002). Inferring speakers’

physical attributes from their voices. Journal of Experimental Social Psy-

chology, 38(6), 618625. doi:10.1016/S0022-1031(02)00510-3

[KJS09] Ko, S.J., Judd, C.M., Stapel, D.A. (2009). Stereotyping based on

voice in the presence of individuating information: Vocal femininity affects

perceived competence but not warmth. Personality and Social Psychology

Bulletin, 35(2), 198-211. doi:10.1177/0146167208326477

[KP04] Krauss, R. M. and Pardo, J. S. (2004). Commentary on Pickering and

Garrod. Is alignment always the result of automatic priming? Behavioral

and Brain Sciences, 27(2), 203-204. doi:10.1017/S0140525X0436005X

[KS11] Kreimanm J. and Sidtis, D. (2011). Foundations of voice studies.

An interdisciplinary approach to voice production and perception. Oxford:

Wiley-Blackwell.

[Lav68] Laver, J.M.D. (1968). Voice quality and indexical information. British

Journal of disorders of Communication, 3(1), 43-54. Retrieved 2014, July

10 from http://vambo.cent.gla.ac.uk/media/media 200297 en.pdf

[Lav76/97] Laver, J. (1997). Language and non-verbal communication. In J.

Laver (Ed.), The gift of speech: readings in the analysis of speech and voice

(131-146). Edinburgh: Edinburgh University Press.

85

http://www.sfs.uni-tuebingen.de/~gjaeger/lehre/ss08/exemplarBased/hintzman86.pdf

http://www.sfs.uni-tuebingen.de/~gjaeger/lehre/ss08/exemplarBased/hintzman86.pdf

http://dx.doi.org/ 10.1177/0261927X8500400202

http://dx.doi.org/10.1016/S0022-1031(02)00510-3

http://dx.doi.org/10.1177/0146167208326477

http://dx.doi.org/10.1017/S0140525X0436005X

http://vambo.cent.gla.ac.uk/media/media_200297_en.pdf

Bibliography

(Reprinted from Language and Speech, Volume 7 of Handbook of Percep-

tion, 345-363, by E.C. Carterette, M.P. Friedmann, Eds., 1976, New York:

Academic Press)

[Lav03] Laver, J. (2003). Three semiotic layers of spoken communication.

Journal of Phonetics, 31(3-4), 413-415. doi:10.1016/S0095-4470(03)00034-2

[Lew12] Lewandowski, N. (2012): Talent in nonnative phonetic dialogue (Doc-

toral Dissertation). University of Stuttgart, Stuttgart. Retrieved 2014,

July 10 from http://elib.uni-stuttgart.de/opus/volltexte/2012/7402/pdf/

Lewandowski.pdf

[LH11] Levitan, R., Hirschberg, J. (2011). Measuring acoustic-prosodic en-

trainment with respect to multiple levels and dimensions. In Proceed-

ings of the 12th Annual Conference of the International Speech Com-

munication Association (Interspeech 2011), Florence, Italy (3081-3084).

Retrieved 2014, July 10 from http://www.cs.columbia.edu/∼julia/papers/

levitan&hirschberg11.pdf

[Lig89] Lightfoot, N. (1989). Effects of familiarity on serial recall for spoken

word lists. Research on speech perception, Progress report No. 15., 421-443.

Retrieved 2014, July 10 from http://files.eric.ed.gov/fulltext/ED318074.

pdf

[Lin98] Linville, S. (1998). Acoustic correlates of perceived versus actual sexual

orientation in mens speech. Pholia Phoniatrica et Logopaedica, 50(1), 3548.

doi:10.1159/000021447

[LJB05] Laukka, P., Juslin, P.N., Bresin, R. (2005). A dimensional approach

to vocal expression of emotion. Cognition and Emotion, 19(5), 633-653.

doi:10.1080/02699930441000445

[LLA+08] Laukka, P., Linnman, C., Ahs, F., Pissiota, A., Frans, O., Faria,

V., Michelgard, A., Appel, L., Frederikson, M., Furmark, T. (2008). In a

nervous voice: acoustic analysis and perception of anxiety in social phobic’s

speech. Journal of Nonverbal Behavior, 32(4), 195-214. Retrieved 2014,

July 10 from http://www.ohio.edu/people/leec1/documents/sociophobia/

Laukka Petri.pdf

[LORC11] de Looze, C., Oertel, C., Rauzy, S., Campbell (2011). Mea-

suring dynamics of mimicry by means of prosodic cues in conversa-

tional speech. In Proceedings of the 17th International Congress of Pho-

86

http://dx.doi.org/10.1016/S0095-4470(03)00034-2

http://elib.uni-stuttgart.de/opus/volltexte/2012/7402/pdf/Lewandowski.pdf

http://elib.uni-stuttgart.de/opus/volltexte/2012/7402/pdf/Lewandowski.pdf

http://www.cs.columbia.edu/~julia/papers/levitan&hirschberg11.pdf

http://www.cs.columbia.edu/~julia/papers/levitan&hirschberg11.pdf

http://files.eric.ed.gov/fulltext/ED318074.pdf

http://files.eric.ed.gov/fulltext/ED318074.pdf

http://dx.doi.org/10.1159/000021447

http://dx.doi.org/10.1080/02699930441000445

http://www.ohio.edu/people/leec1/documents/sociophobia/Laukka_Petri.pdf

http://www.ohio.edu/people/leec1/documents/sociophobia/Laukka_Petri.pdf

Bibliography

netic Sciences (ICPhS XVII), Hong Kong, China (1294-1297). Retrieved

2014, July 9 from http://www.icphs2011.hk/resources/OnlineProceedings/

RegularSession/de%20Looze/de%20Looze.pdf

[LP05] Levi, S., Pisoni D. P. (2005). Indexical and Linguistic Channels

in Speech Perception: Some Effects of Voiceovers on Advertising Out-

comes. Research on spoken language processing, Progress Report No. 27,

65-80. Retrieved 2014, July 10 from http://www.iu.edu/∼srlweb/pr/27/

65-Levi-Pisoni.pdf

[MATB14] McAleer, P., Todorov, A., Belin, P. (2014). How do you say

’Hello’? Personality Impressions from brief novel voices. PLoS ONE, 9(3),

doi:10.1371/journal.pone.0090779

[May13] Mayer, J. (2013). Phonetische Analysen mit Praat. Ein Handbuch

fr Ein- und Umsteiger. Retrieved 2014, May 18 from http://praatpfanne.

lingphon.net/downloads/praat manual.pdf

[MD87] Murphy, C.H., Doyle P.C. (1987). The effects of cigarette smoking on

voice-fundamental frequency. Otolaryngol Head Neck Surgery, 97(4), 376-

380. Retrieved 2014, July 10 from http://oto.sagepub.com/content/97/4/

376

[MMPS89] Martin, C.S., Mullennix, J.W., Pisoni, D.B., Summers, W.V.

(1989). Effects of talker variability on recall of spoken word lists. Jour-

nal of Experimental Psychology: Learning, Memory and Cognition, 15(4),

676-684. Retrieved 2014, July 10 from http://www.ncbi.nlm.nih.gov/pmc/

articles/PMC3510481/pdf/nihms418731.pdf

[MPG12] Meneti, L., Pickering, M.J., Garrod, S.C. (2012). Toward a neural

basis of interactive alignment in conversation. Frontiers in Human Neuro-

science, 6(185), 9 pages. doi:10.3389/fnhum.2012.00185

[MS08] Minnema, W., Stoll H.-C., (2008). Objektive computergesttzte Stim-

manalyse mit Praat. Forum Logopdie, 4(22), 24-29. Retrieved 2014, July

10 from https://www.wevosys.com/knowledge/ data knowledge/13.pdf

[MTHH14] Moreau, M.L., Thiam, N., Harmegnies, B., Huet, K.(2014). Can

listeners assess the sociocultural status of speakers who use a language they

are unfamiliar with? A case study of Senegalese and European students

listening to Wolof speakers. Journal of Language in Society, 43(3), 333-348.

doi:10.1017/S0047404514000220

87

http://www.icphs2011.hk/resources/OnlineProceedings/RegularSession/de%20Looze/de%20Looze.pdf

http://www.icphs2011.hk/resources/OnlineProceedings/RegularSession/de%20Looze/de%20Looze.pdf

http://www.iu.edu/~srlweb/pr/27/65-Levi-Pisoni.pdf

http://www.iu.edu/~srlweb/pr/27/65-Levi-Pisoni.pdf

http://dx.doi.org/10.1371/journal.pone.0090779

http://praatpfanne.lingphon.net/downloads/praat_manual.pdf

http://praatpfanne.lingphon.net/downloads/praat_manual.pdf

http://oto.sagepub.com/content/97/4/376

http://oto.sagepub.com/content/97/4/376

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510481/pdf/nihms418731.pdf


http://dx.doi.org/10.3389/fnhum.2012.00185

https://www.wevosys.com/knowledge/_data_knowledge/13.pdf

http://dx.doi.org/10.1017/S0047404514000220

Bibliography

[Mur99] Murphy, P.J. (1999). Perturbation-free measurement of the

harmonics-to-noise ratio in voice signals using pitch synchronous harmonic

analysis. Journal of the Acoustical Society of America, 105 (5), 2866-28881.

doi:10.1121/1.426901

[Nat75] Natale, M. (1975). Convergence of mean vocal intensity in dyadic

communication as a function of social desirability. Journal of Personality

and Social Psychology, 32(5), 790-804. doi:10.1037/0022-3514.32.5.790

[NFG06] Nawka T., Franke I., Galkin, E. (2006): Objektive Messverfahren in

der Stimmdiagnostik. Forum Logopdie, 4(20), 14-21. Retrieved 2014, July

10 from https://www.wevosys.com/knowledge/ data knowledge/7.pdf

[NP02] Niederhoffer, K.G., Pennebaker, J.W. (2002). Linguistic style match-

ing in social interaction. Journal of Language and social psychology, 21(4),

337-360. doi:10.1177/026192702237953

[Nie07] Nielsen, K.Y. (2007). Implicit phonetic imitation is constrained by

phonemic contrast. In Proceedings of the 16th International Congress of

Phonetic Sciences (ICPhS XVI), Saarbrucken, Germany (1961-1964). Re-

trieved 2014, July 10 from http://www.icphs2007.de/conference/Papers/

1641/1641.pdf

[Nie10] Nielsen, K.Y. (2010): Specificity and abstractness of VOT imitation.

Journal of Phonetics, 39(2), 132-142. doi:10.1016/j.wocn.2010.12.007

[NNS02] Namy, L.L., Nygaard, C. L., Sauerteig D. (2002). Gender differences

in vocal accommodation: The role of perception. Journal of Language and

Social Psychology, 21(4), 422-432. doi:10.1177/026192702237958

[NQ08] Nygaard, L.C., Queen, J.S. (2008). Communicating emotion: Linking

affective prosody and word meaning. Journal of Experimental Psychology:

Human Perception and Performance, 34(4), 1017-1030. doi: 10.1037/0096-

1523.34.4.1017

[NSP94] Nygaard, L.C., Sommers, M.S., Pisoni, D.B. (1994). Speech per-

ception as a talker-contingent process. Psychological Science, 5(1), 42-46.

doi:10.1111/j.1467-9280.1994.tb00612.x

[NW11] Newman, M., Wu, A. (2004): Do you sound Asian when you speak En-

glish? Racial identification and voice in Chinese and Koreans’s American

English. American Speech, 86(2), 152-178. doi:10.1215/00031283-1336992

88

http://dx.doi.org/10.1121/1.426901

http://dx.doi.org/10.1037/0022-3514.32.5.790

https://www.wevosys.com/knowledge/_data_knowledge/7.pdf

http://dx.doi.org/10.1177/026192702237953

http://www.icphs2007.de/conference/Papers/1641/1641.pdf

http://www.icphs2007.de/conference/Papers/1641/1641.pdf


http://dx.doi.org/10.1177/026192702237958

http://dx.doi.org/ 10.1037/0096-1523.34.4.1017

http://dx.doi.org/ 10.1037/0096-1523.34.4.1017

http://dx.doi.org/10.1111/j.1467-9280.1994.tb00612.x

http://dx.doi.org/10.1215/00031283-1336992

Bibliography

[Par06] Pardo, J. (2006). On phonetic convergence during conversational in-

teraction. Journal of Acoustical Society of America, 119(4), 2382-2393.

doi:10.1121/1.2178720

[Par10] Pardo, J. (2010). Expressing oneself in conversational interaction. In E.

Morsella (Ed.), Expressing oneself/ expressing ones self: communication,

cognition, language, and identity (183-196). London: Taylor and Francis.

[Par12] Pardo, J. (2012). Reflections on phonetic convergence: Speech percep-

tion does not mirror speech production. Language and Linguistics Compass,

6(12), 753-767. doi:10.1002/lnc3.367

[Pea31] Pear, T.H. (1931). Voice and personality as applied to radio broadcast-

ing. New York: Wiley.

[PG04] Pickering, M.J., Garrod S. (2004). Toward a mechanistic psy-

chology of dialogue. Behavioral and brain sciences, 27(2), 169-190.

doi:10.1017/S0140525X04000056

[PG06] Pickering, M.J., Garrod, S. (2006). Alignment as the Basis for Suc-

cessful Communication. Research on Language and Computation, 4(2-3),

203-228. Retrieved 2014, July 10 from http://www.speech.kth.se/∼edlund/

bielefeld/references/pickering-and-garrod-2006.pdf

[PG08] Pitts, M.J., Giles, H. (2008). Social psychology and personal relation-

ships: Accommodation and relational influences across time and contexts.

In: G. Antos, V. Ventola (Eds.), Handbook of Interpersonal Communica-

tion, Volume 2 of Handbook of Applied Linguistics (15-31). Berlin: Mouton

de Gruyter.

[PGP93] Palmeri, Goldinger, Pisioni (1993). Episodic encoding of voice at-

tributes and recognition memory for spoken words. Journal of Exper-

imental Psychology: Learning, Memory and Cognition, 19(2), 309-328.

Retrieved 2014, July 10 from http://www.ncbi.nlm.nih.gov/pmc/articles/

PMC3499966/pdf/nihms418709.pdf

[PGSK12] Pardo, J., Gibbons, R., Suppes, A., Krauss, R.M. (2012). Phonetic

convergence in college roommates. Journal of Phonetics, 40(1), 190-197.

doi:10.1016/j.wocn.2011.10.001

[Pie01] Pierrehumbert, J. (2001). Exemplar dynamics: word frequency, leni-

tion and contrast. In Bybee, J. and Hopper P. (Eds.), Frequency effects

89

http://dx.doi.org/10.1121/1.2178720

http://dx.doi.org/10.1002/lnc3.367

http://dx.doi.org/10.1017/S0140525X04000056

http://www.speech.kth.se/~edlund/bielefeld/references/pickering-and-garrod-2006.pdf

http://www.speech.kth.se/~edlund/bielefeld/references/pickering-and-garrod-2006.pdf




Bibliography

and the emerge of linguistic structure, Volume 45 of Typological studies of

language(137-157). Amsterdam: John Benjamins Publishing.

[PJK10] Pardo, J., Jay, I.C., Krauss, R.M. (2010). Conversational role influ-

ences speech imitation. Attention, Perception and Psychophysics, 72(8),

2254-2264. doi:10.3758/BF03196699

[PLB+09] Petrovic-Lazic M., Babac, S., Vukovic, M., Kosanovic, R. and

Ivankovic, Z. (2009): Acoustic Voice Analysis of Patients With Vocal Fold

Polyp Journal of Voice,25(1), 94-97. doi:10.1016/j.jvoice.2009.04.002

[RST04] Ryalls, J., Simon, M., Thomason, J. (2004). Voice Onset Time pro-

duction in older Caucasian- and African-Americans. Journal of multilin-

gual communication disorders, 2(1), 61-67. Retrieved 2014, July 10 from

http://informahealthcare.com/doi/pdf/10.1080/1476967031000090980

[RZB97] Ryalls, J., Zipprer, A. Baldauff, P. (1997). A preliminary investiga-

tion of the effects of gender and race on Voice Onset Time. Journal of

Speech, Language and Hearing Research, 40(3), 642-645. Retrieved 2014,

July 10 from http://jslhr.pubs.asha.org/article.aspx?articleid=1781846

[Sau01] de Saussure, F. (2001). Grundfragen der allgemeinen Sprachwis-

senschaft (3rd ed.). (H. Lommel, Transl.). Berlin: de Gruyter. (Original

work published 1916)

[SBS75] Smith, B.L., Brown, B.L. Strong W.J., Rencher A.C. (1975). Effects

of speech rate on personality perception. Language and Speech, 18(2), 145-

152. Retrieved 2014, July 10 from http://las.sagepub.com/content/18/2/

145.full.pdf

[SBP83] Street, R.L., Brady, R.M., Putman, W.B. (1983). The influence

of speech rate stereotypes and rate similarity or listeners’ evaluations

of speakers. Journal of Language and Social Psychology, 2(1), 37-56.

doi:10.1177/0261927X8300200103

[SBW01] Scherer, K.R., Banse, R., Wallbott, H.G. (2001). Emotion inferences

from vocal expression correlate across languages and cultures. Journal of

cross-cultural psychology, 32(1), 76-92. doi: 10.1177/0022022101032001009

[Sch86] Scherer, K.R. (1986). Vocal affect expression: a review and a

model for future research. Psychological Bulletin, 99(2), 143-165. Re-

trieved 2014, July http://www.affective-sciences.org/system/files/biblio/

1986 Scherer PsyBull.pdf

90

http://dx.doi.org/10.3758/BF03196699

http://dx.doi.org/10.1016/j.jvoice.2009.04.002

http://informahealthcare.com/doi/pdf/10.1080/1476967031000090980

http://jslhr.pubs.asha.org/article.aspx?articleid=1781846

http://las.sagepub.com/content/18/2/145.full.pdf

http://las.sagepub.com/content/18/2/145.full.pdf

http://dx.doi.org/10.1177/0261927X8300200103

http://dx.doi.org/ 10.1177/0022022101032001009

http://www.affective-sciences.org/system/files/biblio/1986_Scherer_PsyBull.pdf

http://www.affective-sciences.org/system/files/biblio/1986_Scherer_PsyBull.pdf

Bibliography

[Sch03] Scherer, K.R. (2003). Vocal communication of emotion: A re-

view of research paradigms. Speech Communication, 40(1), 227-256.

doi:10.1016/S0167-6393(02)00084-5

[Schi11] Schiel, F. (2011). Perception of alcoholic intoxication in speech. In

Proceedings of the 12th Annual Conference of the International Speech

Communication Association (Interspeech 2011), Florence, Italy (3281-

3284). Retrieved 2014, July 10 from http://www.phonetik.uni-muenchen.

de/forschung/publikationen/Schiel-IS2011.pdf

[Schw12] Schweitzer, K. (2012). Frequency effects on pitch accents: Towards

an exemplar-theoretic approach to intonation (Doctoral Dissertation). Uni-

versity of Stuttgart, Stuttgart. Retrieved 2014, July 10 from http://elib.

uni-stuttgart.de/opus/volltexte/2013/7997/pdf/Schweitzer2012.pdf

[SF97] Sancier, M.L., Fowler, C.A. (1997). Gestural drift of a bilingual speaker

of Brazilian Portuguese and English. Journal of Phonetics, 25(4), 421-436.

doi:10.1006/jpho.1997.0051

[SG82] Street, R.L., Giles, H. (1982). Speech Accommodation Theory: A so-

cial cognitive approach to language use and speech behavior. In M. Roloff,

C. Berger (Eds.), Social cognition and communication. Beverly Hills: Sage.

[SGLP01] Shepard, C.A., Giles, H., Le Poire, B.A. (2001). Communication Ac-

commodation Theory. In W.P. Robinson, H. Giles (Eds.), The New Hand-

book of Language and Social Psychology (33-56). Chichester: Wiley.

[SH82] Sorensen, D., Horii, Y. (1982). Cigarette smoking and voice funda-

mental frequency. Journal of communication disorders, 15 (2), 135-144.

doi:10.1016/0021-9924(82)90027-2

[SL12] Schweitzer, A., Lewandowski, N. (2012). Accommodation of backchan-

nels in spontaneous speech (Abstract). Book of Abstracts of the In-

ternational Symposium on Imitation and Convergence in Speech, Aix-

en-Provence, France, 2012, September 3-5. Retrieved 2014, July

9 from http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/

docs/SchweitzerLewandowski2012.pdf

[SL13] Schweitzer, A. and Lewandowski, N. (2013). Convergence of ar-

ticulation rate in spontaneous speech. In Proceedings of the 14th An-

nual Conference of the International Speech Communication Associa-

tion (Interspeech 2013), Lyon, France (525-529). Retrieved 2014, July

91

http://dx.doi.org/10.1016/S0167-6393(02)00084-5

http://www.phonetik.uni-muenchen.de/forschung/publikationen/Schiel-IS2011.pdf

http://www.phonetik.uni-muenchen.de/forschung/publikationen/Schiel-IS2011.pdf

http://elib.uni-stuttgart.de/opus/volltexte/2013/7997/pdf/Schweitzer2012.pdf

http://elib.uni-stuttgart.de/opus/volltexte/2013/7997/pdf/Schweitzer2012.pdf

http://dx.doi.org/10.1006/jpho.1997.0051

http://dx.doi.org/10.1016/0021-9924(82)90027-2

http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/docs/SchweitzerLewandowski2012.pdf


Bibliography


docs/SchweitzerLewandowski2013.pdf

[SLD14] Schweitzer, A., Lewandowski, N., Dogil, G. (2014). Ad-

vancing corpus-based analyses of spontaneous speech: Switch to

GECO! In Proceedings of the 14th conference on Laboratory Phonol-

ogy (LabPhon), Tokyo, Japan (1 page). Retrieved 2014, July


docs/SchweitzerLewandowskiDogil2014.pdf

[SMA+82] Streeter, L.A., Macdonald, N.H., Apple, W., Krauss, R.M., Ga-

lotti, K.M. (1982). Acoustic and perceptual indicators of emotional

stress. Journal of the Acoustical Society of America, 73(4), 1354-1360.

doi:10.1121/1.389239

[SSL94] Stemple, J.C., Stanley, J., Lee, L. (1994). Objective measures of voice

production in normal subjects following prolonged voice use. Journal of

voice, 9(2), 127-133. doi:10.1016/S0892-1997(05)80245-0

[TE95] Traunmller, H., Eriksson, A. (1994). The frequency range of the voice

fundamental in the speech of male and female adults. Retrieved 2014, July

8 from http://www2.ling.su.se/staff/hartmut/f0 m&f.pdf

[TH80] Todt, E.H., Howell, R.J. (1980). Vocal cues as indices of

schizophrenia. Journal of Speech and Hearing Research, 23(3), 517-526.

doi:10.1044/jshr.2303.517

[Tri60] Triandis, H.C. (1960). Cognitive similarity and communication in a

dyad. Human Relations, 13(2), 175-183. doi:10.1177/001872676001300206

[VLKE85] Van Lancker, D., Kreimann, J., Emmorey, K. (1985). Familiar

voice recognition: Patterns and parameters. Part I: Recognition of back-

ward voices. Journal of Phonetics 13(1), 19-38. Retrieved 2014, July

10 from http://www.surgery.medsch.ucla.edu/glottalaffairs/papers/van%

20lancker%20-%20kreiman%20-%20emmorey%201985.pdf

[Web70] Webb, J.T. (1970). Interview synchrony. In A.W. Siegmann, B. Pope

(Eds.), Studies in dyadic communication: Proceedings of research confer-

ence on interview (115-133). New York: Pergamon.

92



http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/docs/SchweitzerLewandowskiDogil2014.pdf

http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schweitz/docs/SchweitzerLewandowskiDogil2014.pdf

http://dx.doi.org/10.1121/1.389239

http://dx.doi.org/10.1016/S0892-1997(05)80245-0

http://www2.ling.su.se/staff/hartmut/f0_m&f.pdf

http://dx.doi.org/10.1044/jshr.2303.517

http://dx.doi.org/10.1177/001872676001300206

http://www.surgery.medsch.ucla.edu/glottalaffairs/papers/van%20lancker%20-%20kreiman%20-%20emmorey%201985.pdf

http://www.surgery.medsch.ucla.edu/glottalaffairs/papers/van%20lancker%20-%20kreiman%20-%20emmorey%201985.pdf

Bibliography

[WJE+02] Wittels, P., Johannes, B., Enne, R., Kirsch, K. Gunga, H.C. (2002).

Voice monitoring to measure emotional load during short-term stress. Eu-

ropean Journal of Applied Physiology, 87(3), 278-282. doi:10.1007/s00421-

002-0625-1

[WM03] Wellham, N.V., Maclagan, M.A. (2003). Vocal fatigue: Cur-

rent knowledge and future directions. Journal of voice, 17(1), 21-30.

doi:10.1016/S0892-1997(03)00033-X

[WJM+04] Wallis, L., Jackson-Menaldi, C., Holland, W., Giraldo, A. (2004).

Vocal fold nodule vs. vocal fold polyp: Answer from surgical patholo-

gist and voice pathologist point of view. Journal of Voice, 18(1), 125129.

doi:10.1016/j.jvoice.2003.07.003

[WS72] Williams, C.E., Stevens, K.N. (1972). Emotions and speech: Some

acoustical correlates. Journal of the Acoustical Society of America , 52(4),

1238-1250. doi:10.1121/1.1913238

[YWB82] Yumoto, E., Wilbur, W.J., Baer, T. (1982). Harmonics-to-noise

ratio as an index of the degree of hoarseness. Journal of Acoustical So-

ciety of America, 71(6), 1544-1550. Retrieved 2014, July 10 from http:

//web.haskins.yale.edu/Reprints/HL0373.pdf

[YSO84] Yumoto E., Sasaki Y., Okamura H. (1984). Harmonics-to-noise ratio

and psychophysical measurement of the degree of hoarseness. Journal of

speech and hearing research, 27(1), 2-6. Retrieved 2014, July 10 from http:

//web.haskins.yale.edu/Reprints/HL0373.pdf

93

http://dx.doi.org/10.1007/s00421-002-0625-1

http://dx.doi.org/10.1007/s00421-002-0625-1

http://dx.doi.org/10.1016/S0892-1997(03)00033-X

http://dx.doi.org/10.1016/j.jvoice.2003.07.003

http://dx.doi.org/10.1121/1.1913238

http://web.haskins.yale.edu/Reprints/HL0373.pdf




Appendices

95

A Outliers - Individual speakers

Speaker& partof dia-logue

Speakersin dia-logue

F0 F0 SD F0Min

F0Max

Jitt Shim HNR

A-beg A & F ∗16 ◦3, ◦9,◦13

◦3

A-end A & F ∗4, ◦7 ∗14

F-beg A & F ◦2, ◦10 ◦14 ◦14 ◦4

F-end A & F ◦1 ∗9,◦1, ◦2,◦12,◦15,◦20

C-beg C & F ◦4 ∗12 ◦ 12

C-end C & F ∗2,∗21,∗23,◦7, ◦19

◦13 ◦8, ◦26

F-beg C & F ∗9, ◦1 ◦8 ◦21 ◦1,◦18,◦22

◦12

F-end C & F ∗16,◦2,◦17,◦20

◦4 ∗10,∗14,∗16,◦12

◦16 ◦11

D-beg D & F ◦15 ∗23, ◦5 ◦7

D-end D & F

F-beg D & F ◦4, ◦3 ◦16

F-end D & F ◦2, ◦11 ◦3,◦11,◦17

◦4, ◦18

H-beg H & F ∗6 ◦6 ◦3, ◦6 ◦18 ◦4, ◦6 ∗6

H-end H & F ∗4, ∗13 ∗13, ◦3 ◦13

F-beg H & F ◦9 ◦7 ◦7, ◦10 ◦4, ◦15 ◦4

F-end H & F ∗15,◦5, ◦11

◦10 ◦7, ◦10

97

A Outliers - Individual speakers


Speakersin dia-logue

F0 F0 SD F0Min

F0Max

Jitt Shim HNR

J-beg J & F ◦3, ◦4,◦10

◦2, ◦15 ◦5, ◦8

J-end J & F ∗11 ◦16

F-beg J & F ◦1 ∗2, ◦1,◦19,◦20,◦22

◦1, ◦3 ◦3, ◦18 ∗16 ◦7

F-end J& F ◦14 ◦14 ∗14,◦1, ◦2

K-beg K & F ◦1 ◦1 ◦10,◦13

◦2

K-end K & F ◦21

F-beg K & F ◦8 ◦1, ◦5,◦7, ◦8

F-end K & F ◦15, 20 ◦8 ◦5

Outliers for the beginnings and ends for the individual speakers of each dialogue,recognized with the help of Boxplots, generated with SPSS [Inc12]. Outliers markedby ∗are extreme values, defined as a score that is greater than 3 box lengths awayfrom the upper or lower hinge of the box [JL13, p. 242]. Those are discarded fromthe analyses. Outliers, marked by ◦, defined as samples in score that is between 1.5and 3 box lengths away from the upper or lower hinge of the box [JL13, p. 242], areincluded in the analyses. The numbers next to the outliers mark the correspondingsamples.

98

B Outliers - speaker F


F0 F0 SD F0 Min F0 Max Jitt Shim HNR

F-beg ◦52,◦125

◦54, ◦55 ◦2, ◦10,◦38,◦45,◦52,◦54,◦55,◦67, ◦68

∗35,◦76,◦85, ◦93

◦35

F-end ◦38,◦45,◦54,◦129

◦26,◦104,◦129

∗46,∗54,∗60,∗129,◦9, ◦45

◦75,◦77,◦85,◦101,◦115

◦75,◦101

◦75,◦101

Outliers for the sample collections of the beginning and end of speaker F in allsix dialogues, recognized with the help of Boxplots, generated with SPSS [Inc12].Outliers marked by ∗are extreme values, defined as a score that is greater than 3box lengths away from the upper or lower hinge of the box [JL13, p. 242]. Thoseare discarded from the analyses. Outliers, marked by ◦, defined as samples in scorethat is between 1.5 and 3 box lengths away from the upper or lower hinge of thebox [JL13, p. 242], are included in the analyses. The numbers next to the outliersmark the corresponding samples.

99

C Results of Analysis 1 -Individual speakers

C.1 Dialogue A-F

Speaker A









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker A from the beginnings and ends of the dialogue with speaker F.


Pair 1 F0 Mean -3.814542 19.328867 3.945488 -0.967 23 0.344Pair 2 F0 SD 3.413625 19.856913 4.053275 0,842 23 0.408Pair 3 F0 Min -1.802957 20.017503 4.173938 -0.432 22 0.670Pair 4 F0 Max 2.112417 87.497135 17.860278 0.118 23 0.907Pair 5 Jitter 0.233261 0.624782 0.130276 1.791 22 0.087Pair 6 Shimmer 0.794750 2.020564 0.412446 1.927 23 0.066Pair 7 HNR -1.074208 2.213673 0.451864 -2.377 23 0.026

Results of the paired t-tests, comparing samples of speaker A from the beginningand the end. The parameter of HNR is significant and shimmer exhibits a tendencyfor being significant.

101

C Results of Analysis 1 - Individual speakers

Speaker F









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker F from the beginnings and ends of the dialogue with speaker A.


Pair 1 F0 Mean 0.868682 21.487454 4.581141 0.190 21 0.851Pair 2 F0 SD 10.291136 14.985259 3.194868 3.221 21 0.004Pair 3 F0 Min -21.653045 40.206783 8.572115 -2.526 21 0.020Pair 4 F0 Max 32.559095 49.352425 10.769582 3.023 20 0.007Pair 5 Jitter 0.040045 0.587328 0.125219 0.320 21 0.752Pair 6 Shimmer 0.344455 2.167322 0.462075 0.745 21 0.464Pair 7 HNR -0.369909 2.474797 0.527629 -0.701 21 0.491

Results of the paired t-tests, comparing samples of speaker F from the beginningand the end within the dialogue with speaker A. Significant parameters are F0 SD,F0 Min and F0 Max.


A (∗) ∗F ∗ ∗ ∗

Significant changes of voice correlates for the speakers A and F in their dialogue.Significant values are marked by ∗, tendencies are marked by (∗).

102

C.2 Dialogue C-F

C.2 Dialogue C-F

Speaker C









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker C from the beginnings and ends of the dialogue with speaker F.


Pair 1 F0 Mean -1.397276 22.832798 4.239944 -0.330 28 0.744Pair 2 F0 SD 1.375690 15.269180 2.835416 0.485 28 0.631Pair 3 F0 Min 12.332654 45.460089 8.915457 1.383 25 0.179Pair 4 F0 Max 2.060517 48.476788 9.001914 0.229 28 0.821Pair 5 Jitter -0.198179 0.981307 0.185450 -1.069 27 0.295Pair 6 Shimmer -0.696655 2.591666 0.481260 -1.448 28 0.159Pair 7 HNR 0.034393 3.961333 0.748622 0.046 27 0.964

Results of the paired t-tests, comparing samples of speaker C from the beginningand the end. None of the parameters is significant.

103


Speaker F









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker F from the beginnings and ends of the dialogue with speaker C.


Pair 1 F0 Mean 8.896263 19.409138 4.452762 1.998 18 0.061Pair 2 F0 SD -0.825714 11.686230 2.550144 -0.324 20 0.749Pair 3 F0 Min 29.990833 37.749193 8.897570 3.371 17 0.004Pair 4 F0 Max 16.233857 52.550037 11.467358 1.416 20 0.172Pair 5 Jitter -0.234238 0.782830 0.170828 -1.371 20 0.186Pair 6 Shimmer 0.181381 2.025149 0.441924 0.410 20 0.686Pair 7 HNR 0.234524 3.193329 0.696842 0.337 20 0.740

Results of the paired t-test, comparing samples of speaker F from the beginning andthe end within the dialogue with speaker C. F0 Min changed significantly, changesof F0 Mean showed a tendency for being significant.

Summary


CF (∗) ∗

Significant changes of voice correlates for the speakers C and F in their dialogue.Significant values are marked by ∗, tendencies are marked by (∗).

104

C.3 Dialogue H-F

C.3 Dialogue H-F

Speaker H









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker H from the beginnings and ends of the dialogue with speaker F.


Pair 1 F0 Mean 3.052750 10.513544 2.350900 1.299 19 0.210Pair 2 F0 SD 5.624190 11.547481 2.519867 2.232 20 0.037Pair 3 F0 Min 0.946045 53.930192 11.497956 0.082 21 0.935Pair 4 F0 Max 13.455136 43.213473 9.13144 0.082 21 0.159Pair 5 Jitter -0.214091 1.078386 0.229913 -0.931 21 0.362Pair 6 Shimmer -0.733773 2.405927 0.512945 -1.431 21 0.167Pair 7 HNR 0.319333 2.311162 0.504337 0.633 20 0.534

Results of the paired t-tests, comparing samples of speaker H from the beginningand the end. The changes in F0 SD are statistically significant.

105


Speaker F









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker F from the beginnings and ends of the dialogue with speaker H.



Results of the paired t-test, comparing samples of speaker F from the beginning andthe end within the dialogue with speaker H. No significant changes were found.

Summary


H ∗F

Significant changes of voice correlates for the speakers H and F in their dialogue.Significant values are marked by ∗, tendencies are marked by (∗).

106

C.4 Dialogue J-F

C.4 Dialogue J-F

Speaker J









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker J from the beginnings and ends of the dialogue with speaker F.


Pair 1 F0 Mean 9.699042 23.403858 4.777292 2.030 23 0.054Pair 2 F0 SD 6.418750 15.408817 3.145312 2.041 23 0.053Pair 3 F0 Min -25.626583 42.231507 8.620470 -2.973 23 0.007Pair 4 F0 Max 28.064875 60.295057 12.307677 2.280 23 0.032Pair 5 Jitter 0.140652 0.573483 0.119579 1.176 22 0.252Pair 6 Shimmer -0.218917 2.267400 0.462831 -0.473 23 0.641Pair 7 HNR -0.423958 2.833680 0.578423 -0.733 23 0.471

Results of the paired t-tests, comparing samples of speaker J from the beginning andthe end. Changes in of F0 Min and F0 Max were statistically significant, changesin F0 Mean and F0 SD showed a tendency of being significant.

107


Speaker F









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker F from the beginnings and ends of the dialogue with speaker J.


Pair 1 F0 Mean 2.143650 23.914759 5.347503 0.401 19 0.693Pair 2 F0 SD -4.666650 13.009558 2.909026 -1.604 19 0.125Pair 3 F0 Min 21.319579 45.457736 10.428720 2.044 18 0.056Pair 4 F0 Max 6.892350 44.333690 9.913314 0.695 19 0.495Pair 5 Jitter -0.044050 0.835138 0.186743 -0.236 19 0.816Pair 6 Shimmer -0.701579 2.775553 0.636756 -1.102 18 0.285Pair 7 HNR -0.178474 3.163230 0.725695 -0.246 18 0.809

Results of the paired t-tests, comparing samples of speaker F from the beginningand the end within the dialogue with speaker J. No significant changes were found.The parameter of F0 Min showed a tendency for changing significantly.

Summary


J (∗) ∗ ∗ ∗F (∗)

Significant changes of voice correlates for the speakers J and F in their dialogue.Significant values are marked by ∗, tendencies are marked by (∗).

108

C.5 Dialogue K-F

C.5 Dialogue K-F

Speaker K









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker K from the beginnings and ends of the dialogue with speaker F.


Pair 1 F0 Mean -4.056190 35.836419 7.820148 -0.519 20 0.610Pair 2 F0 SD -2.036762 14.710574 3.210110 -0.634 20 0.533Pair 3 F0 Min 9.935619 31.118958 6.790713 1.463 20 0.159Pair 4 F0 Max -15.592190 67.937001 14.25069 -1.052 20 0.305Pair 5 Jitter -0.239095 0.30475 0.159403 -1.500 20 0.149Pair 6 Shimmer -0.335048 1.755003 0.382973 -0.875 20 0.392Pair 7 HNR 0.713286 2.170863 0.473721 1.506 20 0.148

Results of the paired t-tests, comparing samples of speaker K from the beginningand the end. No significant changes were found.

109


Speaker F









Mean, standard deviation (SD) and standard error mean for the comparison of thesamples of speaker F from the beginnings and ends of the dialogue with speaker K.



Results of the paired t-tests, comparing samples of speaker F from the beginningand the end within the dialogue with speaker K. The parameters of F0 SD and HNRchanged statistically significant.

Summary


KF ∗ ∗

Significant changes of voice correlates for the speakers K and F in their dialogue.Significant values are marked by ∗, tendencies are marked by (∗).

110

D Results of Analysis 3 -Position in dialogue

D.1 Dialogue A-F

F0 Mean


beginning A 198.8632 24 11.60327 2.36851beginning F 205.6947 23 13.93698 2.90606

end A 202.67775 24 13.600890 2.776270end F 204.61132 22 13.096973 2.792284

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers A and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the mean fundamental frequency (F0 Mean) for thespeakers A and F at the beginning and at the end of the dialogue.

111

D Results of Analysis 3 - Position in dialogue

F0 SD



end A 37.54388 24 14.190584 2.896641end F 16.94927 22 14.190584 1.309774

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forthe standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of thespeakers A and F at the beginning and the end of the dialogue.



Results of the ANOVA for the standard deviation of the fundamental frequency (F0SD) for the speakers A and F at the beginning and at the end of the dialogue.

F0 Min



end A 98.86592 24 18.902999 3.858559end F 150.67495 22 29.876391 6.369668

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forminimum of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers Aand F at the beginning and the end of the dialogue.



Results of the ANOVA for the minimal fundamental frequency (F0 Min) for thespeakers A and F at the beginning and at the end of the dialogue.

112

D.1 Dialogue A-F

F0 Max



end A 306.25752 23 55.679642 11.610008end F 266.68143 21 22.423106 4.893123

Mean, standard deviation (SD) and standard error mean (St. Error Mean) formaximum of the fundamental frequency (F0 SD) in Hertz (Hz) of the speakers Aand F at the beginning and the end of the dialogue.



Results of the ANOVA for the maximal fundamental frequency (F0 Max) for thespeakers A and F at the beginning and at the end of the dialogue.

Jitter



end A 2.31452 23 0.378722 0.078969end F 1.94086 22 0.440796 0.093978

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitterin percent (%) of the speakers A and F at the beginning and the end of the dialogue.



Results of the ANOVA for jitter for the speakers A and F at the beginning and atthe end of the dialogue.

113


Shimmer



end A 8.70875 24 1.415823 0.289004end F 8.20800 22 1.314742 0.280304

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forshimmer in percent (%) of the speakers A and F at the beginning and the end ofthe dialogue.



Results of the ANOVA for shimmer for the speakers A and F at the beginning andat the end of the dialogue.

HNR



end A 16.78892 24 1.438183 0.293568end F 18.39323 22 1.868531 0.398372

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forharmonics-to-noise ratio (HNR) of the speakers A and F at the beginning and theend of the dialogue.



Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers Aand F at the beginning and at the end of the dialogue.

114

D.2 Dialogue D-F

D.2 Dialogue D-F

F0 Mean


beginning D 229.59724 25 12.807842 3.805483beginning F 217.70765 20 20.719282 4.285673

end D 232.93950 22 15.574690 2.730641end F 210.66168 22 20.719282 4.417366

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for meanof the fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers D and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the mean fundamental frequency (F0 Mean) for thespeakers D and F at the beginning and at the end of the dialogue.

F0 SD



end D 28.94427 22 12.411122 2.646060end F 22.28627 22 8.167158 1.741244

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forthe standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of thespeakers D and F at the beginning and the end of the dialogue.



Results of the ANOVA for the standard deviation of the fundamental frequency (F0SD) for the speakers D and F at the beginning and at the end of the dialogue.

115


F0 Min



end D 154.76977 22 50.631920 10.794762end F 136.24964 22 37.642780 8.025468

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forminimum of the fundamental frequency (F0 Min) in Hertz (Hz) of the speakers Dand F at the beginning and the end of the dialogue.



Results of the ANOVA for the minimal fundamental frequency (F0 Min) for thespeakers D and F at the beginning and at the end of the dialogue.

F0 Max



end D 330.85964 22 31.889633 6.798893end F 295.33745 22 58.697028 12.514248

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themaximum of the fundamental frequency (F0 Max) in Hertz (Hz) of the speakers Dand F at the beginning and the end of the dialogue.



Results of the ANOVA for the maximal fundamental frequency (F0 Max) for thespeakers D and F at the beginning and at the end of the dialogue.

116

D.2 Dialogue D-F

Jitter



end D 1.85718 22 0.377235 0.080427end F 1.90209 22 0.463344 0.098785

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitterin percent (%) of the speakers D and F at the beginning and the end of the dialogue.



Results of the ANOVA for jitter for the speakers D and F at the beginning and atthe end of the dialogue.

Shimmer



end D 6.75695 22 0.938097 0.200003end F 6.72559 22 1.303650 0.277939

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themaximum of the fundamental frequency (F0 Max) in Hertz (Hz) of the speakers Dand F at the beginning and the end of the dialogue.



Results of the ANOVA for shimmer for the speakers D and F at the beginning andat the end of the dialogue.

117


HNR



end D 18.39091 22 1.756843 0.374560end F 20.40582 22 1.846650 0.393707

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theharmonics-to-noise ratio (HNR) in decibel (dB) of the speakers D and F at thebeginning and the end of the dialogue.



Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers Dand F at the beginning and at the end of the dialogue.

D.3 Dialogue H-F

F0 Mean


beginning H 196.24725 24 7.490674 1.529027beginning F 199.75152 23 14.941827 3.115586

end H 193.90176 21 8.212989 1.792221end F 195.57645 22 11.038573 2.353432

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themean of the fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers H andF at the beginning and the end of the dialogue.



Results of the ANOVA for the mean fundamental frequency (F0 Mean) for thespeakers H and F at the beginning and at the end of the dialogue.

118

D.3 Dialogue H-F

F0 SD



end H 19.05981 21 6.245378 1.362853end F 14.27959 22 2.294422 0.489172

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forstandard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of thespeakers H and F at the beginning and the end of the dialogue.



Results of the ANOVA for the standard deviation of the fundamental frequency (F0SD) for the speakers H and F at the beginning and at the end of the dialogue.

F0 Min



end H 137.51386 22 33.081591 7.053019end F 160.54162 21 11.656008 2.543549

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theminimum of the fundamental frequency (F0 Min) in Hertz (Hz) of the speakers Hand F at the beginning and the end of the dialogue.



Results of the ANOVA for the minimal fundamental frequency (F0 Min) for thespeakers H and F at the beginning and at the end of the dialogue.

119


F0 Max



end H 273.86282 22 29.113741 6.207070end F 257.80345 22 14.800031 3.155377

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themaximum of the fundamental frequency (F0 Max) in Hertz (Hz) of the speakers Hand F at the beginning and the end of the dialogue.



Results of the ANOVA for the maximal fundamental frequency (F0 Max) for thespeakers H and F at the beginning and at the end of the dialogue.

Jitter


beginning H 2.24668 25 .574052 0.114810beginning F 2.30130 23 0.620540 0.129392

end H 2.41505 22 0.707500 0.150840end F 2.28414 22 0.699479 0.149129

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitterin percent (%) of the speakers H and F at the beginning and the end of the dialogue.



Results of the ANOVA for jitter for the speakers H and F at the beginning and atthe end of the dialogue.

120

D.3 Dialogue H-F

Shimmer



end H 7.73232 22 1.708574 0.364269end F 7.31000 22 2.131832 0.454508

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forshimmer in percent (%) of the speakers H and F at the beginning and the end ofthe dialogue.



Results of the ANOVA for shimmer for the speakers H and F at the beginning andat the end of the dialogue.

HNR



end H 18.90482 22 1.526171 0.325381end F 19.49045 22 2.408319 0.513455

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theharmonics-to-noise ratio (HNR) in decibel (dB) of the speakers H and F at thebeginning and the end of the dialogue.



Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers Hand F at the beginning and at the end of the dialogue.

121


D.4 Dialogue J-F

F0 Mean


beginning J 229.51 26 17.16 3.36beginning F 211.08 23 13.02 2.71

end J 219.89 24 15.17 3.10end F 210.07 20 18.06 4.04

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers J and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the mean fundamental frequency (F0 Mean) for thespeakers J and F at the beginning and at the end of the dialogue.

FO SD



end J 29.16 24 10.22 2.09end F 23.24 20 10.58 2.37

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forthe standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of thespeakers J and F at the beginning and the end of the dialogue.



Results of the ANOVA for the standard deviation of the fundamental frequency (F0SD) for the speakers J and F at the beginning and at the end of the dialogue.

122

D.4 Dialogue J-F

F0 Min



end J 147.47 24 28.60 5.84end F 143.44 20 38.52 8.61

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theminimal fundamental frequency (F0 Min) in Hertz (Hz) of the speakers J and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the minimal fundamental frequency (F0 Min) for thespeakers J and F at the beginning and at the end of the dialogue.

F0 Max



end J 324.47 24 42.04 8.58end F 273.05 20 30.69 6.86

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themaximal fundamental frequency (F0 Max) in Hertz (Hz) of the speakers J and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the maximal fundamental frequency (F0 Max) for thespeakers J and F at the beginning and at the end of the dialogue.

123


Jitter



end J 2.17 23 0.29 0.06end F 1.92 20 0.58 0.13

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitterin percent (%) of the speakers J and F at the beginning and the end of the dialogue.


Intercept 364.782 1 364.782 1380.755 0.000Position 0.021 1 0.021 0.080 0.778Error 21.928 83 .264

Results of the ANOVA for jitter for the speakers J and F at the beginning and atthe end of the dialogue.

Shimmer



end J 9.51 24 1.72 0.35end F 7.14 20 2.15 0.48

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forshimmer in percent (%) of the speakers J and F at the beginning and the end of thedialogue.



Results of the ANOVA for shimmer for the speakers J and F at the beginning andat the end of the dialogue.

124

D.5 Dialogue K-F

HNR



end J 16.73 24 2.05 0.42end F 20.46 19 2.25 0.52

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theharmonics-to-noise ratio (HNR) in decibel (dB) of the speakers J and F at thebeginning and the end of the dialogue.



Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers Jand F at the beginning and at the end of the dialogue.

D.5 Dialogue K-F

F0 Mean


beginning K 192.83 22 16.13 3.44beginning F 198.34 20 19.25 4.31

end K 196.43 21 27.22 5.94end F 195.55 22 17.05 3.63

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themean fundamental frequency (F0 Mean) in Hertz (Hz) of the speakers K and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the mean fundamental frequency (F0 Mean) for thespeakers K and F at the beginning and at the end of the dialogue.

125


F0 SD



end K 27.96 21 9.91 2.16end F 19.92 22 9.57 2.04

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forthe standard deviation of the fundamental frequency (F0 SD) in Hertz (Hz) of thespeakers K and F at the beginning and the end of the dialogue.



Results of the ANOVA for the standard deviation of the fundamental frequency (F0SD) for the speakers K and F at the beginning and at the end of the dialogue.

F0 Min



end K 125.87 21 21.44 4.68end F 142.12 22 26.30 5.61

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theminimal fundamental frequency (F0 Min) in Hertz (Hz) of the speakers K and F atthe beginning and the end of the dialogue.



Results of the ANOVA for the minimal fundamental frequency (F0 Min) for thespeakers K and F at the beginning and at the en of the dialogue.

126

D.5 Dialogue K-F

F0 Max



end K 296.80 21 48.85 10.66end F 268.95 22 38.98 8.31

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for themean fundamental frequency (F0 Max in Hertz (Hz) of the speakers K and F at thebeginning and the end of the dialogue.



Results of the ANOVA for the maximal fundamental frequency (F0 Max) for thespeakers K and F at the beginning and at the end of the dialogue.

Jitter



end K 2.53 21 0.54 0.12end F 2.02 22 0.53 0.11

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for jitterin percent (%) of the speakers K and F at the beginning and the end of the dialogue.



Results of the ANOVA for jitter for the speakers K and F at the beginning and atthe end of the dialogue.

127


Shimmer



end K 7.48 21 1.10 0.24end F 7.04 22 1.03 0.22

Mean, standard deviation (SD) and standard error mean (St. Error Mean) forshimmer in percent (%) of the speakers K and F at the beginning and the end ofthe dialogue.



Results of the ANOVA for shimmer for the speakers K and F at the beginning andat the end of the dialogue.

HNR



end K 17.85 21 1.58 0.35end F 20.30 22 1.81 0.39

Mean, standard deviation (SD) and standard error mean (St. Error Mean) for theharmonics-to-noise ratio (HNR) in decibel (dB) of the speakers K and F at thebeginning and the end of the dialogue.



Results of the ANOVA for the harmonics-to-noise ratio (HNR) for the speakers Kand F at the beginning and at the end of the dialogue.

128

Accommodation of voice parameters in dialogue

Documents

Transcript of Accommodation of voice parameters in dialogue