Differential Short-Term Memorization for Vocal and Instrumental Rhythms
Running head: SHORT-TERM MEMORY FOR VOCAL & INSTRUMENTAL RHYTHMS
Differential Short-Term Memorization for Vocal and Instrumental Rhythms
Niall A.M. Klyn**, Udo Will*†, YongJeon Cheong*, and Erin T. Allen*
*School of Music, The Ohio State University, Columbus OH, USA **Department of Speech and Hearing Science, The Ohio State University, Columbus OH, USA
School of Music. 1866 N. College Road, 110 Weigel Hall, Columbus OH, 43210. 614-292-6389. Department of Speech and Hearing Science. 1070 Carmack Road, 110 Pressey Hall, Columbus OH, 43210. 614-292-8207.
[email protected]; [email protected]; [email protected]; [email protected]
This is a prepublication version of the article to be published by Taylor & Francis in Memory (online publication date 08/17/2015), available online: http://www.tandfonline.com/10.1080/09658211.2015.1050400.
e-prints of the published article are available at:
http://www.tandfonline.com/eprint/rHjz2dnkHq33EaiMQPm3/full
† Corresponding author: Udo Will, School of Music. 1866 N. College Road, 101G Weigel Hall,
Columbus OH, 43210. Phone: 614-292-6389, email: [email protected]
Differential Short-Term Memorization for Vocal and Instrumental Rhythms
This study explores differential processing of vocal and instrumental rhythms in short-term memory with three decision (same/different judgment) experiments and one reproduction experiment. In the first experiment, memory performance declined for delayed versus immediate recall, with accuracy for the two rhythm types affected differently: musicians performed better than non-musicians on clapstick but not on vocal rhythms, and musicians were better on vocal rhythms in the same than in the different condition. Results for experiment two showed that concurrent sub-vocal articulation and finger-tapping differentially affected the two rhythm types and same/different decisions, but produced no evidence for articulatory-loop involvement in delayed decision tasks. In a third experiment, which tested rhythm reproduction, concurrent sub-vocal articulation decreased memory performance, with a stronger deleterious effect on the reproduction of vocal than of clapstick rhythms. This suggests that the articulatory loop may be involved only in delayed reproduction, not in decision tasks. The fourth experiment tested whether differences between filled and empty rhythms (continuous vs. discontinuous sounds) can explain the different memorization of vocal and clapstick rhythms. Though significant differences were found between empty and filled instrumental rhythms, the differences between vocal and clapstick rhythms can only be explained by considering additional voice-specific features.
Keywords: short-term memory, vocal rhythm, instrumental rhythm, rhythm encoding, articulatory loop, task dependency
Introduction:
Research on processing of sounds from different sources (Belin, Zatorre, Lafaille, Ahad, & Pike,
2000; Bent, Bradlow, & Wright, 2006; Levy, Granot, & Bentin, 2001, 2003; Vouloumanos,
Kiehl, Werker, & Liddle, 2001; Zatorre, Belin, & Penhune, 2002) indicates that sound source
identification is a well-developed human ability, and that vocal and non-vocal sounds are
processed differently. In addition, vocal and instrumental melodic contours have been shown to
differentially affect performance of word repetition tasks in tone language speakers (Poss, 2012;
Poss, Hung, & Will, 2008). However, studies on auditory rhythm have not systematically
investigated whether sounds from different sources, e.g. vocal versus non-vocal, lead to
differential rhythm processing. Studies have investigated temporal coding (Deutsch, 1986; Povel,
1981; Povel & Essens, 1985), implicitly assuming that rhythm processing is an essentially
amodal process, though there is evidence that modality of stimulus presentation affects temporal
encoding (Glenberg & Jona, 1991). A recent study by Hung (2011) advanced evidence that even
within the auditory modality rhythm processing is not independent of features of the sound
source. Her study, in which subjects had to decide whether two sequentially presented rhythms
were the same or different, found significant behavioural (reaction time, accuracy) and imaging
(fMRI) differences for vocal and instrumental rhythms.
The current study extends research on rhythm memory by asking whether vocal and
instrumental rhythms are processed differently in short-term memory. It links questions arising
from recent timing research that suggests that multiple distinct processes underlie timing (Coull,
Cheng, & Meck, 2010; Wiener, Matell, & Coslett, 2011), with questions of how rhythms
produced by different sources are represented and maintained in memory. Consequently, we
investigate how differential processing of vocal and instrumental rhythms is affected by retention
span, musical training, the type of responses to be made (same/different decision or
reproduction), distractor tasks, and acoustic features of the sounds that form the rhythms. The
current study presents four experiments that investigate short-term memory for vocal and
clapstick rhythms: experiment 1 examines changes in rhythm discrimination at short (0.5s) and
long (12.5s) inter-stimulus intervals, whether there is trace decay for rhythm and whether vocal
and clapstick rhythms are affected in similar ways; experiment 2 investigates a potential
rehearsal mechanism for rhythm memory by examining the effect of concurrent motor tasks on
rhythm discrimination; experiment 3 examines effects of concurrent motor tasks on rhythm
reproduction; finally, experiment 4 investigates the role of one acoustical difference between
clapstick and vocal sounds, that of “empty” and “filled” rhythms. Empty rhythms use sounds
with a brief onset and offset. Thus, there is no steady state in between events. Filled rhythms,
then, are those in which the onsets are followed by a more continuous sound before the offsets,
which may coincide with the following event onset. The analytical procedures we use are guided
by our data and the choice of experimental variables. It might seem that a signal detection (SDT)
approach would be appropriate for this study, as SDT analyses have been used for same/different
decision experiments (Macmillan & Creelman, 2005). Classic signal detection theory assumes
that recognition judgments can be analysed using a unidimensional value of familiarity.
However, there is evidence that other sources, e.g. neural synchronization and accumulated
information, can drive new/old or same/different decisions (Johns, Jones, & Mewhort, 2012;
Finnigan, Humphreys, Dennis, & Geffen, 2002; Mewhort & Johns, 2005) and that these
decisions are based on different cognitive processes involving different types of comparison
(Bagnara, Boles, Simion, & Umiltà, 1982; Keuss, 1977; Markman & Gentner, 2005). As we
were particularly interested in whether and how the processes underlying same/different
decisions affect rhythm memory processing we did not use SDT for our analysis. An SDT
analysis would not permit considering the same/different factor as a separate experimental
variable, instead collapsing the factor levels as two distributions on the same underlying
parameter. We did, however, perform supplementary bias analyses for the first two experiments
(see results section for experiment 1 and 2) to test whether our results could be explained by bias
changes in subjects’ decision making.
The experiments were approved by the Institutional Review Board of The Ohio State
University.
Experiment 1:
1.1. Introduction:
Hung’s (2011) experiment provided evidence of differential effects of clapstick and vocal rhythms on
same/different decision tasks. Given that these differences were found at the short ISI (~0.5s), we ask
whether these differences persist or change for longer retention spans. There is ample evidence
to show that auditory memory performance degrades over longer periods (for reviews, cf.
Baddeley, 1990, 2010; Cowan, 1997, 2008; Neath & Surprenant, 2003), but it is unclear whether
this also holds for rhythm memory and whether memory for vocal and clapstick rhythms will be
affected differentially. Percussion rhythms, like those of the clapstick used here, may be simply
described by their relative onset and offsets, while vocal rhythms include additional features like
changes in pitch, frequency spectra and timbre. This greater complexity could induce greater
degradation in memory and we hypothesized that for delayed responses the performance on the
vocal rhythm task would be significantly degraded compared to the clapstick task.
A further question that arises concerns the influence of musical training (Hung, 2011 only
examined trained musicians). Differences between musicians and non-musicians have received a
fair bit of attention (Gaser & Schlaug, 2003; Koelsch, Schröger, & Tervaniemi, 1999;
Musacchia, Sams, Skoe, & Kraus, 2007; Musacchia, Strait, & Kraus, 2008; Parbery-Clark, Skoe,
Lam, & Kraus, 2009; Schaal, Banissy, & Lange, 2015; Zatorre, 1998), but it is unclear whether
any effects of musical training will extend to differential processing in memory for rhythm.
Several studies have shown superior performance on memory tasks by musicians (Jakobson,
Lewycky, Kilgour, & Stoesz, 2008; Kilgour, Jakobson, & Cuddy, 2000; Schaal et al., 2015;
Tervaniemi, Rytkönen, Schröger, Ilmoniemi, & Näätänen, 2001), which would indicate that
musically trained subjects should exhibit generally superior performance at longer ISIs than non-musicians, but whether this will affect memory for vocal versus clapstick rhythms is uncertain.
Moreover, musical training may specifically improve memory for rhythm by fostering strategies
like categorization of rhythmic intervals in terms of a set of learned duration values (e.g. quarter
note, eighth note, etc.) and/or visualized representation of rhythms using musical notation.
Indeed, in a recent article Schaal et al. (2015) proposed a rhythm span task that uses a
same/different decision task to estimate the number of "rhythm elements" that listeners can retain
over a 2 second delay. In the first experiment using their novel procedure, musicians had a
significantly longer estimated rhythm span than non-musicians. Due to the substantial amount of
ear-training most university-level musicians receive and the previous evidence of superior
working memory in musicians, we hypothesized that musicians would show superior accuracy
for all conditions in the experiment. As we are unaware of any evidence for differences in
reaction time (henceforth RT) between these groups, we hypothesized there would be no
difference in our study. Finally, we anticipated that our results would be consistent with previous
research in which same/different decisions led to significantly different RTs (Proctor, 1981).
1.2. Methods:
Participants:
For the purposes of our study, subjects were classified as musicians if they satisfied all three of the following
criteria: 1) at least 5 years of formal musical training, 2) ongoing active musical engagement at
the time of participation, and 3) self-identification as a musician. Any participant who did not
meet all three criteria was therefore categorized as a non-musician. Our musicians played a wide
range of instruments, but this study was not designed to detect potential differences between
different instrumentalists, different durations of training, or techniques. Therefore no attempt was
made to analyse the data along these lines. Twenty-five participants who reported normal hearing
took part in the first experiment; 10 non-musicians (avg. years of musical training = 0.66 years,
min. 0, max. 4; 6 female; avg. age = 24 years, min. 20, max. 30), and 15 musicians (avg. years of
musical training = 12.3 years, min. 5, max. 20; 4 female; avg. age = 23 years, min. 18, max. 31;
1 left-handed). All subjects of this study were members of the Ohio State University community
who participated for either monetary compensation or course credit, and all signed informed
consent prior to participation.
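For illustration, the three-part inclusion rule described above can be sketched as a simple predicate. This is an editorial illustration, not part of the original study materials; the function and argument names are our own.

```python
def is_musician(years_of_training, actively_engaged, self_identifies):
    """Classify a participant as a musician only if all three study
    criteria hold; anyone failing any criterion is a non-musician."""
    return (years_of_training >= 5  # 1) at least 5 years of formal training
            and actively_engaged    # 2) ongoing active musical engagement
            and self_identifies)    # 3) self-identification as a musician
```

Note that the rule is conjunctive: a participant with, say, 10 years of training who no longer plays is classified as a non-musician.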
Stimuli:
The stimuli for the following experiments were created using excerpts from a CD of field
recordings of the Dyirbal people from Queensland, Australia (Dixon & Koch, 1996) by Hung
(2011). These recordings were selected to use real-world stimuli and to avoid possible confounds
due to semantic processing and musical familiarity. From 10 recordings of male performances
we selected brief excerpts (mean length 1.73s, range 1.5-2s). To ensure a consistent clapstick
signal for all vocal excerpts, one clapstick sound from the CD was used to replace the existing
clapstick sounds if the recording had a pre-existing accompaniment or to insert a new pattern if
the recording was a solo voice excerpt. Any new clapstick pattern was constructed following one
of the clapstick patterns available on the CD. The combination of voice and clapstick rhythms
was done to assure identical stimulus input for the main tasks, the comparison of two subsequent
vocal or instrumental rhythm patterns: with identical stimuli, any difference in results cannot be
attributed to differences in the stimuli. The resulting 10 parent stimuli were subsequently
modified to produce three variant stimuli for each parent, one with vocal rhythm changed but
clapstick the same, one with vocal rhythm the same but clapstick rhythm changed, and one with
a change to both the clapstick and the voice. If changes were only introduced to one rhythm and
not the other, subjects could more easily use the non-task rhythm for comparison when making a
decision. Therefore we decided to include all four types of variants in this study. The variant
rhythms were created by modifying the timing of one event or by adding or removing one event,
with modifications equally distributed over the length of the sound files. The 10 parent rhythms
were then grouped with either themselves or one of the three variants, resulting in 37 stimuli –
three variants were excluded due to unacceptably high error rates for these variants found by
Hung (2011). Figure 1 presents an illustration of the stimulus creation process (fig. 1a), as well
as the waveforms of two stimuli as they were presented to our subjects, one with a difference in
the clapstick (fig. 1c) and one with a difference in the vocal rhythm (fig. 1b). Both sample
stimuli are available for listening online (https://soundcloud.com/ethnomusicology-
osu/sets/stimuli-for-differential-short).
Figure 1, Stimuli from experiment 1. A: Schematic diagram showing the creation of a "parent" rhythm and its variants. Four waveforms are created from one parent combination of clapstick (c.) and voice (v.). Variants in which the rhythm is the same are marked "=," altered rhythms are marked "≠." B: vocal rhythm = different, clapstick rhythm = same. C: vocal rhythm = same, clapstick rhythm = different. Arrows mark the points of difference.
Using the stimuli described above, ISIs of 0.5s and 12.5s were chosen, in order to place
them within and outside of the phonological store of Baddeley and Hitch’s (1974) model for
memory and at the upper and lower end of Cowan’s (1984) auditory memory. Presumably,
comparisons at the short ISI (immediate recall task) will make use of some form of echoic
memory (Neisser, 1967), whereas the long ISI (delayed recall task) will require some other
mechanisms. Stimuli were grouped in four blocks of 18 or 19 stimuli, and the duration and task
variables kept constant within each block.
Equipment:
The experiment was conducted on a Sony Vaio laptop running the Windows XP environment.
Stimuli were presented with the DMDX software package (Forster & Forster, 2003) and
responses were recorded via the built-in track pad buttons labelled with the corresponding
response; “=” for same, “≠” for different. RT and correct-incorrect status were recorded for each
response. Stimuli were presented via Sony MDR-V200 headphones connected to the laptop’s
built-in headphone jack.
Procedure:
Trial stimuli, not part of the experimental set, were presented with verbal instructions for the
experiment, and the subject's understanding of the task was evaluated informally. During this
practice phase the subjects adjusted the volume to a comfortable listening level. Over the course
of the test phase stimuli at each ISI were presented twice to every subject, once with the task to
judge the vocal rhythm and once with the task to judge the clapstick rhythm. In each case the
subjects were instructed to ignore the other instrument and pay attention only to the target rhythm,
vocal or clapstick. Block order and experimental tasks were balanced across subjects to reduce
the impact of any possible learning effects on the results. At the beginning of each block,
instructions for that block would appear on the screen, asking either “is the VOICE the same?”
or “is the CLAPSTICK the same?” On each trial the first stimulus of a pair was presented,
followed by a pause of the appropriate duration for the block (0.5s or 12.5s), followed by the
second stimulus of the pair. The subjects were asked to make a decision as quickly and
accurately as they could, using the first and second fingers of their dominant hand to push the
appropriate trackpad buttons. The experiment lasted approximately 60 minutes including the
instruction period. After completion of the final block subjects were debriefed.
1.3. Results:
Throughout the current study we use the following set of experimental variables (names
capitalized for unequivocal identification): Our main variable, ‘TIMBRE,’ will refer to subjects
being asked to respond to either vocal or clapstick rhythms. ‘RETENTION’ will refer to the
inter-stimuli interval (ISI) between the two stimuli of each presentation pair, and takes short and
long values. A third variable, ‘CONGRUITY,’ refers to the task-relevant (TIMBRE) rhythms of a stimulus pair
being either the ‘same’ or ‘different’. The fourth variable, ‘TRAINING,’ refers to musical
training, has two levels, ‘musician’ and ‘non-musician,’ and allows us to detect whether coding
and representation differences due to musical training affect vocal and instrumental rhythm
memorization.
Reaction Time:
A repeated measurement ANOVA with a between-subject factor TRAINING (musician/non-
musician) and within-subject factors of TIMBRE (clap/voice), RETENTION (long/short), and
CONGRUITY (same/different) was performed on the RT data after exclusion of all RTs greater
than three standard deviations from the mean, and after averaging each subject’s multiple
responses for each factor combination to account for the interdependencies of the repeated
measures. The ANOVA was thus performed on means of means, which were normally
distributed. Due to the unequal sample size of the between-subject factor, weighted means were
used in calculating the ANOVA.
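The "means of means" step described above (collapsing each subject's repeated responses to one value per factor combination before the ANOVA) can be sketched as follows. This is our own minimal illustration, not the authors' analysis code, and the trial-tuple format is assumed for the example:

```python
from collections import defaultdict
from statistics import mean

def per_subject_cell_means(trials):
    """Collapse repeated measurements: average each subject's multiple RTs
    within each factor combination, yielding one mean per subject per cell.
    `trials` is an iterable of (subject_id, condition, rt) tuples."""
    cells = defaultdict(list)
    for subject, condition, rt in trials:
        cells[(subject, condition)].append(rt)
    return {cell: mean(rts) for cell, rts in cells.items()}
```

The ANOVA is then run on these cell means rather than on raw trials, which accounts for the interdependence of repeated measures within a subject.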
RETENTION and CONGRUITY were found to be significant at α = .05 and other main
factors failed to reach significance. There was also a significant 2-way interaction for
TRAINING and TIMBRE. These results are summarized in Table 1.
(Insert Table 1 here)
The significant effect of RETENTION had a size (mean response difference) of 0.54s
(Cohen’s d = 1.16) and that of CONGRUITY had a size of 0.74s (Cohen’s d = 1.99). The means
for RETENTION were 1.44s (long; C.I. 1.35 to 1.53) and 0.9s (short; C.I. 0.83 to 0.87); for
CONGRUITY they were 0.80s (same; C.I. 0.74 to 0.87) and 1.54s (different; C.I. 1.46 to 1.61),
and the difference between the two was not affected by the length of the retention span (fig.2A).
Musicians were faster for clapstick (1.15s; C.I. 1.03 to 1.27) than non-musicians (1.21s; C.I. 1.08
to 1.34), whereas musicians were slower for vocal rhythms (1.19s; C.I. 1.08 to 1.31) than non-musicians (1.13s; C.I. 1.00 to 1.26; fig.2B). The range of our RTs did not seem too surprising, as similar ranges have been reported by studies with tasks of comparable complexity. Malmberg
and Xu (2007) reported RTs between 1.7 and 2.5s for memory recognition tasks, Will,
Nottbusch, and Weingarten (2006) reported initial latencies in picture naming and word typing
tasks of 1.6 and 1.2s, respectively, and Alvarez, Cottrell, and Afonso (2009) reported means of
1.5s for similar tasks.
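For reference, the Cohen's d values reported above express a mean difference in pooled-standard-deviation units; a minimal sketch of the standard two-group formula follows. This is an editorial illustration (the article does not state which d variant was computed; shown here is the common pooled-SD form):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two groups: mean difference divided by the
    pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd
```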
Figure 2, A: Reaction time interaction plot from experiment 1 for RETENTION:CONGRUITY. RT (s): reaction time in seconds, long: 12.5s RETENTION, short: 0.5s RETENTION, d: ‘different’ rhythm, s: ‘same’ rhythm. B: Reaction time interaction plot from experiment 1 for
TRAINING:TIMBRE. clap: clapstick rhythm, vocal: vocal rhythm, m: musicians, n: non-musicians. Error bars represent 1 standard error.
Accuracy:
We used R (R Development Core Team, 2012) and the lme4 package (Bates, 2010; Bates,
Maechler, Bolker & Walker, 2014) to assess the relationship between responses and
experimental variables in a generalized linear mixed model with a binomial error distribution and
a logit link function. As GLMMs in R do not output confidence intervals (for rationale see Baayen
et al., 2008), we list P>|z| values as measures of uncertainty of the parameter estimates in our
result tables. As fixed effects, we used TRAINING, RETENTION, TIMBRE, and
CONGRUITY and as random effect we used the intercepts for subjects, nested within
TRAINING (two levels). Residual plots did not reveal any obvious deviations from
homoscedasticity or normality. The initial full model was simplified by sequential removal of non-significant components. The nesting of subjects within factor TRAINING was not significant (χ2
= 0.171; p = .918) and was not included in the fitted model. The minimal significant model was
then subjected to an ANOVA and the probabilities for the F-values determined via the
Satterthwaite approximation (using the lmerTest package; Kuznetsova, Brockhoff, &
Christensen, 2014). Following Baguley (2009), both standardized (z-values for the estimated
coefficients) and simple effect sizes (mean response differences) are reported, and all figures
show raw data values and SE bars.
The ANOVA results showed that all main factors, the two-way interactions
TRAINING:TIMBRE, RETENTION:TIMBRE, TIMBRE:CONGRUITY, the three-way
interaction of TRAINING:TIMBRE:CONGRUITY, and the four-way interaction of the main
factors were significant. Estimated standardized effect sizes for the model are shown in table 2,
and the simple effect sizes (mean response differences) indicate that correct responses for
musicians were 7.7 percentage points (pp) higher than for non-musicians, that the short
RETENTION produced 5.1 pp more correct responses than the long RETENTION, that
responses to clapstick rhythms were 10.9 pp higher than to vocal rhythms, and that ‘different’
stimuli were correctly identified 4.3 pp more often than ‘same’ pairs.
(Insert Table 2 here)
However, due to the TIMBRE:CONGRUITY interaction, clapsticks showed a larger
difference (8.2 pp) between same and different decisions than did vocal rhythms (1.1 pp). The
TRAINING:TIMBRE interaction was due to the larger difference between clapstick and vocal
rhythms in musicians (15 pp) than in non-musicians (4.8 pp). The RETENTION:TIMBRE
interaction was caused by the stronger effect of RETENTION on vocal rhythms, which were
8.3 pp better for the short than for the long RETENTION, while the difference for clapsticks was
only 1.7 pp. The three-way interaction TRAINING:TIMBRE:CONGRUITY reflected the fact
that musicians showed a larger difference for same and different responses to clapstick rhythms
(10.5 pp) than non-musicians (4.1 pp), whereas the difference for vocal rhythms was minimal for
both groups (0.8 pp and 0.1 pp, respectively). The 4-way interaction between the main factors
was attributable to the fact that for the short retention span the difference between ‘same’ and
‘different’ responses was 2.4 pp for vocal rhythms and 5.0 pp for clapstick rhythms, whereas for
the long span they were 1.3 pp and 10.9 pp, respectively. In addition, the difference between
‘same’ and ‘different’ responses for clap and vocal rhythms was affected by TRAINING: for the
long retention span the musicians’ different decisions for vocal rhythms were 27.7 pp lower than
for clapstick, whereas same decisions were only reduced by 8.8 pp, and for non-musicians the
corresponding differences were 9.4 pp for different and 7.3 pp for same decisions (see fig. 3A).
Figure 3, A: Accuracy interaction plot for RETENTION:TIMBRE split by TRAINING from experiment 1. Accuracy(%): percentage correct, long: 12.5s RETENTION, short: 0.5s RETENTION, clap: clapstick rhythm, and vocal: vocal rhythm. B: Accuracy interaction plot for TIMBRE:CONGRUITY from experiment 1, clap: clapstick rhythm, and vocal: vocal rhythm, d: ‘different’ rhythm, s: ‘same’ rhythm. Error bars represent 1 standard error.
As it could be argued that our results might be due to changes in subjects’ bias to make
‘same’ or ‘different’ decisions, we additionally calculated the response bias measure β
(Macmillan & Creelman, 2005) for all main factor levels using the sdtalt R package (Wright,
Horry, & Skagerberg, 2009) and compared the two β for each factor level using t tests. We found
no bias difference between musicians and non-musicians or between short and long retention
spans. Only the t test for TIMBRE for musicians in the long RETENTION condition was
significant (t = -3.49, df = 14, p = .003); all other t tests showed p values between .08 and .67.
Closer inspection of this significant case showed that this effect was due to a change in the
proportion of correct responses for vocal rhythms in the different condition for the long
RETENTION as compared to short RETENTION. However, no corresponding bias changes for
the musicians in the short RETENTION, or in any of the tests for the non-musicians were found,
and a response bias change does not seem to contribute to these results.
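The bias measure β used above is computed from hit and false-alarm rates via the inverse normal CDF. A minimal sketch follows; this is our own illustration standing in for the sdtalt R package actually used in the analysis:

```python
import math
from statistics import NormalDist

def response_bias_beta(hit_rate, false_alarm_rate):
    """Signal-detection bias: beta = exp((zF**2 - zH**2) / 2), where zH and
    zF are the z-transformed hit and false-alarm rates. beta == 1 indicates
    an unbiased criterion."""
    z = NormalDist().inv_cdf
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    return math.exp((z_fa**2 - z_hit**2) / 2)
```

Equivalently, ln β = d′ · c with d′ = zH − zF and c = −(zH + zF)/2, so symmetric hit and false-alarm rates (e.g. .80 and .20) yield β = 1.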
1.4. Discussion:
In this experiment we found that the length of the memorization period affected both RT and
accuracy of the responses. Delayed recall resulted in a significant increase of RT compared to
immediate recall, with an overall increase of 0.541s and a significant reduction in accuracy by
5.1 pp. Rhythm memory performance declines with prolongation of the retention period from
0.5s to 12.5s. With the experimental design applied here (presentation of 1st stimulus → retention
period → presentation of 2nd stimulus → decision), it seems difficult to explain the observed
memory decay as an effect of distractors or interference during the retention period, as argued by
Nairne (2002). Because the retention periods did not involve interventions of any sort, a more
fitting explanation is an inherent instability of memory traces for rhythms. Evidence for decay of
an auditory memory trace has also recently been presented for the
frequency and intensities of sounds (Mathias, Micheyl & Shinn-Cunningham, 2014; Mercer &
McKeown, 2014). We will return to this topic in the general discussion.
Consistent with previous findings (Proctor, 1981) RTs for ‘different’ decisions were
longer by 0.71s than ‘same’ decisions. We also found a significant effect of CONGRUITY on
accuracy, with ‘different’ stimuli identified correctly 75.1% of the time, whereas the ‘same’ stimuli
were correctly identified 70.9% of the time. Neither the RT difference nor the accuracy
difference between ‘same’ and ‘different’ decision was affected by RETENTION. Interestingly,
though accuracy performance for clapstick only dropped by 1.7 pp from short to long
RETENTION, performance for vocal rhythm dropped by 8.3 pp, causing the significant
RETENTION:TIMBRE interaction.
Though RTs did not show a significant effect for TRAINING, musicians were
significantly more accurate (76.0%) than non-musicians (68.3%). This appears to be consistent
with our initial hypothesis that musical training might confer a general advantage. However, the
significant TRAINING:TIMBRE interaction showed that accuracy differences between
musicians and non-musicians are remarkably larger for clapstick (15.0 pp) than for vocal
rhythms (4.8 pp). The significant 2-way interaction of TRAINING:TIMBRE for RT showed that
musicians were faster than non-musicians on clapstick but slower on vocal rhythms. These
accuracy and RT differences between the two participant groups indicate that musical training
tends to be more of an advantage for extracting and memorizing clapstick than vocal rhythms, a
possible explanation for which is explored in the general discussion.
For the accuracy data, both the TRAINING:TIMBRE:CONGRUITY interaction and the
four-way interaction between the main factors were mainly due to the
different response rates of musicians and non-musicians for ‘different’ decisions on vocal and
clapstick rhythms, with musicians showing a larger difference (27.7 pp) for the two rhythms than
non-musicians (9.4 pp). For ‘same’ decisions the differences for the groups were quite similar,
8.8 pp and 9.4 pp, respectively. An explanation for this interaction could be that for ‘different’
decisions musicians, due to training, might make use of different stimulus representations (e.g.
perceptual categorizations, or feature extractions and conceptualizations) than non-musicians.
Different encodings - the forms in which information is placed in memory - also might be
involved in the significant interaction between TIMBRE and CONGRUITY we found for
decision accuracy. ‘Different’ decisions on vocal rhythms were only slightly (1.1 pp) better than
‘same’ decisions, whereas the corresponding difference for clapstick rhythms was 8.2 pp.
Assuming that ‘same’ and ‘different’ decisions are based on two different matching processes
(Bagnara et al., 1982; Keuss, 1977; Markman & Gentner, 2005), a fast, holistic one for ‘same’
and a slow, analytic one for ‘different’ decision, the two processes may involve different forms
of encodings that could explain both our accuracy and RT data. The fact that Keuss (1977) found
a same/different effect on RTs but not on accuracy suggests that same/different decisions in
(visual) letter and digit search tasks (for Keuss’ study) involve different processes than in
(auditory) rhythm comparison tasks (for the current study). An alternative explanation which
assumes only one matching process with rechecking for different decisions (Briggs & Johnson,
1973; Krueger, 1978), and operating on only one form of rhythm representations is not able to
account for our accuracy data (i.e. the significant interaction between TIMBRE and
CONGRUITY).
The experiment shows that, on average, clapstick rhythms led to significantly more
accurate responses (78.4%) than vocal rhythms (67.5%). The biggest advantage seemed to lie
with the clapstick/different stimuli, which were correctly identified 82.4% of the time, compared
to 74.5% for clapstick/same, and 67.2% and 67.7% for voice/same and different, respectively.
Though it might seem possible that this result is confounded by different levels of complexity of
the vocal and instrumental rhythms, Hung (2011) tested the relationship of rhythms consisting of
simple integer ratios and non-integer ratios intervals to the respective error rates using a χ2 test,
and found the relationship not significant (p = .53). In addition, the way the stimuli were chosen
(see section 1.2 Stimuli) eliminated any possible semantic influence to account for these
differences. A remaining plausible explanation is that the physical sound features of clapstick
and vocal rhythms could account for the difference in memorization. One potential variable that
may drive this difference, the filled versus unfilled nature of the stimuli, is tested in experiment
4.
An interesting question was highlighted during the post-experiment debriefing interviews: what
strategy did subjects use over the longer ISIs to retain these rhythms? Many subjects noted
that they used some form of silent, “vocal” repetition for the vocal rhythms, and some form of
non-vocal motor repetition (e.g. finger-tapping) for the clapstick rhythms. This was true for both
musicians and non-musicians. A likely candidate for this memorization process is the
articulatory loop (Baddeley & Hitch, 1994), and this is tested in the second experiment.
Experiment 2:
2.1.Introduction:
The articulatory loop is thought to offset memory trace degradation through repeated rehearsal of
stimuli (Baddeley & Hitch, 1994). If it is a mechanism for memorization of verbal information,
as Baddeley and Hitch suggest, then it may also be involved in processing timing information
produced by human vocal organs. Thus we might not expect clapstick rhythms to be ‘rehearsed’
via the articulatory loop as they are not produced in the vocal modality.
As noted in the discussion for experiment 1, multiple subjects reported rehearsal
strategies for the vocal rhythm task that qualitatively support the idea of the involvement of the
articulatory loop. Interestingly, however, they also reported an alternate strategy for the clapstick
rhythm task; namely they used non-vocal ‘rehearsal’ by tapping or other rhythmic movements,
which may implicate an alternate rehearsal strategy other than, or in addition to, the articulatory
loop. These subject reports seem to contrast with the findings of Saito and Ishio (1998), wherein
concurrent sub-vocal articulation was found to degrade performance on a rhythmic reproduction
task more strongly than either a concurrent drawing task or a control condition (no simultaneous motor task).
The rhythmic stimuli in that study were pure tones, however, and the possibility of differential
memory for vocal and non-vocal sounds was not considered. The use of articulatory suppression
has been shown to be a consistent method of occupying the limited space of the phonological
loop, thereby preventing participants from using the associated refresh mechanism and resulting
in the deterioration of stored material (Baddeley, 1975; Keller, Cowan & Saults, 1995; Salamé &
Baddeley, 1982). Concurrent finger-tapping has sometimes been used as a comparison control
for articulatory suppression — the two tasks are assumed by researchers to require roughly
equivalent attention and/or motor activation (e.g. Baddeley, Eldridge, & Lewis, 1981; Halliday,
Hitch, Lennon, & Pettipher, 1990; Papagno, Valentine, & Baddeley, 1991). However, because of
the differences we found for vocal and non-vocal rhythm processing, one might expect to see
differential effects from these concurrent motor tasks. Specifically, we would expect that
concurrent finger-tapping would degrade clapstick rhythm memory more than vocal rhythm
memory, and that sub-vocal articulation would degrade vocal rhythm memory more than
clapstick rhythm memory. Experiment 2 was designed to test these hypotheses.
2.2.Methods:
Participants:
Musical experience was determined as in experiment 1. Twelve non-musicians (5 female; avg.
age = 23 years, min. 19, max. 34) and 14 musicians (8 female; avg. age = 25 years, min. 21, max.
34) participated in the experiment and all reported normal hearing. Non-musicians had an
average of 0.3 years of formal music training (min. = 0, max. = 3), and musicians 12.8 years
(min. = 7, max. = 18). Three subjects (2 non-musicians) were unable to complete the entire
experiment and are not included in the analysis.
Stimuli:
The stimuli were those with the long ISI (12.5s) from experiment 1, again grouped into two
blocks of 18 and 19 pairs, respectively. These blocks were combined with the three distractor
tasks (control, i.e. no motor action; finger-tapping; repeated sub-vocal articulation of the syllable
‘the’), and the resulting six blocks were presented twice, once with the instruction to respond to
the vocal rhythm, and once to respond to the clapstick rhythm. Block presentation and
experimental tasks were balanced across subjects.
Equipment:
The equipment was the same as experiment 1.
Procedure:
The procedure for this experiment was the same as experiment 1 with two changes: subjects
were informed which DISTRACTOR task to perform during each block, and subjects used their
dominant hand for the tapping DISTRACTOR, and their non-dominant hand to push the
appropriate response buttons. The experimenter monitored participants’ lip and jaw movements
for the sub-vocal task and hand movement for the tapping task to ensure adequate performance
of the DISTRACTOR tasks, and reminded participants of proper procedure when necessary. The
experiment took approximately 2 hours to complete.
2.3.Results:
In addition to TRAINING, TIMBRE, and CONGRUITY as used in the first experiment, the
second and third experiments include DISTRACTOR with the three levels described above.
Reaction time:
A repeated measurement ANOVA with TRAINING as between- and the other three factors as
within-subject factors was performed on the RT data after exclusion of outliers and averaging
over factor combinations as in experiment 1 (table 3). TIMBRE and CONGRUITY were found
to be significant at the α = .05 level, but there was no significant effect for TRAINING or
DISTRACTOR. The mean difference between same (1.2s; C.I. 1.16 to 1.29) and different
CONGRUITY (2.15s; C.I. 2.10 to 2.24) was 0.95s (Cohen’s d = 2.32), and decisions on vocal
rhythms (1.63s; C.I. 1.53 to 1.73) were 0.1s faster (Cohen’s d = 0.37) than those on clapstick
rhythms (1.73s; C.I. 1.62 to 1.85).
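The Cohen’s d values reported here can be computed from the two sets of RT means; the following is a minimal sketch (the paper does not state which variance-pooling variant was used, so the classic pooled-standard-deviation form for two independent samples is assumed):

```python
import math

def cohens_d(xs, ys):
    """Cohen's d for two samples, using the pooled (unbiased) standard deviation."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # unbiased sample variances
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```

For within-subject factors such as CONGRUITY, a paired variant (standardizing the mean difference by the SD of the difference scores) may have been used instead.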
(Insert Table 3 here)
There was a significant 2-way interaction for DISTRACTOR and CONGRUITY
indicating that the ‘different’ decisions were more affected by the DISTRACTOR than were the
‘same’ decisions. A significant interaction was also found between TRAINING, DISTRACTOR
and CONGRUITY, indicating that musicians’ ‘different’ decisions were affected by the tapping
distractor, whereas non-musicians’ were not (see fig. 4).
Figure 4, Reaction time (RT) interaction plot for DISTRACTOR:CONGRUITY, split by TRAINING from experiment 2. Contr: control, sub-voc: sub-vocal articulation, tap: finger-tapping, d: ‘different’ rhythm, s: ‘same’ rhythm. Asterisk marks musicians’ mean for tapping that is significantly different from their control and sub-vocal means. Error bars represent 1 standard error.
Accuracy:
A generalized linear mixed effects analysis was performed on the mean accuracy data with
TRAINING, DISTRACTOR, TIMBRE, and CONGRUITY as fixed effects, and subjects as
random effects. A binomial error distribution and the logit link function were chosen for the
model. An ANOVA was performed on the minimal fitted model; the corresponding p values were
calculated using the Satterthwaite approximation, and the results are displayed in table 4.
(Insert Table 4 here)
All four main factors of the model as well as the interactions DISTRACTOR:TIMBRE,
DISTRACTOR:CONGRUITY, and TIMBRE:CONGRUITY were significant (fig. 5). Model
estimates showed that musicians gave more correct responses than non-musicians, with an effect
size (mean response difference) of 7.8 pp. Response rates to both the tap (-10.3 pp) and the sub-
vocal distractor (-11.4 pp) were lower than for the control, but there was no significant difference
between the two distractors. The correct response rate for vocal rhythms was 13.4 pp lower than
that for clapstick rhythms and same decisions were 9.1 pp better than different decisions.
The interaction DISTRACTOR:TIMBRE showed that in comparison to controls the
accuracy for vocal rhythms was less affected by tapping (-8.2 pp) or sub-vocal articulation (-9.3
pp) than the corresponding accuracy for clapstick rhythms (-12.5 pp and -13.5 pp, respectively).
The interaction DISTRACTOR:CONGRUITY was due to the fact that for controls the difference
between ‘same’ and ‘different’ decisions was 4.2 pp, whereas for tapping it was 14.0 pp and for
the sub-vocal distractor it was 9.2 pp. Finally, the TIMBRE:CONGRUITY interaction arose from
the fact that the difference between 'same' and 'different' decisions was larger for vocal rhythms
(15.9 pp) than for clapstick rhythms (2.5 pp).
Figure 5, A: Accuracy interaction plot for TIMBRE:CONGRUITY from experiment 2. Accuracy(%): percentage correct, clap: clapstick rhythm, vocal: vocal rhythm, d: ‘different’
rhythm, s: ‘same’ rhythm. B: Accuracy interaction plot for DISTRACTOR:CONGRUITY from experiment 2. Contr: control, sub-voc: sub-vocal articulation, tap: finger-tapping, d: ‘different’ rhythm, s: ‘same’ rhythm. Error bars represent 1 standard error.
We also calculated the response bias measure β in the same way as in experiment 1 to test
whether the results were produced or affected by changes in subjects’ bias to make ‘same’ or
‘different’ decisions. The grouped t tests for the main factor levels led to p-values ranging from
.39 to .09, none of them being significant. There was no indication that the results were
influenced by changes in the subjects’ response bias.
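The exact formulation of β used in experiment 1 is not reproduced in this excerpt; as an illustrative sketch, the standard signal-detection likelihood-ratio measure can be computed from hit and false-alarm rates as follows (treating, e.g., correct ‘different’ responses as hits and incorrect ‘different’ responses to ‘same’ pairs as false alarms is an assumption):

```python
from statistics import NormalDist
import math

def response_bias_beta(hit_rate, fa_rate):
    """Likelihood-ratio bias beta = exp((z_FA^2 - z_H^2) / 2)."""
    # z-transforms of the hit and false-alarm rates
    z_h = NormalDist().inv_cdf(hit_rate)
    z_fa = NormalDist().inv_cdf(fa_rate)
    return math.exp((z_fa ** 2 - z_h ** 2) / 2.0)
```

β = 1 indicates no bias; values below 1 indicate a liberal criterion, values above 1 a conservative one.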
2.4 Discussion:
As expected, the additional cognitive load of the concurrent motor tasks led to reduced
accuracy in same/different decisions (control: 76.9%, tapping: 66.7%, sub-vocal articulation:
65.8%). However, there was no significant difference between tapping and sub-vocal
articulation. This result does not support the hypothesis that the articulatory loop is involved in
differential memorization of vocal versus instrumental rhythms. It also contrasts with Saito and
Ishio’s (1998) results, which showed that a concurrent sub-vocal articulation task significantly
degraded a rhythmic reproduction task while a concurrent drawing task did not. Furthermore,
given the significant difference in accuracy for memory of vocal versus clapstick rhythms found
in experiment 1, we expected a particular type of interaction between the DISTRACTOR motor
tasks and the TIMBRE task: that a concurrent sub-vocal articulation task would interfere more
with memory for vocal rhythm than would a concurrent finger-tapping task, but that was not
what we found.
Although the hypothesis that the articulatory loop is generally involved in memorization
of rhythms was not supported, results were consistent with those of experiment 1. As expected,
clapstick decisions (76.3%) were more accurate than the voice decisions (63.1%) overall. In
addition, compared with the control, RT was significantly faster for vocal than for clapstick
rhythms for the two distractor tasks (0.095s for tapping and 0.15s for sub-vocal articulation),
suggesting that for dual-rhythm stimuli vocal information is processed faster if cognitive
resources are constrained by concurrent motor tasks. Furthermore, non-musicians (65.1%) were
slightly more affected by the concurrent tasks than musicians (72.9%).
In line with results from experiment 1, ‘same’ decisions were made faster and more
accurately (RT = 1.2s; 74.1%) than ‘different’ decisions (2.15s; 65.0%), adding further support
for the idea that different processes and representations may be involved in ‘same’ and
‘different’ decisions.
The two interactions of TIMBRE:CONGRUITY and DISTRACTOR:CONGRUITY
were a further indication that different processes may be involved in same/different decisions
(Keuss, 1977; Markman & Gentner, 2005), one fast, holistic, and the other slow and analytic: the
slow one is cognitively more demanding (longer RT) and therefore more affected by the
additional cognitive load of a distractor task.
This experiment provided further evidence supporting the idea of differential processing
of vocal versus instrumental rhythms. However, the introduction of concurrent tapping or sub-
vocal articulation did not yield a significant difference between these two tasks, suggesting that
the articulatory loop may not be involved in rhythm decision tasks. The following experiment
will explore the possibility of a role for the articulatory loop in the reproduction of memorized
rhythms.
Experiment 3:
3.1.Introduction:
The outcome of experiment 2 seems to stand in contrast with the study of Saito and Ishio (1998)
that indicated an involvement of the phonological loop in rhythm memory. Differences between
the two experiments that could explain the different results concern stimulus features and
experimental tasks. Saito and Ishio used empty rhythms, while we used empty and filled rhythms
(differences between empty and filled rhythms will be addressed in experiment 4 below), and
their participants had to reproduce rhythms whereas ours had to make decisions. Hence our third
experiment, in which a rhythm reproduction task is substituted for the same/different decision
task, will test the idea that the articulatory loop may be recruited for rhythm reproduction tasks.
3.2.Methods:
Participants:
Musical experience was determined as in experiment 1. Fourteen non-musicians (8 female; avg.
age = 24.4 years, min. 19, max. 34) and 12 musicians (5 female; avg. age = 27.27 years, min. 20,
max. 35) participated in this experiment. Non-musicians had an average of 2.1 years (min. 0,
max. 8) and musicians an average of 17.05 years of formal music training (min. 5, max. 28).
Stimuli:
Twenty-four stimuli that had received the highest scores were selected from the material used in
experiment 2. Each trial was set up as follows: a stimulus was followed by a silent delay interval
of 12.5s which ended with a short 1kHz “beep” as a signal for the participants to start tapping the
rhythm they had just heard. The 24 trials were split into two blocks of 12 trials. Each block was
repeated with every combination of DISTRACTOR and TIMBRE tasks. Block presentation and
experimental tasks were balanced across the subjects in order to reduce the possible influence of
practice effects.
Equipment:
The equipment was the same as experiment 1. In addition, participants tapped the response
rhythms with a small metal rod on a wooden tablet in front of the laptop with a built-in
microphone, and their responses were recorded by the presentation software (DMDX).
Procedure:
The procedure for this experiment was the same as experiment 2 but with the task changed to the
reproduction of the stimulus rhythms. After the experiment, which took approximately 90
minutes, participants were debriefed.
Data Analysis:
Responses were classified as correct if the number of response and stimulus events was identical,
and if the ratios of the response intervals deviated less than 20% from those of the stimulus
intervals.
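The scoring criterion can be sketched as follows (the paper does not specify how interval ratios were computed; normalization of each inter-onset interval by the total duration is an assumption):

```python
def is_correct(stim_onsets, resp_onsets, tol=0.20):
    """Score a reproduction: correct if the event count matches and each
    normalized inter-onset interval deviates less than `tol` (relative)
    from the corresponding stimulus interval."""
    if len(stim_onsets) != len(resp_onsets):
        return False
    # inter-onset intervals of stimulus and response
    stim_iois = [b - a for a, b in zip(stim_onsets, stim_onsets[1:])]
    resp_iois = [b - a for a, b in zip(resp_onsets, resp_onsets[1:])]
    stim_total, resp_total = sum(stim_iois), sum(resp_iois)
    stim_ratios = [i / stim_total for i in stim_iois]
    resp_ratios = [i / resp_total for i in resp_iois]
    return all(abs(r - s) / s <= tol for s, r in zip(stim_ratios, resp_ratios))
```

Note that ratio-based scoring makes the criterion tempo-invariant: a reproduction at half speed with the same relative timing counts as correct.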
3.3.Results:
A linear mixed effects analysis (see experiment 1) on the accuracy data was performed with
TRAINING, DISTRACTOR, and TIMBRE as fixed effects and subjects as random effects. The
minimal fitted model only contained the main factors as none of the interactions turned out to be
significant. ANOVA results for the model and model parameter estimates are shown in table 5.
(Insert Table 5 here)
The ANOVA confirmed that only the main factors were significant. Model parameter
estimates, their associated z-values, and the effect sizes (mean response differences) indicated the
following: musicians were correct 25 pp more often than non-musicians; the sub-vocal
DISTRACTOR affected the accuracy significantly (-9.7 pp), but the tapping DISTRACTOR was
not significantly different from the control; accuracy for the reproduction of clapstick rhythms was
41 pp better than that for vocal rhythms.
Figure 6, Accuracy interaction plot for DISTRACTOR:TIMBRE, split by TRAINING from experiment 3. Contr : control, sub-voc: sub-vocal articulation, tap: finger-tapping, voc: vocal rhythm, clap: clapstick rhythm. Error bars represent 1 standard error.
3.4.Discussion:
In this reproduction experiment we found that rhythm memory was significantly more negatively
affected by a concurrent sub-vocal articulation than by a tapping task. The larger effect of the
sub-vocal articulation task can be attributed to an involvement of the articulatory loop in short-
term memorization, and supports our initial hypothesis about such an involvement in a delayed
rhythm reproduction task. This is in agreement with the findings of Saito and Ishio (1998).
However, together with the results from our experiment 2, we arrive at a different interpretation
than these authors. It seems unlikely that the articulatory loop is generally involved in rhythm
memory: if rhythms are kept in memory for non-motor responses (comparison and/or decision) a
concurrent sub-vocal articulation was not found to have a stronger effect than a concurrent
tapping task (experiment 2). However, if rhythms are memorized for subsequent motor action,
concurrent sub-vocal articulation has a significantly stronger effect than concurrent finger-
tapping, suggesting the involvement of the articulatory loop. A likely explanation is that rhythm
memorization for action involves a different representation than memorization for comparison
and decision, e.g. a form of encoding that enables vocal articulation (or pronunciation) and
processing by the articulatory loop. Such an explanation is supported by recent studies (Coull et
al., 2010; Wiener et al., 2011) indicating different processing networks underlying perceptual
and motor timing. Whether these different representations or processes exist simultaneously,
being tapped on demand, or whether they are only created in response to the specific
experimental task, remains to be addressed in further research. Earlier studies on verbal short-
term memory (Levy, 1971; Morton, 1970), however, suggested coexisting dual acoustic and
articulatory encodings for short-term memory of verbal material, and seem to point towards the
first possibility.
In our experiment musicians (70.9% mean correct responses) were better than non-
musicians (45.2% mean correct responses). The difference between them (25.7 pp) was much
larger than in the decision experiment (7.8 pp). This could be due to two factors: First, the
training that is a criterion for musician status in our study likely means that non-musicians were
less able to perform a correct reproduction of the auditory stimulus. Second, training received by
the musicians likely means they were able to transform the auditory stimulus rhythm into a more
stable form for memorization. Superior memory in musicians has been suggested by Jakobson et
al. (2008) and Tervaniemi et al. (2001). Determining the relative role of these two factors –
performance and memory trace stability – in causing this difference will require further research.
Another difference between experiments 2 and 3 was that the difference between
clapstick and vocal rhythms (40.4 pp) was larger than in the decision experiment (13.4 pp).
Notably, there was a clear experimental task effect: clapstick rhythms were no more affected in
the reproduction than in the decision task (experiment 2), whereas the accuracy for vocal
rhythms was considerably reduced. In addition, reproductions of vocal rhythms were more affected
by the sub-vocal DISTRACTOR than those of clapstick rhythms. As there were no significant
interactions between TRAINING and TIMBRE, the results for the main factors can be
considered as further support for our hypothesis that clapstick and vocal rhythms have different
representations in memory, and suggest that the representation of the clapstick rhythms for
reproduction tasks and their representation for decision tasks may not be very different from each
other. This is in contrast to the representations of vocal rhythms, which seem to more strongly
rely on articulatory encoding in the reproduction task.
Experiments 2 and 3 support the idea that multiple forms of representation or encoding may be
involved in short-term rhythm memorization, and that they are activated and/or tapped
depending on whether the task requires motor action or comparison/decision. They also add
further evidence for different processing and representations of vocal and instrumental rhythms.
In the following experiment we test to what extent these differences can be explained by the
contrast between the ‘empty’ clapstick and the ‘filled’ vocal rhythms of our stimuli.
Experiment 4:
4.1. Introduction:
Vocal rhythms are formed by a complex combination of changes in pitch, duration, amplitude,
and timbre while clapstick rhythms are constituted by events that show minimal variation in
pitch and timbre. One of the conspicuous differences between the two is that vocal rhythms are a
series of seemingly continuous sounds whereas clapstick rhythms are a set of discontinuous
sounds. This difference has been described by the terms ‘filled’ and ‘empty’: for filled intervals
the end of one event coincides with the start of the following event, whereas for empty intervals
one event ends before the following event starts, i.e. events are separated by silence.
Differences in the perception of these two interval types were already described over a
hundred years ago. In his Principles of Psychology (1890), James described the ‘filled interval
illusion’, which refers to the phenomenon that filled intervals are perceived as longer than empty
intervals, even though the duration of both intervals is the same. Subsequently, experimenters
have found that the estimation of filled interval duration is more accurate than that of empty
intervals (Goldstone & Goldfarb, 1963; Rammsayer & Lima, 1991; Rammsayer & Skrandies,
1998). Though most research on duration perception of filled and empty intervals deals with
single intervals (but see: Repp & Bruttomesso, 2009), it seems possible that these effects will
also be found for interval sequences.
Here our hypothesis is that, if the difference in memory for vocal and clapstick rhythms is
due to the difference between empty and filled rhythms, then we should find no difference
between the memory for vocal and filled instrumental rhythms. We would, however, expect to
find a difference between empty and filled instrumental rhythms of about the same order as the
differences between vocal and clapstick rhythms in experiment 1.
4.2. Methods:
Participants:
Musical experience was determined as in experiment 1. Twenty-one participants with self-
reported normal hearing took part in this experiment. Ten non-musicians (5 female; avg. age =
27.1 years, min. 20, max. 45) had an average of 2.55 years of formal music training (min. 0, max.
10), and 11 musicians (6 female; avg. age = 28.27 years, min. 22, max. 34) had an average of
20.55 years of training (min. 10, max. 34).
Stimuli:
Stimuli were prepared as follows: The vocal rhythm set consisted of 14 sound clips from the
material used in the previous experiments. From each of these files we extracted the amplitude
envelope, and convolved it with a cello sound sample (cello model Charles Quenoil 1923;
sample from the University of Iowa Electronic Music Studio) to create the filled instrumental
rhythms. Several instrumental sounds were tried before settling on the cello as an adequate
stimulus choice. The cello sample was judged similar enough to the male voice in terms of pitch
and timbre in informal listening tests. Furthermore, the re-synthesis procedure – described next –
caused less obvious distortions using the cello sample than the other instrument samples tried.
Fourteen empty instrumental rhythms were formed by using 0.08s sound events created from the
same cello sound sample, with 0.03s rise time and 0.05s decay time to obtain amplitude
envelopes similar to those of the clapstick sounds of our earlier experiments. Their amplitude
peaks were aligned in time with the amplitude peaks of the filled rhythms, and the mean
amplitude of the empty rhythms was then adjusted to match that of the two other sets.
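The construction of the filled rhythms and of the empty-event envelopes can be sketched as follows (a simplified illustration: the sample rate, the envelope-smoothing window, linear rise/decay ramps, and peak normalization are assumptions not specified in the text):

```python
def envelope(x, win):
    """Amplitude envelope: rectify the signal, then moving-average over `win` samples."""
    rect = [abs(v) for v in x]
    return [sum(rect[max(0, i - win + 1): i + 1]) / len(rect[max(0, i - win + 1): i + 1])
            for i in range(len(rect))]

def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

def filled_rhythm(vocal, cello, win=441):
    """Impose the vocal amplitude envelope on a cello sample by convolution,
    then truncate to the original length and peak-normalize."""
    y = convolve(envelope(vocal, win), cello)[: len(vocal)]
    peak = max(abs(v) for v in y)
    return [v / peak for v in y]

def empty_event(sr=44100, rise=0.03, decay=0.05):
    """0.08 s event envelope: 0.03 s linear rise followed by 0.05 s linear decay."""
    n_up, n_down = int(rise * sr), int(decay * sr)
    up = [i / n_up for i in range(n_up)]
    down = [1.0 - i / (n_down - 1) for i in range(n_down)]
    return up + down
```

In practice such processing would be done with a DSP library on sampled audio; the pure-Python version above only illustrates the logic of the re-synthesis procedure.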
As for the previous experiments we created variants for each rhythm that differed in one
event by eliminating, adding, or changing the timing of one event. Types and locations of
changes within the files were evenly distributed across the sets. The resulting 84 rhythms (3*14
‘parents’ and 3*14 variants) were grouped into pairs in such a way that each of the parent
rhythms occurred once paired with itself and once with its variant. With these pairs we formed
two sets of 84 stimulus files, one for the immediate recall with an ISI of 0.5s between the two
rhythms, and one with an ISI of 15s for the delayed recall. Each of the two sets was then split
into two presentation blocks of 42 stimuli in such a way that rhythms paired with themselves and
with their variants did not occur in the same block.
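The pairing and blocking constraint described above can be sketched as follows (the actual randomization procedure is not specified; this is an illustrative implementation that counterbalances which block receives a parent’s ‘same’ pair):

```python
import random

def make_blocks(parents, variants, seed=0):
    """Pair each parent rhythm once with itself ('same') and once with its
    variant ('different'), splitting the pairs into two blocks so that a
    parent's two pairs never occur in the same block."""
    rng = random.Random(seed)
    block_a, block_b = [], []
    for p, v in zip(parents, variants):
        same_pair, diff_pair = (p, p), (p, v)
        if rng.random() < 0.5:  # counterbalance which block gets the 'same' pair
            block_a.append(same_pair)
            block_b.append(diff_pair)
        else:
            block_a.append(diff_pair)
            block_b.append(same_pair)
    rng.shuffle(block_a)
    rng.shuffle(block_b)
    return block_a, block_b
```

By construction, every parent contributes exactly one pair to each block, so a rhythm and its variant-based comparison are never heard within the same block.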
Procedure:
The procedure for this experiment was the same as experiment 1. The experiment took
approximately 50 minutes.
4.3. Results:
The fourth experiment includes an additional factor 'RHYTHM' with three levels denoting the
stimulus types used: vocal, “filled” instrumental, or “empty” instrumental.
Reaction time:
A repeated measurement ANOVA with factors TRAINING, RETENTION, RHYTHM, and
CONGRUITY was performed on the means of the subjects’ RT data. All main factors were
found to be significant (table 6), and there was a significant 2-way interaction for RHYTHM and
CONGRUITY.
(Insert Table 6 here)
The means for RHYTHM were 1.26s (vocal; C.I. 1.14 to 1.38), 1.39s (filled; C.I. 1.26 to
1.53), and 1.21s (empty; C.I. 1.12 to 1.30). Musicians (1.18s; C.I. 1.09 to 1.26) were 0.23s faster
(Cohen’s d = 0.34) than non-musicians (1.41s; C.I. 1.31 to 1.50) and RTs for the short
RETENTION (1.13s; C.I. 1.04 to 1.22) were 0.32s faster (Cohen’s d = 0.83) than for the long
RETENTION (1.45s; C.I. 1.36 to 1.54). The means of both participant groups showed
significantly longer RTs for delayed than for immediate recall. ‘Same’ decisions were made
0.78s faster (Cohen’s d = 2.76) than ‘different’ decisions. The interaction of
RHYTHM:CONGRUITY was caused by the filled instrumental rhythms showing longer RTs
than the other two rhythms in the ‘different’ condition, but not in the ‘same’ condition (see fig.
7A). For ‘different’ decisions, filled instrumental rhythms showed the longest RT and empty
rhythms showed the shortest RT in both RETENTION conditions and for both subject groups.
Figure 7, A: RT interaction plot for RHYTHM:CONGRUITY from experiment 4. B: RT interaction plot for RHYTHM :RETENTION from experiment 4. Emp: empty, fil: filled, voc: vocal rhythms, s: ‘same’ rhythm, d: ‘different’ rhythm, long: 15s RETENTION, short: 0.5s RETENTION. Error bars represent 1 standard error.
Accuracy:
A generalized linear mixed effects analysis with TRAINING, RETENTION, RHYTHM, and
CONGRUITY as fixed effects and subjects as random effects was performed following the
procedure outlined in exp.1. Results for the ANOVA and model parameter estimates of the
generalized linear mixed model are displayed in table 7.
(Insert Table 7 here)
ANOVA results showed that main factors TRAINING, RHYTHM and CONGRUITY, as well as
interactions RETENTION:RHYTHM, RETENTION:CONGRUITY, RHYTHM:CONGRUITY,
and the three-way interaction TRAINING:RHYTHM:CONGRUITY were significant (fig. 8).
Estimated standardized effect sizes for the model are shown in table 7, and the simple effect
sizes (mean response difference) for the responses indicated the following: musicians gave 10.2
pp more correct responses than non-musicians; correct response rates for filled RHYTHM were
28.9 pp lower than for empty, while the difference between empty and vocal was 2.8 pp; ‘same’
pairs were judged 12.1 pp more correctly than ‘different’ pairs. Vocal RHYTHM was 7.7 pp
more correct for the short than for the long retention span, while responses to filled RHYTHM
did not change significantly. Responses for ‘same pairs’ were correct 18.7 pp more often for the
short than the long RETENTION, whereas the rate change for ‘different’ pairs was not
significant. Response rates for both filled (+29.7 pp) and vocal RHYTHM (+12.9 pp) were better
in the ‘same’ than in the ‘different’ condition. Finally, the difference in correct responses for
‘same’ and ‘different’ pairs in non-musicians and musicians was significant, both for filled (n:
45.5 pp, m: 13.9 pp) and for vocal RHYTHM (n: 24.0 pp, m: 1.9 pp).
Figure 8, A: Accuracy interaction between RETENTION:CONGRUITY from experiment 4. Accuracy(%): percentage correct, long: 15s RETENTION, short: 0.5s RETENTION, s: ‘same’ rhythm, d: ‘different’ rhythm. B: Accuracy interaction plot for RHYTHM:CONGRUITY, split by TRAINING from experiment 4. Emp: empty, fil: filled, voc: vocal rhythms, s: ‘same’ rhythm, d: ‘different’ rhythm. Error bars represent 1 standard error.
4.4. Discussion:
The results of this experiment provide evidence for differential memory for vocal, filled and
empty instrumental rhythms in terms of both RT and accuracy. The participants showed the
shortest RT (1.21s) and the best accuracy (88.6%) for empty rhythms. In contrast, filled rhythms
had the longest RT (1.39s) and showed the lowest accuracy (58.1%). This supports one part of
our initial hypotheses about the difference between empty and filled rhythms. Given that the
empty rhythms were derived from the corresponding filled rhythms, these two sets differ only in
their amplitude envelopes. The results indicate that it was easier to distinguish events and extract
temporal information from empty than from filled rhythms. With spectral composition and pitch
being identical, the factor that seems to account most for this difference is the clear delineation
of event on- and offsets in the empty rhythms. This contrasts with results from studies on single
filled and empty intervals mentioned in 4.1, and indicates that, unlike duration perception (Repp
& Bruttomesso, 2009), accuracy data for single intervals do not generalize to rhythmic sequences
of these intervals.
The second part of our initial hypothesis was that if the memory differences between
vocal and clapstick rhythms were only due to the contrast between filled and empty sounds, we
should find a difference between vocal and empty rhythms but not between vocal and filled
rhythms. However, we found that the differences between vocal and empty rhythms (RT: 0.05s,
accuracy 2.8 pp) were much smaller than between filled and empty rhythms (0.18s, 28.4 pp) and
that the difference between filled and vocal rhythms (0.13s, 26.1 pp) was significant. The
differences in memory for vocal and filled instrumental rhythms can be explained by the
different sounds they used: vocal sounds contain pitch and spectral changes as additional cues for
the rhythm extraction that seemed to improve decision accuracy. At the same time, though vocal
sounds are more complex than the filled instrumental sounds, these additional features seem to
facilitate processing and lead to faster decisions. A main factor contributing to these differences
is probably the preference of the human auditory system for vocal sounds. Neuroanatomical
(Rauschecker, Tian, & Hauser, 1995; Wang, 2000), electrophysiological (Levy et al., 2003) and
imaging studies (Belin et al., 2000; Binder et al., 2000; Zatorre et al., 2002) indicate that the
auditory cortex includes functionally specialized regions that show response preferences for
conspecific vocalizations. The voice is the core medium of human communication, and this preference is manifested in our finding that complex voice sounds were processed faster than simpler ‘non-voice’ sounds. This is in line with speech perception
research showing that human voice sounds lead to faster and better identification than closely
modelled synthetic sounds (Hillenbrand & Nearey, 1999; ter Schure, Chládková, & van Leussen,
2011).
In this experiment, musicians showed significantly faster RT (-0.14s) and better accuracy
(+10.2 pp) than non-musicians. This suggests that musical training contributes positively to rhythm
memory. This is consistent with the results of the rhythm recognition tasks in experiments 1 and
2, and also with those of the reproduction task in experiment 3. The potential role of musical
training on memory for rhythm will be explored further in the general discussion.
The ‘same’ CONGRUITY condition showed faster RT (-0.72s) and better accuracy
(+12.1 pp) than the different condition. Similar results were also found in the first two
experiments, and suggest that, unlike for visuo-spatial stimuli (Keuss, 1977), same/different decisions on acoustic-temporal stimuli affect both the RTs and the accuracy of responses. The
results from these three experiments support the idea that different processes and representations
are involved in same/different decisions (Bagnara et al., 1982; Markman & Gentner, 2005), with
the slower RT for ‘different’ decisions probably being related to additional and different re-
check processes when detecting differences (see section 1.4).
The significant interaction of RETENTION and CONGRUITY reflects that accuracy in the same condition declined significantly at long RETENTION, whereas accuracy in the different condition improved slightly. One plausible explanation is
that different types of rhythm representation may be involved in same and different judgments.
Perceptual pre-categorical information may be used for the holistic and automatic same decisions
while categorical representation (e.g. music notation) may be used for the analytic and deliberate
different decisions. Perceptual representations would be derived from sensory memory, which
fades rapidly (Baddeley, 1990). Categorical representations, on the other hand, are better
memorized than sensory traces (Snyder, 2000) and could explain the sustained accuracy in the different
condition. Finally, the interactions between RHYTHM and CONGRUITY and between
TRAINING, RHYTHM and CONGRUITY suggest that the smaller difference between
same/different pairs in musicians may be related to their improved ability to form categorical
representations for rhythm due to their training and knowledge of music notation.
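As an aside on measurement: the same/different accuracies discussed above are reported in percentage points, which conflate sensitivity with response bias; a signal-detection analysis in the style of detection theory (MacMillan & Creelman, 2005) would separate the two. The following is an illustrative sketch only, using purely hypothetical hit and false-alarm rates rather than values from this study:

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    For a same/different task, a 'hit' is responding 'different' to a
    different pair; a 'false alarm' is responding 'different' to a same pair.
    Rates must lie strictly between 0 and 1 (apply a correction otherwise).
    """
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical rates for illustration only:
print(round(d_prime(0.85, 0.20), 2))  # sensitivity, independent of bias
```

A d'-style re-analysis could make the congruity effects reported here directly comparable across experiments with different response biases.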
General Discussion
Decay and Encoding
Two retention periods were used in experiments 1 and 4: immediate recall with an ISI of 0.5s and delayed recall with ISIs of 12.5s and 15s, respectively. Results from both experiments show that RT
increases and accuracy decreases with increasing retention span. Two factors that have been
shown to cause worsening memory performance, divided attention (Craik & Kester, 1999) and
interference (Nairne, 2002), do not seem to play a major role in our experiments. We would
expect attention to affect clapstick and vocal rhythms and same/different conditions in the same
way, whereas our results show different deterioration rates for the clapstick-vocal rhythms as
well as for same/different conditions. Additionally, Mercer and McKeown (2014) have shown
that refocusing attention in delayed decision experiments through the introduction of alert signals
does not improve memory performance; hence degraded attention does not seem to be a cause of the
observed decay. Though our study does not directly contrast interference and decay, retroactive
interference (McGeoch, 1932) can be excluded as our experiments 1 and 4 used neither
distractors nor interference from irrelevant stimuli during the retention period, and decisions on
each stimulus pair were made before a new pair was presented. Consequently, we take the results
from our experiments as evidence in favour of decay in short-term rhythm memory.
The idea that memory traces fade with time, however, has been strongly contested since
McGeoch (1932) proposed his interference theory, and almost all claims about decay have turned
out to be explainable with interference theory (Berman, Jonides, & Lewis, 2009; Nairne, 2002),
with a few notable exceptions. In a pitch matching decision task Harris (1952) found an
increasing decline in performance with increasing retention intervals, and no convincing
alternative explanation has so far been offered for his results. Mercer and McKeown (2014)
have recently demonstrated trace decay in a timbre matching and a pitch matching memory
experiment. Notably these studies, including the present one, are short-term memory tasks on
nonverbal stimuli: pitch, timbre, and rhythm (Harris, 1952; Mercer & McKeown, 2014). In
contrast, most studies that have failed to convincingly demonstrated memory trace decay have
focused on retention of verbal material and the multiple components of this mechanism; that is,
they have used stimuli that are susceptible to verbal coding and participants could easily engage
in rehearsals to maintain memory–a strategy less applicable to nonverbal stimuli.
Concerning the time course of decay, our conclusions are limited because we used only
two retention periods. Nonetheless, our results are compatible with those of Harris (1952) and
Mercer and McKeown (2014), identifying decay for a period of at least 15 to 30 seconds. This
contrasts with Baddeley’s (1990) hypothesis that decay is limited to the first few (ca. 5) seconds
of the retention period. These differences may have to do with the different stimulus material
considered (verbal vs non-verbal) and the different encodings involved. However, Mercer and
McKeown’s (2014) second experiment, using a masking paradigm, indicates that the decay
observed at longer retention periods is most likely due to more factors than simply the effect of
sensory memory fading.
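With only two retention periods, the shape of the decay curve cannot be determined, but a rough two-point estimate is possible if one assumes a particular functional form. The following is an illustrative sketch only, assuming exponential decay toward chance level and using purely hypothetical accuracies (not values reported in this study):

```python
import math

def decay_time_constant(t1, a1, t2, a2, chance=0.5):
    """Two-point estimate of tau for a(t) = chance + (a0 - chance) * exp(-t / tau).

    t1, t2: retention intervals in seconds; a1, a2: proportion correct.
    Assumes a1 > a2 > chance (monotonic decay toward chance level).
    """
    return (t2 - t1) / math.log((a1 - chance) / (a2 - chance))

# Hypothetical values: 85% correct after 0.5 s, 75% after 15 s,
# chance = 50% for a two-alternative (same/different) decision.
tau = decay_time_constant(0.5, 0.85, 15.0, 0.75)
print(round(tau, 1))  # estimated time constant in seconds
```

Such an estimate is only as good as the assumed functional form; distinguishing exponential from linear or multi-stage decay would require more than two retention intervals.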
Harris (1952) attempted to explain the deterioration of memory performance by appealing to the decline of stimulus-evoked neuronal activity. More recently, Jonides et al. (2008)
have proposed another model and preliminary evidence for the link between neuronal encoding
and decay, but the link between neural activity decline and memory performance was not very
strong and their model will no doubt undergo further testing and specification. Our study points
to yet another factor shaping memory decay, the form of encoding. As discussed above, different
decay rates were found for vocal and clapstick rhythms and also between musicians’ and non-
musicians’ responses. Results from our second experiment suggest that these differences are not
due to rehearsal, nor is it likely they can be attributed to other maintenance strategies like
refreshing (Raye, Johnson, Mitchell, Greene, & Johnson, 2007). We therefore suggest these
differences may be related to differences in pre-categorical (sensory memory) encoding and/or
timing differences of subsequent categorical encoding: the relatively uncomplicated temporal
features of clapstick sound sequences might lead to different sensory encodings and allow for
more rapid categorical recoding, thus offering more stable memory traces, than the complex
vocal sound sequences. Similarly, as discussed earlier, the training of musicians that allows for
different (Aleman et al., 2000) and faster (Kraus & Chandrasekaran, 2010) categorical recoding
may contribute to their better and more stable memory performance. One plausible explanation
would be that the decay rates reported in the present study were largely shaped by the degree and
extent to which pre-categorical sensory and categorical encodings are involved in the memory
processes.
Musical Training and Memory for Rhythm
Throughout all four experiments in this study, musicians were more accurate than non-musicians
in memory for rhythm tasks. This difference varied according to the task performed, and was
much larger when the participants were asked to reproduce the rhythm (25.7 pp) than when they
were asked to recognize a difference in the rhythm (7.7 - 10.2 pp). The reproduction of a rhythm relies both on the extraction of and memory for the rhythm – our main interests – and on the ability to accurately execute a rhythm. Musical training likely confers a general advantage on the latter, and it is possible that this explains part of the difference between the reproduction
and recognition experiments. Clearly, however, in our study the musicians were better able to
extract and remember the clapstick and vocal rhythms we employed.
On the one hand, musical training seems to contribute to faster extraction of timing
information (Kraus & Chandrasekaran, 2010) and also improves auditory processing in some
brain areas (i.e. planum temporale and dorso-lateral prefrontal cortex) that are also used in
language processing (Franklin et al., 2008; Ohnishi et al., 2001). On the other hand, their training
may enable professional musicians to employ different forms of mental representation of
rhythms than non-musicians (Aleman, Nieuwenstein, Böcker, & de Haan, 2000; Schaal et al.,
2015; Palmer & Krumhansl, 1990). Because they have been trained to transcribe rhythms into musical notation, musicians may use visuospatial representations to memorize rhythms. This symbolic system may be effective for extracting rhythms quickly and storing them in more stable, easier-to-recall representations (Brodsky, Kessler, Rubinstein, Ginsborg, & Henik, 2008). During the participant debriefing after
experiment 4, several musicians reported that they had combined instrumental rhythms with
images of cello performances in order to extract and repeat the rhythm. This suggests that
musicians may also transform instrumental sounds to either visual images or bodily movements
that support or improve the retention of rhythmic patterns and points towards an interesting area
for further research.
Differences Between Vocal and Instrumental Rhythm Processing
What are the factors that shape the differential processing of our two rhythm types? We have
already discussed above that attention does not seem to account for a significant part of these
differences, though it is likely to play a general modulatory role on subjects’ responses. By
having the two experimental tasks performed on identical sets of stimuli (experiments 1, 2, and
4), we effectively equated task-related attentional demands between them. Familiarity, another
potential confound, also seems to be an unlikely explanation. Although the musical components
of our stimuli were not uncommon, the music excerpts were taken from a music and language
culture none of the subjects was familiar with or had ever heard before. Rather, the present study
hints at three different factors contributing to the differential processing of vocal and
instrumental rhythms.
The first one has already been referred to above and concerns the acoustical stimulus
features pertinent for a hypothesized rhythm detection and extraction (encoding) process (e.g.
Jones, 1976; Jones & Boltz, 1989; Large & Jones, 1999). The temporal features relevant for this
process appear easier to obtain from clapsticks than from vocal signals, perhaps because the
former have minimal variation in their spectral and pitch components. The task of obtaining the
temporal information from clapsticks could basically be reduced to a determination of the
amplitude peak or onset sequences. Extraction of vocal rhythm, on the other hand, requires
simultaneous consideration of multiple features, the identification of relevant changes in
dynamics, pitch, and spectra. This factor probably explains at least part of the accuracy
differences for vocal and clapstick rhythms in the present study. Music makes use of a wide
range of instrumental rhythms, from clapstick-like percussion rhythms to more voice-like
rhythms of bowed string instruments. The degree to which these are processed differently from
vocal rhythms needs to be assessed in further studies comparing a wider range of instrumental
and vocal rhythms. However, it is unlikely that the complexity issue can account for the full
range of differences identified here because there are other features that distinguish vocal from
instrumental sound sequences (see Vouloumanos et al., 2001 for a similar argument concerning
processing differences between speech and non-speech sounds). For example, the human voice
has a typical spectral energy distribution easily distinguished from most musical instruments.
Furthermore, there is a relative independence of fundamental frequency of the voice and the
resonance characteristic of the vocal tract that allows for the generation of sound sequences
impossible to produce on musical instruments (Fant, 1960). These features, and the implication
they have for cognitive processing, lead us to the next point.
The second factor refers to the findings of various studies, showing that the human brain
possesses specializations for the processing of human vocal sounds (Belin et al., 2000; Bent et
al., 2006; Levy et al., 2001, 2003; Vouloumanos et al., 2001; Zatorre et al., 2002) and vocal
rhythms (Hung, 2011). These specializations can be understood as neuronal and cognitive
adaptations to the acoustic complexity and vital biological role of vocal sounds in intra-species
communication (Wang, 2000). For humans, quickly identifying sounds as vocal and processing
them as speech sounds can be crucial in social contexts, and it is something humans perform
effortlessly and automatically. In line with this, all imaging studies mentioned here found
stronger and partly different patterns of brain activation for processing of human voice than for
non-voice sounds. Theories of speech processing like the duplex theory (Whalen & Liberman,
1987) explain this specialization by hypothesizing different parallel processors operating on the
auditory input. This specialization, and the processing preference for the human voice that may ensue from it, helps to explain the faster reaction times to vocal rhythms, despite their greater complexity,
when reaction times for single rhythms are compared (experiment 4) or when dual rhythms are
processed under additional cognitive load by a distractor task (experiment 2).
The third and final aspect we want to address here connects with the previous one.
Research on sensory motor integration in the auditory system (Pa & Hickok, 2008; Wang, 2000)
suggests that acoustical differences between vocal and non-vocal sounds and cortical processing
adaptations may not be the only factors to explain different processing of the respective rhythms.
Rather, as vocal and instrumental rhythms are produced via different motor-effectors in the body,
their processing leads to specific sensory-motor activations of different neural substrates and
different associated encodings (see also: Wang, 2000). Such distinct forms of sensory-motor
representation are, therefore, further likely candidates to explain processing differences between
vocal and instrumental rhythms. The contrast between decision and reproduction (experiments 2
and 3) provides the first hints at how encoding differences may be linked with different sensory-
motor activations, and this connection deserves further explorations in future studies employing
different methods.
Differential memory processing of vocal and instrumental rhythms, as discussed here,
indicates that short-term memorization of temporal information is a multi-component process in which the selection and operation of components are influenced by the required task, participant strategies, and attention, as well as by stimulus content and context. This is consistent with recently
formulated models of time processing (Coull et al., 2010; Ivry & Schlerf, 2008; Wiener et al.,
2011) as comprising multiple processing networks in which the actual processes employed are
task and context dependent, and these processes can furthermore be modulated by factors like
attention. Our results are not only compatible with these multi-process models but also seem to
offer additional support for them from rhythm memory research.
Implications for Musical Rhythm Research
The identification of processing differences and the evidence for multiple, different forms of
encoding of vocal and instrumental rhythms challenge the idea of musical rhythm as an abstract
feature of sound sequences: rhythm processing appears to have an important perceptual
component. How we process and experience rhythms is influenced by the specific sounds that
form those rhythms – a characteristic that has been largely ignored by musical rhythm research.
Results of the current study are one line of supporting evidence for the non-unitary nature and
diverse evolutionary origins of music. The distinction between vocal and instrumental rhythm
may be a reflection of their different origins in relation to the human body – one produced
actively inside the body, the other through limb action on external objects – as well as their
different significance in human interaction and communication. In his comparative approach,
combining cross-cultural, intra-specific and inter-specific components, Fitch (2006) emphasized
that what is generally called the ‘music faculty’ actually consists of various components that may
have very different evolutionary histories, and that talking about ‘music’ as a unitary
phenomenon risks obscuring these histories and preventing an understanding of the origins and development of music. He proposed a multi-component view in which vocal and instrumental
music are the central components, and he discussed various lines of evidence – from the design
features of music and language to the evolution of analogous and homologous behavioural traits
– in favour of this view. The existence of brain specializations that lead to differential processing
of vocal and non-vocal sounds, of vocal and non-vocal melodic contours, and vocal and
instrumental rhythms constitute strong support for such a view.
References
Aleman, A., Nieuwenstein, M. R., Böcker, K. B., & de Haan, E. H. (2000). Music Training and
Mental Imagery Ability. Neuropsychologia, 38(12), 1664-1668. doi:10.1016/s0028-
3932(00)00079-8
Alvarez, C.J., Cottrell, D., & Afonso, O. (2009). Writing dictated words and picture names. Applied
Psycholinguistics, 30, 205-223. doi:10.1017/s0142716409090092
Baddeley, A.D. (1990). Human Memory: Theory and Practice. Oxford: Oxford University Press.
Baddeley, A.D. (2010). Working Memory. Current Biology, 20(4), R136–R140.
doi:10.1016/j.cub.2009.12.014
Baddeley, A., Eldridge, M., & Lewis, V. (1981). The role of subvocalisation in reading. The
Quarterly Journal of Experimental Psychology Section A, 33(4), 439–454.
doi:10.1080/14640748108400802
Baddeley, A.D., & Hitch, G. J. (1994). Developments in the Concept of Working Memory.
Neuropsychology, 8, 485-493. doi:10.1037/0894-4105.8.4.485
Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word Length and the Structure of
Short-Term Memory. Journal of Verbal Learning and Verbal Behavior, 14(6), 575-589.
doi:10.1016/s0022-5371(75)80045-4
Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of
Psychology, 100, 603-617. doi:10.1348/000712608X377117
Bagnara, S., Boles, D.B., Simion, F. & Umiltà, C. (1982) Can an analytic/holistic dichotomy
explain hemispheric asymmetries? Cortex, 18, 67-78. doi:10.1016/s0010-9452(82)80019-
1
Bates, D. M. (2010). lme4: Mixed-Effects Modeling with R. New York: Springer. Prepublication
version at: http://lme4.r-forge.r-project.org/book/
Bates, D., Maechler, M., Bolker, B. & Walker, S. (2014). lme4: Linear mixed-effects models
using Eigen and S4. R package version 1.1-7, http://CRAN.R-project.org/package=lme4.
Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., & Pike, B. (2000). Voice-Selective Areas
in Human Auditory Cortex. Nature, 403(6767), 309-312. doi:10.1038/35002078
Bent, T., Bradlow, A. R., & Wright, B. A. (2006). The Influence of Linguistic
Experience on the Cognitive Processing of Pitch in Speech and Non-Speech Sounds.
Journal of Experimental Psychology: Human Perception and Performance, 32(1), 97-
103. doi:10.1037/0096-1523.32.1.97
Berman, M.G., Jonides, J., & Lewis, R.L. (2009). In Search of Decay in Verbal Short Term
Memory. Journal of Experimental Psychology: Learning, Memory and Cognition, 35(2),
317-333. doi:10.1037/a0014873
Binder, J. R., Frost, J. A., Hammeke, T. A., Bellgowan, P. S. F., Springer, J. A., Kaufman, J. N.,
& Possing, E. T. (2000). Human Temporal Lobe Activation by Speech and Nonspeech
Sounds. Cerebral Cortex, 10(5), 512 -528. doi:10.1093/cercor/10.5.512
Briggs, G. E., & Johnson, A. M. (1973). On the Nature of Central Processing in Choice
Reactions. Memory & Cognition, 1(1), 91-100. doi:10.3758/bf03198076
Brodsky, W., Kessler, Y., Rubinstein, B. S., Ginsborg, J., & Henik, A. (2008). The Mental
Representation of Music Notation: Notational Audiation. Journal of Experimental
Psychology: Human Perception and Performance, 34(2), 427. doi:10.1037/0096-
1523.34.2.427
Coull, J.T., Cheng, R.K. & Meck, W.H. (2010). Neuroanatomical and Neurochemical Substrates
of Timing. Neuropsychopharmacology Reviews. 36(1), 3-25. doi:10.1038/npp.2010.113
Cowan, N. (1984). On Short and Long Auditory Stores. Psychological Bulletin, 96(2), 341-
370. doi:10.1037//0033-2909.96.2.341
Cowan, N. (1997). Attention and Memory: An Integrated Framework. New York: Oxford
Psychology Series.
Cowan, N. (2008). What are the Differences Between Long-term, Short-term, and Working
Memory? Progress in Brain Research, 169, 323-338. doi:10.1016/s0079-6123(07)00020-
9
Craik, F.I.M. & Kester, J.D. (1999). Divided Attention and Memory: Impairment of Processing
or Consolidation? In E. Tulving (Ed.), Memory, consciousness, and the brain: The Tallin
conference. Philadelphia: Psychology Press. pp. 38-51.
Deutsch, D. (1986). Recognition of Durations Embedded in Temporal Patterns. Perception &
Psychophysics, 39(3), 179-186. doi:10.3758/bf03212489
Dixon, R.M.W. & Koch, G. (1996). Dyirbal Song Poetry: Traditional Songs of an Australian
Rainforest People [CD]. Mascot, N.S.W.: Larrikin.
Fant, G. (1960) Acoustic Theory of Speech Production. The Hague (The Netherlands): Mouton.
Finnigan, S., Humphreys, M.S., Dennis, S., & Geffen, G. (2002). ERP ‘old/new‘ effects:
memory strength and decisional factor(s). Neuropsychologia, 40, 2288-2304.
doi:10.1016/s0028-3932(02)00113-6
Fitch, W.T. (2006). The Biology and Evolution of Music: a Comparative Perspective.
Cognition, 100(1), 173-215. doi:10.1016/j.cognition.2005.11.009
Forster, K. I., & Forster, J. C. (2003). DMDX: A Windows Display Program with Millisecond
Accuracy. Behavior Research Methods, 35(1), 116–124. doi:10.3758/bf03195503
Franklin, M. S., Moore, K. S., Yip, C.-Y., Jonides, J., Rattray, K., & Moher, J. (2008). The
Effects of Musical Training on Verbal Memory. Psychology of Music, 36( 3), 353-365.
doi:10.1177/0305735607086044
Gaser, C., & Schlaug, G. (2003). Brain Structures Differ between Musicians and Non-
Musicians. The Journal of Neuroscience, 23(27), 9240-9245.
Glenberg, A.M. & Jona, M. (1991). Temporal Coding in Rhythm Tasks Revealed by Modality
Effects. Memory and Cognition, 19 (5), 514-522. doi:10.3758/bf03199576
Goldstone, S., & Goldfarb, J. L. (1963). Judgment of Filled and Unfilled Durations: Intersensory
Factors. Perceptual and Motor Skills, 17(3), 763-774. doi:10.2466/pms.1963.17.3.763
Halliday, M. S., Hitch, G. J., Lennon, B., & Pettipher, C. (1990). Verbal short-term memory in
children: The role of the articulator loop. European Journal of Cognitive Psychology,
2(1), 23–38. doi:10.1080/09541449008406195
Harris, J. D. (1952). The Decline of Pitch Discrimination with Time. Journal of Experimental
Psychology, 43(2), 96–99. doi:10.1037/h0057373
Hillenbrand, J. M., & Nearey, T. M. (1999). Identification of Resynthesized /hVd/ Utterances:
Effects of Formant Contour. The Journal of the Acoustical Society of America, 105(6),
3509-3523. doi:10.1121/1.424676
Hung, T. H. (2011). One music? Two musics? How many musics? Cognitive
Ethnomusicological, Behavioral, and fMRI Study on Vocal and Instrumental Rhythm
Processing (Doctoral dissertation). The Ohio State University, Columbus OH.
http://rave.ohiolink.edu/etdc/view?acc_num=osu1308317619
Ivry, R.B. & Schlerf, J.E. (2008). Dedicated and intrinsic models of time perception. Trends in
Cognitive Sciences 12: 273–280. doi:10.1016/j.tics.2008.04.002
Jakobson, L. S., Lewycky, S. T., Kilgour, A. R., & Stoesz, B. M. (2008). Memory for Verbal and
Visual Material in Highly Trained Musicians. Music Perception: An Interdisciplinary
Journal, 26(1), 41-55. doi:10.1525/mp.2008.26.1.41
James, W. (1890). The Principles of Psychology (Vols. 1-2). New York: Holt.
Johns, B.T., Jones, M.N., & Mewhort, D.J.K. (2012). A synchronization account of false
recognition. Cognitive Psychology, 65(4), 486-518. doi:10.1016/j.cogpsych.2012.07.002
Jones, M. R. (1976). Time, our lost dimension: Toward a new theory of perception, attention,
and memory. Psychological Review, 83(5), 323–355. doi:10.1037/0033-295x.83.5.323
Jones, M. R., & Boltz, M. (1989). Dynamic attending and responses to time. Psychological
Review, 96(3), 459–491. doi:10.1037/0033-295x.96.3.459
Jonides, J., Lewis, R.L., Nee, D.E., Lustig, C.A., Berman, M.G. & Moore, K.S. (2008). The
Mind and Brain of Short-Term Memory. Annual Review of Psychology, 59, 193-224.
doi:10.1146/annurev.psych.59.103006.093615
Keller, T. A., Cowan, N., & Saults, J. S. (1995). Can Auditory Memory for Tone Pitch be
Rehearsed? Journal of Experimental Psychology: Learning, Memory, and Cognition,
21(3), 635. doi:10.1037//0278-7393.21.3.635
Keuss, P.I.G. (1977). Processing of Geometrical Dimensions in a Binary Classification Task:
Evidence for a Dual Process Model. Perception & Psychophysics, 21(4), 371-376.
doi:10.3758/bf03199489
Kilgour, A. R., Jakobson, L. S., & Cuddy, L. L. (2000). Music Training and Rate of Presentation
as Mediators of Text and Song Recall. Memory & Cognition, 28(5), 700-710.
doi:10.3758/bf03198404
Koelsch, S., Schröger, E., & Tervaniemi, M. (1999). Superior Pre-Attentive Auditory Processing
in Musicians. Neuroreport, 10(6), 1309-1313. doi:10.1097/00001756-199904260-00029
Kraus, N., & Chandrasekaran, B. (2010). Music Training for the Development of Auditory
Skills. Nature Reviews Neuroscience, 11(8), 599-605. doi:10.1038/nrn2882
Krueger, L.E. (1978). A Theory of Perceptual Matching. Psychological Review, 85(4), 278-
304. doi:10.1037//0033-295x.85.4.278
Kuznetsova, A., Brockhoff, P.B., & Christensen, R.H.B. (2014). lmerTest: Tests for random and
fixed effects for linear mixed effect models (lmer objects of lme4 package). R package
version 2.0-11. http://CRAN.R-project.org/package=lmerTest
Large, E. W., & Jones, M. R. (1999). The dynamics of attending: How people track time-varying
events. Psychological Review, 106(1), 119–159. doi:10.1037/0033-295x.106.1.119
Levy, B.A. (1971). Role of Articulation in Auditory and Visual Short-Term Memory. Journal of
Verbal Learning and Verbal Behavior, 10(2), 123-132. doi:10.1016/s0022-
5371(71)80003-8
Levy, D., Granot, R., & Bentin, S. (2001). Processing Specificity for Human Voice
Stimuli: Electrophysiological Evidence. NeuroReport, 12(12), 2653-2657.
doi:10.1097/00001756-200108280-00013
Levy, D., Granot, R., & Bentin, S. (2003). Neural Sensitivity to Human Voices: ERP Evidence
of Task and Attentional Influences. Psychophysiology, 40(2), 291–305.
doi:10.1111/1469-8986.00031
MacMillan, N., & Creelman, D. (2005). Detection Theory: A User’s Guide (2nd ed.). Mahwah,
New Jersey: Lawrence Erlbaum Associates, Inc.
Malmberg, K.J., & Xu, J. (2007). On the flexibility and on the fallibility of associative
memory. Memory & Cognition, 35(3), 545–556. doi:10.3758/bf03193293
Markman, A.B., & Gentner, D. (2005). Nonintentional Similarity Processing. In R.
Hassin, J.A. Bargh, & J.S. Uleman (Eds.) The New Unconscious (pp. 107-137). New
York: Oxford University Press. doi:10.1093/acprof:oso/9780195307696.003.0006
Mathias, S., Micheyl, C., & Shinn-Cunningham, B. (2014). Gradual Decay of Auditory Short-
Term Memory. Journal of the Acoustical Society of America, 135(4.2), 2412.
doi:10.1121/1.4878005
McGeoch, J. (1932). Forgetting and the Law of Disuse. Psychological Review, 39(4), 352-370.
doi:10.1037/h0069819
Mercer, T., & McKeown, D. (2014). Decay Uncovered in Nonverbal Short-Term Memory.
Psychonomic Bulletin & Review, 21(1), 128–135. doi:10.3758/s13423-013-0472-6
Mewhort, D. J. K., & Johns, E. E. (2005). Sharpening the echo: An iterative-resonance model for
short-term recognition memory. Memory, 13, 300–307. doi:10.1080/09658210344000242
Morton, J. (1970). A Functional Model of Memory. In D. A. Norman (Ed.), Models of Human
Memory (pp. 203-254). New York: Academic Press. doi:10.1016/b978-0-12-521350-
9.50012-7
Musacchia, G., Sams, M., Skoe, E., & Kraus, N. (2007). Musicians have Enhanced Subcortical
Auditory and Audiovisual Processing of Speech and Music. Proceedings of the National
Academy of Sciences, 104(40), 15894 -15898. doi:10.1073/pnas.0701498104
Musacchia, G., Strait, D., & Kraus, N. (2008). Relationships Between Behavior, Brainstem and
Cortical Encoding of Seen and Heard Speech in Musicians and Non-Musicians. Hearing
Research, 241(1-2), 34-42. doi:10.1016/j.heares.2008.04.013
Nairne, J.S. (2002). Remembering Over the Short-Term: The Case Against the Standard Model.
Annual Review of Psychology, 53(1), 53-81.
doi:10.1146/annurev.psych.53.100901.135131
Neath, I., & Surprenant, A. M. (2003). Human Memory: an Introduction to Research, Data, and
Theory (2nd ed.). Thomson/Wadsworth.
Neisser, U. (1967). Cognitive Psychology. East Norwalk, CT: Appleton-Century-Crofts.
Ohnishi, T., Matsuda, H., Asada, T., Aruga, M., Hirakata, M., Nishikawa, M., Katoh, A. &
Imabayashi, E. (2001). Functional Anatomy of Musical Perception in Musicians.
Cerebral Cortex, 11(8), 754–760. doi:10.1093/cercor/11.8.754
Pa, J. & Hickok, G. (2008). A Parietal-Temporal Sensory-Motor Integration Area for the Human
Vocal Tract: Evidence from an fMRI Study of Skilled Musicians. Neuropsychologia,
46(1), 362-368. doi: 10.1016/j.neuropsychologia.2007.06.024
Parbery-Clark, A., Skoe, E., Lam, C., & Kraus, N. (2009). Musician Enhancement for Speech-
In-Noise. Ear and Hearing, 30(6), 653-661. doi:10.1097/aud.0b013e3181b412e9
Palmer, C., & Krumhansl, C. L. (1990). Mental Representations for Musical Meter. Journal of
Experimental Psychology: Human Perception and Performance, 16(4), 728.
doi:10.1037//0096-1523.16.4.728
Papagno, C., Valentine, T., & Baddeley, A. (1991). Phonological Short-Term Memory and
Foreign-Language Vocabulary Learning. Journal of Memory and Language, 30, 331-347.
doi:10.1016/0749-596x(91)90040-q
Poss, N. F. (2012). Hmong Music and Language Cognition: An Interdisciplinary Investigation
(Doctoral dissertation). The Ohio State University, Columbus OH.
http://rave.ohiolink.edu/etdc/view?acc_num=osu1332472729
Poss, N., Hung, T.H., & Will, U. (2008). The Effects of Tonal Information on Lexical
Activation in Mandarin Speakers. In Proceedings of the 20th North American Conference
on Chinese Linguistics, vol. 1 (NACCL-20, Columbus, OH: The Ohio State University,
2008), 205-211.
Povel, D.J. (1981). Internal Representation of Simple Temporal Patterns. Journal of
Experimental Psychology: Human Perception & Performance, 7(1), 3-18.
doi:10.1037//0096-1523.7.1.3
Povel, D.J., & Essens, P. (1985). Perception of Temporal Patterns. Music Perception, 2(4),
411-440. doi:10.2307/40285311
Proctor, R.W. (1981). A Unified Theory for Matching-Task Phenomena. Psychological Review,
88(4), 291-326. doi:10.1037//0033-295x.88.4.291
R Development Core Team (2010). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
http://www.R-project.org.
Rammsayer, T. H., & Lima, S. D. (1991). Duration Discrimination of Filled and Empty Auditory
Intervals: Cognitive and Perceptual Factors. Perception & Psychophysics, 50(6), 565-
574. doi:10.3758/bf03207541
Rammsayer, T. H., & Skrandies, W. (1998). Stimulus Characteristics and Temporal Information
Processing: Psychophysical and Electrophysiological Data. Journal of Psychophysiology,
12(1), 1-12.
Rauschecker, J. P., Tian, B., & Hauser, M. (1995). Processing of Complex Sounds in the
Macaque Nonprimary Auditory Cortex. Science, 268(5207), 111-114.
doi:10.1126/science.7701330
Raye, C. L., Johnson, M. K., Mitchell, K. J., Greene, E. J., & Johnson, M. R. (2007). Refreshing:
A Minimal Executive Function. Cortex, 43(1), 135-145.
doi:10.1016/s0010-9452(08)70451-9
Repp, B. H., & Bruttomesso, M. (2009). A Filled Duration Illusion in Music: Effects of Metrical
Subdivision on the Perception and Production of Beat Tempo. Advances in Cognitive
Psychology, 5, 114. doi:10.2478/v10053-008-0071-7
Saito, S., & Ishio, A. (1998). Rhythmic Information in Working Memory: Effects of Concurrent
Articulation on Reproduction of Rhythms. Japanese Psychological Research, 40(1), 10–
18. doi:10.1111/1468-5884.00070
Salamé, P., & Baddeley, A. (1982). Disruption of Short-Term Memory by Unattended Speech:
Implications for the Structure of Working Memory. Journal of Verbal Learning and
Verbal Behavior, 21(2), 150-164.
doi:10.1016/s0022-5371(82)90521-7
Schaal, N. K., Banissy, M. J., & Lange, K. (2014). The Rhythm Span Task: Comparing Memory
Capacity for Musical Rhythms in Musicians and Non-Musicians. Journal of New Music
Research, 44(1), 3–10. doi:10.1080/09298215.2014.937724
Snyder, B. (2000). Music and Memory: An Introduction. MIT Press.
ter Schure, S., Chládková, K., & van Leussen, J. W. (2011). Comparing Identification of
Artificial and Natural Vowels. Paper presented at the 17th International Congress of
Phonetic Sciences, Hong Kong. Retrieved from http://dare.uva.nl/document/2/101523
Tervaniemi, M., Rytkönen, M., Schröger, E., Ilmoniemi, R. J., & Näätänen, R. (2001). Superior
Formation of Cortical Memory Traces for Melodic Patterns in Musicians. Learning &
Memory, 8(5), 295-300. doi:10.1101/lm.39501
University of Iowa Electronic Music Studio. (2012). Cello. Retrieved from
http://theremin.music.uiowa.edu/MIScello.html
Vouloumanos, A., Kiehl, K.A., Werker, J.F., & Liddle, P. F. (2001). Detection of Sounds in the
Auditory Stream: Event-Related fMRI Evidence for Differential Activation to Speech
and Nonspeech. Journal of Cognitive Neuroscience, 13(7), 994-1005.
doi:10.1162/089892901753165890
Wang X. (2000). On Cortical Coding of Vocal Communication Sounds in Primates. Proceedings
of the National Academy of Sciences, 97(22), 11843–11849.
doi:10.1073/pnas.97.22.11843
Whalen, D. H., & Liberman, A. M. (1987). Speech Perception Takes Precedence over
Nonspeech Perception. Science, 237(4811), 169–171.
doi:10.1126/science.3603014
Wiener, M., Matell, M.S., & Coslett, H.B. (2011). Multiple Mechanisms for Temporal
Processing. Frontiers in Integrative Neuroscience, 5(31).
doi:10.3389/fnint.2011.00031
Will, U., Nottbusch, G., & Weingarten, R. (2006). Linguistic Units in Word Typing. Written
Language & Literacy, 9(1), 153-176. doi:10.1075/wll.9.1.10wil
Wright, D.B., Horry, R., & Skagerberg, E.M. (2009). Functions for Traditional and Multilevel
Approaches to Signal Detection Theory. Behavior Research Methods, 41(2), 257-267.
doi:10.3758/BRM.41.2.257
Zatorre, R.J. (1998). How Do Our Brains Analyze Temporal Structure in Sound? Nature
Neuroscience, 1(5), 343-345.
Zatorre, R.J., Belin, P., & Penhune, V.B. (2002). Structure and Function of Auditory Cortex:
Music and Speech. Trends in Cognitive Sciences, 6(1), 37-45.
doi:10.1016/s1364-6613(00)01816-7
Table 1: ANOVA summary of RT measurements from experiment 1, with between-subject
factor TRAINING (TRAIN) and within-subject factors RETENTION (RET), TIMBRE (TIMB),
CONGRUITY (CONG). Significant values are in bold.

Factor                    F value           P value
TRAIN                     F1,23 < 0.01      0.996
RET                       F1,23 = 124.85    <0.001
TIMB                      F1,23 = 0.06      0.816
CONG                      F1,23 = 382.47    <0.001
TRAIN:RET                 F1,23 = 1.09      0.307
TRAIN:TIMB                F1,23 = 4.84      0.012
TRAIN:CONG                F1,23 = 0.15      0.706
RET:TIMB                  F1,23 < 0.01      0.988
RET:CONG                  F1,23 = 0.17      0.686
TIMB:CONG                 F1,23 < 0.01      0.908
TRAIN:RET:TIMB            F1,23 = 0.44      0.512
TRAIN:RET:CONG            F1,23 < 0.01      0.958
TRAIN:TIMB:CONG           F1,23 = 4.09      0.055
RET:TIMB:CONG             F1,23 = 1.03      0.312
TRAIN:RET:TIMB:CONG       F1,23 = 0.75      0.396

Table 2a: GLMM ANOVA for minimal fitted model for experiment 1, with fixed effects
TRAINING (TRAIN); RETENTION (RET); TIMBRE (TIMB); CONGRUITY (CONG),
significant values in bold. Table 2b: Model estimates and z values (only significant
parameters are displayed). TRAINn = TRAINING (non-musician); TIMBv = TIMBRE
(voice); CONGs = CONGRUITY (same).

ANOVA with Satterthwaite approximation of df
Factor                    F value    df (Satt.)   P (>F)
TRAIN                     4.83       23.0         0.038
RET                       14.25      3643.5       <0.001
TIMB                      45.70      3643.5       <0.001
CONG                      6.89       3642.2       0.009
TRAIN:TIMB                13.71      3643.5       <0.001
RET:TIMB                  5.95       3643.5       0.015
TIMB:CONG                 8.59       3642.2       0.003
TRAIN:TIMB:CONG           3.09       3642.2       0.048
RET:TIMB:CONG             1.27       3642.2       0.281
TRAIN:RET:TIMB:CONG       3.51       3642.8       0.007

Parameter (fixed effects)     Estimate   z value   P (>|z|)
TRAINn                        -1.167     -3.798    <0.001
TIMBv                         -1.713     -7.208    <0.001
CONGs                         -0.955     -3.882    <0.001
TRAINn:TIMBv                  1.257      3.779     <0.001
RETs:TIMBv                    0.837      2.493     0.013
TIMBv:CONGs                   1.254      4.103     <0.001
TRAINn:TIMBv:CONGs            -0.583     -2.060    0.039
RETs:TIMBv:CONGs              -0.719     -2.713    0.006
TRAINn:RETs:TIMBv:CONGs       0.879      2.996     0.003
Table 3: ANOVA summary of RT measurements for experiment 2, with between-subject factor
TRAINING (TRAIN) and within-subject factors DISTRACTOR (DISTR), TIMBRE (TIMB),
CONGRUITY (CONG), and significant values in bold.

Factor                      F value           P value
TRAIN                       F1,21 = 1.03      0.322
DISTR                       F2,42 = 0.55      0.583
TIMB                        F1,21 = 9.16      0.006
CONG                        F1,21 = 370.11    <0.001
TRAIN:DISTR                 F2,42 = 0.09      0.915
TRAIN:TIMB                  F1,21 = 0.03      0.871
TRAIN:CONG                  F1,21 < 0.01      0.978
DISTR:TIMB                  F2,42 = 1.15      0.326
DISTR:CONG                  F2,42 = 4.10      0.024
TIMB:CONG                   F1,21 = 0.47      0.500
TRAIN:DISTR:TIMB            F2,42 = 1.85      0.169
TRAIN:DISTR:CONG            F2,42 = 4.99      0.011
TRAIN:TIMB:CONG             F1,21 = 0.47      0.465
DISTR:TIMB:CONG             F2,42 = 0.43      0.656
TRAIN:DISTR:TIMB:CONG       F2,42 = 0.03      0.467

Table 4a: GLMM ANOVA for minimal fitted model for experiment 2, with fixed effects
TRAINING (TRAIN); DISTRACTOR (DISTR); TIMBRE (TIMB); CONGRUITY (CONG),
significant values in bold. Table 4b: Model estimates and z values. TRAINm = TRAINING
(musician); DISTRt = DISTRACTOR (tapping); DISTRsv = DISTRACTOR (subvocal);
TIMBv = TIMBRE (voice); CONGs = CONGRUITY (same); only significant parameters
are displayed.

ANOVA with Satterthwaite approximation of df
Factor             F value    df (Satt.)   P (>F)
TRAIN              4.47       22.0         0.035
DISTR              26.80      5293.8       <0.001
TIMB               114.65     5293.8       <0.001
CONG               56.16      5293.8       <0.001
DISTR:TIMB         3.81       5293.8       0.022
DISTR:CONG         3.43       5293.8       0.032
TIMB:CONG          22.50      5293.8       <0.001

Parameter (fixed effects)   Estimate   z value   P (>|z|)
TRAINm                      0.391      2.359     0.018
DISTRt                      -1.008     -6.906    <0.001
DISTRsv                     -0.950     -6.510    <0.001
TIMBv                       -1.260     -9.233    <0.001
DISTRt:TIMBv                0.419      2.623     0.009
DISTRsv:TIMBv               0.418      2.630     0.009
DISTRt:CONGs                0.459      2.915     0.004
TIMBv:CONGs                 0.596      4.739     <0.001
Table 5a: GLMM ANOVA for minimal fitted model for experiment 3, with fixed effects
TRAINING (TRAIN); DISTRACTOR (DISTR); TIMBRE (TIMB) and subjects as random
effects. Significant values are in bold. Table 5b: Model estimates and z values. TRAINn =
TRAINING (non-musician); DISTRsv = DISTRACTOR (subvocal); TIMBv = TIMBRE
(voice).

ANOVA with Satterthwaite approximation of df
Factor     F value    df (Satt.)   P (>F)
TRAIN      6.36       24.0         0.012
DISTR      5.78       3712.1       0.003
TIMB       641.15     3712.1       <0.001

Parameter (fixed effects)   Estimate   z value   P (>|z|)
TRAINn                      -1.492     -4.015    <0.001
DISTRsv                     -0.639     -4.347    <0.001
TIMBv                       -2.284     -25.191   <0.001

Table 6: ANOVA summary of RT measurements for experiment 4, with factors TRAINING
(TRAIN), RETENTION (RET), RHYTHM (RHYT), CONGRUITY (CONG). Significant values
in bold.

Factor                     F value           P value
TRAIN                      F1,20 = 7.70      0.012
RET                        F1,20 = 45.17     <0.001
RHYT                       F2,40 = 9.77      <0.001
CONG                       F1,20 = 502.55    <0.001
TRAIN:RET                  F1,20 = 2.43      0.135
TRAIN:RHYT                 F2,40 = 0.65      0.527
TRAIN:CONG                 F1,20 = 3.91      0.062
RET:RHYT                   F2,40 = 3.21      0.058
RET:CONG                   F1,20 = 0.32      0.575
RHYT:CONG                  F2,40 = 29.68     <0.001
TRAIN:RET:RHYT             F2,40 = 0.12      0.885
TRAIN:RET:CONG             F1,20 = 1.08      0.311
TRAIN:RHYT:CONG            F2,40 = 0.09      0.911
RET:RHYT:CONG              F2,40 = 1.00      0.378
TRAIN:RET:RHYT:CONG        F2,40 = 0.25      0.782
Table 7a: GLMM ANOVA for minimal fitted model for experiment 4, with fixed effects
TRAINING (TRAIN); RETENTION (RET); RHYTHM (RHYT); CONGRUITY (CONG), and
significant values in bold. Table 7b: Model estimates and z values. TRAINm = TRAINING
(musician); RHYTf = RHYTHM (filled); RHYTv = RHYTHM (vocal); CONGs =
CONGRUITY (same); only significant parameters displayed.

ANOVA with Satterthwaite approximation of df
Factor                F value    df (Satt.)   P (>F)
TRAIN                 7.92       20.0         0.011
RET                   2.88       3557.7       0.090
RHYT                  134.20     3557.4       <0.001
CONG                  39.07      3557.9       <0.001
TRAIN:RHYT            2.03       3557.4       0.135
RET:RHYT              11.48      3557.4       <0.001
RET:CONG              83.97      3557.6       <0.001
RHYT:CONG             31.69      3557.4       <0.001
TRAIN:RHYT:CONG       10.94      3557.5       <0.001

Parameter (fixed effects)    Estimate   z value   P (>|z|)
TRAINm                       0.957      2.723     0.006
RHYTf                        -2.769     -10.690   <0.001
RHYTv                        -1.507     -5.577    <0.001
CONGs                        1.590      6.170     <0.001
RETs:RHYTv                   0.582      1.986     0.047
RETs:CONGs                   2.128      9.977     <0.001
RHYTf:CONGs                  2.682      8.612     <0.001
RHYTv:CONGs                  2.947      7.829     <0.001
TRAINm:RHYTf:CONGs           -1.480     -5.471    <0.001
TRAINm:RHYTv:CONGs           -1.744     -4.507    <0.001
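The decision experiments summarized in these tables score same/different (CONGRUITY) judgments. As a purely illustrative sketch, not the authors' analysis pipeline, per-condition sensitivity in such a task can be expressed as a d' score from signal detection theory, in the spirit of the Wright, Horry, and Skagerberg (2009) functions cited above; the response counts below are invented for illustration.

```python
from statistics import NormalDist

def dprime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity (d') for same/different judgments, using a log-linear
    correction (add 0.5 to each count) so that perfect hit or false-alarm
    rates do not produce infinite z-scores."""
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts for one participant in one condition:
# 38 hits, 10 misses, 12 false alarms, 36 correct rejections.
print(round(dprime(38, 10, 12, 36), 2))
```

A d' near zero indicates chance-level discrimination of "different" from "same" trials; larger values indicate better sensitivity, independently of any response bias toward one answer.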