Reproducibility of fMRI in the clinical setting: Implications for trial designs


NeuroImage 42 (2008) 603–610


R. Bosnell a, C. Wegner a, Z.T. Kincses a, T. Korteweg c, F. Agosta b, O. Ciccarelli d, N. De Stefano e, A. Gass g, J. Hirsch g, H. Johansen-Berg a, L. Kappos g, F. Barkhof, L. Mancini d, F. Manfredonia d, S. Marino e, D.H. Miller d, X. Montalban h, J. Palace a, M. Rocca b, C. Enzinger f, S. Ropele f, A. Rovira h, S. Smith a, A. Thompson d, J. Thornton d, T. Yousry d, B. Whitcher i, M. Filippi b, P.M. Matthews a,i,j,⁎

a Centre for Functional Magnetic Resonance Imaging of the Brain, University of Oxford, UK
b Neuroimaging Research Unit, Department of Neurology, Scientific Institute and University, Ospedale San Raffaele, Milan, Italy
c Department of Radiology, VU University Medical Centre, Amsterdam, The Netherlands
d NMR Research Unit, Institute of Neurology, University College London, London, UK
e Department of Neurological and Behavioural Sciences, University of Siena, Italy
f Department of Neurology, Medical University Graz, Graz, Austria
g Department of Neurology, University Hospital, Kantonsspital, Basel, Switzerland
h Department of Radiology, Magnetic Resonance Unit, Hospital Vall d'Hebron, Barcelona, Spain
i Department of Clinical Neurosciences, Imperial College, London, UK
j Current address: Clinical Imaging Centre, Clinical Pharmacology and Discovery Medicine, GlaxoSmithKline, UK

Abbreviations: EDSS, Extended Disability Status Score; EPI, echo planar image; fMRI, functional MRI; MNI, Montreal Neurological Institute; PMd, dorsal premotor cortex; ROI, region of interest.

⁎ Corresponding author. GSK Clinical Imaging Centre, Hammersmith Hospital, Du Cane Road, London W12 0NN, UK.
E-mail address: [email protected] (P.M. Matthews).

1053-8119/$ – see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.neuroimage.2008.05.005

Article info

Article history:
Received 22 February 2007
Revised 26 April 2008
Accepted 7 May 2008
Available online 15 May 2008

Keywords:
Multiple sclerosis
Functional magnetic resonance imaging
Clinical trial
Plasticity

Abstract

With expanding potential clinical applications of functional magnetic resonance imaging (fMRI) it is important to test how reliable different measures of fMRI activation are between subjects and sessions and between centres. This study compared variability across 17 patients with multiple sclerosis (MS) and 22 age-matched healthy controls (HC) in 5 European centres performing an fMRI block design with hand tapping. We recruited subjects from sites using 1.5 T scanners from different manufacturers. 5 healthy volunteers also were studied at each of 4 of the centres. We found that reproducibility between runs and sessions for single individuals was consistently much greater than between individuals. There was greater run-to-run variability for MS patients than for HC. Measurements of maximum signal change (MSC) appeared to provide higher reproducibility within individuals and greater sensitivity to differences between individuals than region of interest (ROI) suprathreshold voxel counts. The variability in measurements between centres was not as great as that between individuals. Consistent with these observations, we estimated that power should not be reduced substantially with use of multi-, as opposed to single-, centre study designs with similar numbers of subjects. Multi-centre interventional studies in which fMRI is used as an outcome measure thus appear practical even when implemented in conventional clinical environments.

© 2008 Elsevier Inc. All rights reserved.

Introduction

Clinical applications of functional MRI (fMRI) are expanding rapidly (Matthews et al., 2006). In multiple sclerosis and stroke, for example, fMRI has been employed to identify altered networks involved in motor control and to identify potentially adaptive reorganisation and plasticity in ways that may contribute to establishing prognosis (Matthews, 2004; Rocca et al., 2005). In epilepsy surgery, fMRI enables identification of hemispheric language dominance to aid surgical planning (Baciu et al., 2005). FMRI may be useful in predicting subsequent clinical progression in people at risk of developing Alzheimer's disease (Bookheimer et al., 2000). With this increasing use of fMRI in the clinical domain, and the possibility that results could influence individual treatment (Bartsch et al., 2006), there is a need to assess the reproducibility of fMRI activation measures in the clinical setting. Expanding clinical applications of fMRI also raise the possibility for use of fMRI as an outcome measure in multi-centre studies of new therapies. It is therefore important to understand the extent to which results can be reproduced across different centres.

There are many sources of variability in fMRI. At the level of the individual subject, the BOLD signal can be affected by psychological or physiological factors. Caffeine ingestion (Behzadi and Liu, 2006; Liu et al., 2004), sleep deprivation (Chee and Choo, 2004), or fatigue (Tartaglia et al., 2008), for example, alter the BOLD response. Psychological factors, such as different levels of attention to sensory stimuli (Corbetta et al., 1991) or movement (Johansen-Berg et al., 2000) also modulate brain responses. Scanner hardware differences may contribute to variation in results between centres (e.g., from differences in field strength or radiofrequency coil configurations) (Friedman and Glover, 2006). Factors such as drift of the B0 magnetic field and differences in shim quality additionally contribute to day-to-day variation with use even of the same scanner. The method used for analysis can affect the reproducibility of results obtained (Smith et al., 2005). A less recognised contribution to this arises from the substantial structured noise in fMRI data, which can vary from session-to-session, subject-to-subject and centre-to-centre (Tegeler et al., 1999). Finally, the precise details of a paradigm used in a multi-site experiment (e.g., stimulation method, response recording, or acquisition parameters) all influence results and add to variation unless controlled (Jezzard and Buxton, 2006). Anatomical variations between subjects limit the precision of analyses based on regions of interest (Uylings et al., 2005).

Previous assessments of variability in fMRI results have been carried out predominantly using healthy controls (Cohen and DuBois, 1999; Marshall et al., 2004; Miki et al., 2000; Tjandra et al., 2005). Most studies have established greater variation in results from fMRI signal changes between sessions than within a session (Marshall et al., 2004; McGonigle et al., 2000). Studies that have considered the importance of the method of signal quantification (Tegeler et al., 1999) have emphasised the poor test–retest reliability of suprathreshold voxel counts (Cohen and DuBois, 1999; Marshall et al., 2004). Thirion et al. argue that optimisation of the significance thresholds can improve reliability and emphasise that inter-individual signal variation is a major factor limiting the power of group studies (Thirion et al., 2007).

Very few studies have compared fMRI measures across different sites. An early study performed by Casey et al. compared activation during a motor task across four sites (Casey et al., 1998). Their study demonstrated qualitatively similar patterns of brain activity obtained at the different scanning sites, but did not include a detailed quantitative analysis of the reproducibility of the fMRI signals. The FBIRN collaboration has attempted a more comprehensive investigation. A first study reported significant variability between subjects with greater reproducibility across sites with higher field strength (3 and 4 T relative to 1.5 T) (Zou et al., 2005). In the second report, the consortium emphasised that without controlling well for instrument-related factors, site-to-site differences contribute substantially to variance (Friedman and Glover, 2006). However, Suckling et al. recently presented evidence that if identical field strengths are used, inter-site variance made a relatively small contribution to variance in an fMRI study involving two sites (Suckling et al., 2007).

The current study aims to further explore this question by estimating the reproducibility of fMRI using data obtained from five different 1.5 T clinical scanning sites during performance of a simple motor task. To test whether comparable reproducibility is seen in patients and healthy subjects, we acquired data from patients with multiple sclerosis and from an age- and sex-matched group of healthy controls. The experimental design enabled variability across run, session and centre to be assessed. Testing the reproducibility of fMRI data across multiple sites provides data that can help to assess the statistical power gains achieved with larger studies across multiple sites versus smaller studies in a single site. Variance for the same data set quantified in different ways was assessed by comparing measures based on the suprathreshold voxel count (SVC) and the peak relative signal change (MSC), both within the primary sensorimotor cortex.

Methods

Subjects

Patients and controls

17 patients with multiple sclerosis (mean age, 32 years [range 19–47 years]; median EDSS 2 [range 0–7.5], 9 females and 8 males) and 22 age-matched healthy controls (mean age, 35 years [range 24–47 years], 12 females and 10 males) participated in this study, which was a subset of a larger multi-centre study evaluating the application of fMRI in this disease area (Wegner et al., 2008). Centres individually contributed varying proportions of patients and healthy controls for this analysis (London, 5 controls, 5 patients; Oxford, 3 controls, 4 patients; Siena, 4 controls, 2 patients; Barcelona, 5 controls, 4 patients; Amsterdam, 5 controls, 2 patients). To allow inter-site variance to be studied, a separate group (n=5) of healthy volunteers were scanned in at least 4/5 of the MRI centres. The subjects were all right-handed (Edinburgh Handedness Inventory) (Oldfield, 1971) and were non-smokers. Patients were diagnosed with either relapsing-remitting or secondary progressive MS and had not had a relapse or required corticosteroids for at least 3 months prior to scanning. The Expanded Disability Status Scale (EDSS) score (Kurtzke, 1983) at entry to the study was ≤7.5 and the patients did not have clinically-evident right hand impairment. The subjects were recruited from five European Centres (London, Oxford, Siena, Graz, Amsterdam). Local ethics approval was obtained at all sites and all subjects gave informed consent.

Imaging parameters

Brain MRI scans were obtained using a magnet operating at 1.5 T at all sites (Centre 1: Siemens Sonata, Centre 2: Siemens Symphony Maestro Class, Centre 3: GE Signa Excite 11.0, Centre 4: Siemens Sonata, Centre 5: Philips Gyroscan ACS-NT15). Sagittal T1-weighted images were acquired to define the anterior–posterior commissural plane. FMRI data were obtained using a multi-slice gradient echo planar imaging (EPI) sequence (echo time=60 ms, repetition time=3000 ms, field of view 240×240 mm², matrix 64×64). 21 contiguous axial slices were acquired parallel to the anterior–posterior commissural plane, with a thickness of 6 mm covering the whole brain during each measurement. Each fMRI session included four fMRI sequences lasting 6 min with 120 brain volume image acquisitions in each. T1-weighted high resolution MRI scans were acquired to facilitate anatomical localisation of the functional data through registration (resolution of the T1-weighted MRI: 1×0.5×0.5 mm for centre 2, 1×1×1 mm for centre 4, 1×1.5×1 mm for centre 1 and centre 3, and 1×1×3 mm for centre 5).

FMRI paradigm

A visually-cued hand-tapping task was performed using a “block” design (Donaldson and Buckner, 2003) with 6 periods of 30 s movement alternating with 30 s rest. This entire sequence was repeated 4 times in each scanning session, with the subject remaining in the scanner across all repetitions. Each repeated sequence set will subsequently be referred to as a “run” (i.e., there were four runs in one scanning session). All patients and controls underwent two scanning sessions on each testing day.

Table 1
Median activation in sensorimotor cortex (SMC) for the first and second scanning sessions across the five centres

A. Maximum peak relative signal change (MSC) (arbitrary units)

Centre   Patients                         Controls
         Session 1       Session 2        Session 1       Session 2
1        303 (268–409)   266 (145–770)    176 (102–601)   239 (127–592)
2        207 (69–456)    158 (99–230)     243 (119–361)   200 (110–399)
3        251 (141–602)   271 (141–710)    324 (174–709)   527 (165–663)
4        278 (188–560)   301 (178–433)    257 (115–366)   298 (192–439)
5        311 (276–359)   350 (276–466)    274 (103–476)   248 (197–454)

B. Suprathreshold voxel count (SVC) (threshold, z=3.1)

Centre   Patients                         Controls
         Session 1       Session 2        Session 1       Session 2
1        96 (32–119)     61 (40–119)      65 (22–86)      81 (42–105)
2        74 (0–125)      53 (0–86)        97 (42–122)     82 (27–106)
3        96 (41–143)     89 (41–125)      75 (34–104)     92 (0–118)
4        79 (57–126)     87 (68–121)      91 (43–117)     69 (49–81)
5        86 (0–123)      99 (44–126)      86 (0–123)      99 (44–126)

C. Suprathreshold voxel count (SVC) (threshold, z=2.3)

Centre   Patients                         Controls
         Session 1       Session 2        Session 1       Session 2
1        111 (42–130)    82 (57–131)      76 (40–101)     87 (47–119)
2        87 (0–132)      67 (0–95)        107 (61–128)    97 (40–117)
3        103 (54–145)    103 (54–129)     87 (46–116)     100 (0–127)
4        90 (71–136)     87 (68–121)      103 (59–130)    78 (57–96)
5        96 (0–128)      105 (54–135)     91 (0–119)      65 (0–116)

Data summarised here were used for variability calculations related to patient and matched healthy control changes. Median values are given with ranges in parentheses for different measures of signal change. As expected, reducing the threshold for the SVC measure increases the magnitude of the activation change.

The flexion–extension motor task involved moving the four fingers of the right hand together inside a wooden frame (identically manufactured for all sites) that restricted the maximum amplitude of the extension to approximately 3 cm from the rest position. All centres also were supplied with a standardised metronome equipped with a red flashing LED to pace the movements at a 1 Hz frequency. Patients and controls were trained to follow the LED cue before performing the experiment. Subjects were monitored visually during the scanning to ensure compliance.
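The timing of the block design can be checked against the acquisition parameters with a short sketch. The following Python fragment is illustrative only and is not the stimulus or analysis code used in the study. It assumes that each run began with a rest block (the order is not stated in the text), and the names `boxcar_regressor` and `START_WITH_REST` are hypothetical.

```python
# Illustrative sketch of the block-design timing (not the study's own code).
TR_S = 3.0               # repetition time in seconds (3000 ms)
BLOCK_S = 30.0           # duration of each movement or rest block in seconds
N_CYCLES = 6             # six movement periods alternating with rest
START_WITH_REST = True   # assumption: the block order is not stated in the text


def boxcar_regressor(tr_s=TR_S, block_s=BLOCK_S, n_cycles=N_CYCLES,
                     start_with_rest=START_WITH_REST):
    """Return one value per volume: 1 during movement blocks, 0 during rest."""
    vols_per_block = int(round(block_s / tr_s))
    first, second = (0, 1) if start_with_rest else (1, 0)
    one_cycle = [first] * vols_per_block + [second] * vols_per_block
    return one_cycle * n_cycles


regressor = boxcar_regressor()
print(len(regressor))          # volumes per run
print(len(regressor) * TR_S)   # run duration in seconds
print(sum(regressor))          # volumes acquired during movement
```

Under these assumptions the sketch yields 120 volumes and a 360 s run, which agrees with the 6 min sequences of 120 volumes described under Imaging parameters.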

Data analysis

The analysis was carried out at a single site (Oxford) using tools from the FMRIB Software Library (www.fmrib.ox.ac.uk/fsl). Pre-statistical processing was applied during the first-level (within sequence) analysis: motion correction (Jenkinson et al., 2002), spatial smoothing using a Gaussian kernel of full-width half-maximum 8 mm, and non-linear high-pass temporal filtering (Gaussian-weighted LSF straight line fitting, with sigma=80.0). Use of such a relatively large Gaussian kernel should reduce variance contributed by inter-scanner differences in image smoothness (Friedman and Glover, 2006). Statistical analysis was carried out using the general linear model (GLM) with local autocorrelation correction (Woolrich et al., 2001). Z-stat images were thresholded at z=3.1, p=0.05 and z=2.3, p=0.05. Registration of EPI functional images to high resolution and standard (Montreal Neurological Institute) space was carried out using affine transformation with 12 degrees of freedom (Jenkinson and Smith, 2001). A region of interest (ROI) was defined manually in the contralateral hand area (Yousry et al., 1997) of the primary sensorimotor cortex in standard space, which was then registered to the individual functional images on the T1-weighted scan for each subject (Jenkinson and Smith, 2001). For each subject, session, and run the number of suprathreshold voxels (SVC) within the ROI and the peak relative signal change of the maximally activated voxel (MSC) within the ROI were calculated. Three measurements (SVC from high and low threshold statistical images and MSC) were therefore obtained for each variable.
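The two activation measures can be made concrete with a small sketch. The Python fragment below is a hypothetical illustration and does not reproduce the FSL-based pipeline. It assumes that the z-statistic and relative signal change of every ROI voxel are already available as plain lists, that MSC is the largest signal change among suprathreshold voxels, and that MSC is undefined when no voxel passes the threshold. The function name `roi_measures` and the example values are invented for illustration.

```python
def roi_measures(z_values, signal_change, z_threshold=3.1):
    """Return (SVC, MSC) for one run from per-voxel values within the ROI.

    z_values and signal_change are equal-length sequences with one entry
    per ROI voxel. MSC is None when no voxel exceeds the threshold
    (an assumption; the text does not describe this case).
    """
    if len(z_values) != len(signal_change):
        raise ValueError("z_values and signal_change must have equal length")
    # Keep the signal change of every voxel whose z-statistic passes the threshold.
    supra = [s for z, s in zip(z_values, signal_change) if z > z_threshold]
    svc = len(supra)
    msc = max(supra) if supra else None
    return svc, msc


# Hypothetical values for a four-voxel ROI.
z = [1.0, 2.5, 3.4, 4.0]
change = [0.2, 0.5, 1.1, 0.9]
print(roi_measures(z, change, z_threshold=3.1))
print(roi_measures(z, change, z_threshold=2.3))
```

Lowering the threshold can only add voxels to the count, which is why SVC at z=2.3 is never smaller than SVC at z=3.1 for the same data.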

Statistical analysis

General statistical analyses were performed using the Statistical Package for Social Sciences (SPSS 11.0.4 for OS X, SPSS Inc., http://www.spss.com).

An intraclass correlation (ICC) analysis was used as the primary test of the reliability of measures. An intraclass correlation (ρ) was calculated:

$$\rho = \frac{MS_{BS} - MS_{WS}}{MS_{BS} + (k-1)\,MS_{WS}},$$

where $MS_{BS}$ is the between-subject (or centre) mean of squares, $MS_{WS}$ the within subject (or centre) mean of squares and $k$ the number of measurements. ICC is bounded between −1 and 1.

To assess reliability between runs, an ICC between all four runs of the first session was calculated. Patients and controls were considered separately in all analyses. To assess reliability between sessions, an ICC was calculated using the first run from the two sessions. Finally, an ICC was calculated across subjects to assess the reproducibility of measures between individuals. An ICC approaching 1 would indicate high reproducibility in each case, while a value close to 0 indicates very low reproducibility. However, while ICC is a useful metric for understanding reproducibility in the context of group studies, values need to be interpreted in the context of the total MS error term, e.g., a low ICC does not necessarily equate to poor reproducibility within individual subjects (reflected in $MS_{WS}$). To assess reliability between centres, ICC were calculated using data collected from the subset of healthy controls able to be studied at multiple centres. An ICC similar to that determined for reliability between sessions would indicate that there was little additional variability contributed by conducting measurements across different centres.
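A minimal sketch of the ICC calculation may help in reading the values reported later. The Python fragment below assumes that the between- and within-subject mean squares come from a one-way analysis of variance with the same number of measurements k for every subject. The source does not specify how the mean squares were obtained, so this is one plausible reading. The function name `icc_oneway` and the example data are hypothetical.

```python
def icc_oneway(data):
    """Intraclass correlation from one-way ANOVA mean squares.

    data is a list of subjects, each a list of k repeated measurements.
    Returns (MS_BS - MS_WS) / (MS_BS + (k - 1) * MS_WS).
    """
    n = len(data)
    k = len(data[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need at least 2 subjects with k >= 2 measurements each")
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    # Between-subject mean square (n - 1 degrees of freedom).
    ms_bs = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square (n * (k - 1) degrees of freedom).
    ms_ws = sum((x - m) ** 2
                for row, m in zip(data, subject_means)
                for x in row) / (n * (k - 1))
    denominator = ms_bs + (k - 1) * ms_ws
    if denominator == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_bs - ms_ws) / denominator


# Hypothetical example: three subjects, two runs each.
example = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(icc_oneway(example))
```

In this formulation a large within-subject mean square relative to the between-subject mean square drives the ICC towards zero or below, which is the situation the caveat about interpreting low ICC values refers to.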

Coefficients of variation (CV) also were calculated and are reported in the Appendix:

$$\mathrm{CV} = 100 \times \frac{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}}{\bar{x}},$$

where $x_i$ (for $i$=1–4) are the measurements of effect size (number of suprathreshold voxels) from run 1 to 4 of one subject at a single centre and $\bar{x}$ is the mean of the four runs (so that $n$=4).
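The between-run CV for one subject can be sketched in a few lines. The fragment below is illustrative rather than the authors' code. It uses the sample standard deviation (n−1 denominator) as in the formula, and the function name `coefficient_of_variation` and the run values are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Return 100 * sample standard deviation / mean for one set of runs."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # statistics.stdev uses the n - 1 denominator, matching the formula above.
    return 100.0 * statistics.stdev(values) / mean


# Hypothetical suprathreshold voxel counts for runs 1 to 4 of one subject.
runs = [80, 90, 100, 110]
print(round(coefficient_of_variation(runs), 2))
```

Because the mean appears in the denominator, the CV becomes unstable for subjects whose voxel counts are close to zero, which is worth keeping in mind given that several SVC ranges in Table 1 include 0.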

Differences in between-run CV (calculated from the average of run one to four) were assessed using a Kruskal Wallis test for a main effect of centre using age- and sex-matched controls. The Kruskal Wallis test was run separately for CVs derived from SVCs from high (z=3.1) and low (z=2.3) thresholds and for MSC for the two subject groups (controls and patients).

Between-subject variability was assessed based on the first runs of the first sessions or the mean of runs in the first sessions. CVs were compared between groups (patients and controls) using a Mann–Whitney U-test.

Between-centre variability was assessed in a similar way for the first run, the first session and across sessions by averaging values for each measure across all subjects in a single centre and then calculating a CV from each of these average values across all centres as a summary statistic.

To provide an additional index of variability between sessions, we also calculated a normalised difference score (NDS) (see Appendix) as the ratio of activation differences ((session 2−session 1)/mean (session 1, session 2)). These normalised difference scores were then used to assess variability between sessions across centres using a Kruskal Wallis test. The between session variation across groups (patients and controls) was then calculated using a Kruskal Wallis test.
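The normalised difference score is simple enough to state as code. The fragment below is a sketch with a hypothetical function name and made-up input values, and it assumes one activation value per session for a given subject and measure.

```python
def normalised_difference_score(session1, session2):
    """Return (session 2 - session 1) / mean(session 1, session 2)."""
    mean = (session1 + session2) / 2.0
    if mean == 0:
        raise ValueError("NDS is undefined when the mean of the sessions is zero")
    return (session2 - session1) / mean


# Hypothetical activation values for one subject in the two sessions.
print(round(normalised_difference_score(100.0, 120.0), 4))
```

The score is signed, so a positive value indicates greater activation in the second session, and it is symmetric in the sense that swapping the two sessions only changes the sign.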

Study power estimations (two-tailed, α=0.05 with 1−β=0.8) were performed using a downloadable Java applet (http://www.cs.uiowa.edu/~rlenth/Power/) to calculate numbers of subjects required to detect 10, 20 or 30% differences in peak relative signal change or voxel count based on data from the healthy controls. The calculations are based on computations of the critical value from a central t-distribution and the probability of the critical region from the non-central t-distribution (Lenth, 2007).

Results

We tested for factors potentially contributing to fMRI variance in activation during hand movement within a region of interest (ROI) confined to the contralateral hand region of the primary sensorimotor cortex (SMC). We quantified activation for each run in two different ways: first, by determining the maximum relative signal change in suprathreshold SMC voxels (MSC) and second, by counting the number of suprathreshold voxels (SVC) within the SMC ROI (Table 1). The SVC were calculated from z-stat images at two different thresholds (z=2.3, z=3.1).

Using data from patients and healthy controls obtained across all of the participating centres, the activation measures first were tested for effects of centre (5 centres), group (patients or healthy controls), session (first or second session) or run (4 runs). The 2(group)×2(session)×5(centre)×4(run) multiway, multivariate ANOVA for MSC showed a significant main effect of run [F (3, 63)=16.85, p=0.001] only. There were significant interactions between session and centre [F (4, 21)=4.074, p=0.018], run and centre [F (4, 21)=2.6, p=0.012] and run and group [F (4, 21)=3.7, p=0.022].

The equivalent ANOVA on the SVC measures (threshold, z=3.1) also revealed a significant main effect of run [F (3, 84)=4.37, p=0.014]. Using a lower threshold (z=2.3), ANOVA did not show a significant main effect of run, but similar trends in activation signal change were seen (Fig. 1C).

Fig. 1. Change in suprathreshold voxel count over four consecutive runs during the two scanning sessions. The suprathreshold voxel count within the SMC ROI for patients and controls is plotted against run (first session, black bar; second session, grey bar). Standard deviations are shown. A main effect of run was found for (A) MSC [F (3, 63)=16.85, p=0.001] and a significant main effect of run [F (3, 84)=4.37, p=0.014] and interaction of run by session [F (3, 84)=3.59, p=0.023] were found for SVC (threshold, z=3.1). At a lower threshold (z=2.3), SVC showed similar trends, but only the interaction of run by session was significant [F (3, 81)=3.726, p=0.014] (data not shown).

Table 2
Intraclass correlation coefficients (ICC) between runs, sessions and subjects based on data using either the MSC or SVC

A. ICC between run, session and across sessions for patients and controls

Measure       Group      Between run   Between session,    Between session,
                                       first runs only     means
MSC           Controls   0.91          0.83                0.82
              Patients   0.76          0.62                0.69
SVC (z>3.1)   Controls   0.57          0.45                0.47
              Patients   0.28          0.09                0.41
SVC (z>2.3)   Controls   0.59          0.49                0.46
              Patients   0.24          −0.06               0.34

B. ICC between centres

Measure       Run 1   Run 2   Run 3   Run 4   Session mean of runs
MSC           0.54    0.44    0.39    0.56    0.52
SVC (z>2.3)   0.53    0.39    0.32    0.36    0.46
SVC (z>3.1)   0.52    0.36    0.31    0.39    0.45

C. ICC between subjects across all sites or within a single site

Measure       Subject    Mean across sessions   Mean across session
                         and centres            within centres
MSC           Controls   0                      0
              Patients   0                      0
SVC (z>3.1)   Controls   0                      0.1
              Patients   0                      −0.1
SVC (z>2.3)   Controls   0                      0.1
              Patients   0                      0.1

Data from patients and matched controls were used to estimate variability within a centre across runs or sessions (A). Overall, variability was greater between measures for patients than for healthy controls and greater when expressed with SVC than MSC. Data from healthy volunteers who were each studied at multiple centres were used to estimate the between-run reproducibility across centres (B). Overall, variability was little different for values from the first run alone relative to an average across four runs in the session. Variability showed a trend to increase with subsequent runs in a session. Differences across runs appeared less marked for MSC than SVC. Finally, data from all healthy controls and all patients across centres were used to estimate the between-subject variation (C). Between-subject reproducibility was diminishingly low, assessed either across all subjects or across subjects studied in individual centres.

Table 3
Estimates of sample sizes for potential intervention trials to detect changes in activation in a SMC ROI based on suprathreshold voxel count (SVC) or for the mean maximum peak relative signal change (MSC) for a single-centre or for a multi-centre study

Sample size estimates for within- or between-centre studies

Measure       Study design     Sample size for changes of different magnitudes
                               10%     20%     30%
MSC           Within centre    221     57      26
              Between centre   165     43      20
SVC (z>2.3)   Within centre    407     103     47
              Between centre   373     95      43
SVC (z>3.1)   Within centre    297     76      35
              Between centre   358     91      38

The estimated subject numbers to detect changes of varying sizes (10–30%) based on measures of variance made here are calculated for a one-sample, two-tailed test with alpha=0.05, power=0.80.

Sources of variability

We characterised the relative contributions from different sources to the overall variability between runs for the patients and healthy controls and between subjects studied within a centre using the intraclass correlation coefficient (ICC). Data from 5 healthy subjects studied at 4/5 centres were used to assess the variability between centres relative to that between subjects.

Between-run and -session variances

Similar ICC were found for the relative reproducibility of activation for a single subject relative to between subjects across runs in a single (first) session, across (first) runs in the two sessions and as the means of values across a session (Table 2A). The ICC was higher for all measures using MSC than SVC. There was a consistent trend for lower ICC with patients than with healthy controls. The lowest ICC was found for the SVC measure.

Between-subject variance

Consistent with the broad range of activations across both the healthy control and the patient populations in every centre (Table 1), regardless of the method of measurement, ICC between subjects relative to single subjects were diminishingly low (Table 2C), whether assessed across all subjects in a group (patients or healthy controls) or across subjects within a single centre. There was not a meaningful difference in relative reproducibility of absolute measures across subjects between the two groups.

Between-centre variance

Reproducibility between individuals in the assessments above is confounded by the possible contributions to variance arising from measurement differences between centres. ICC showed that reproducibility of measures across individual runs or across the means of four runs in the first session was greater for the individual healthy volunteers studied across the multiple centres than between the individual healthy controls studied within a single centre (Table 2B, C). Similar ICC values were obtained when either only the first run or the mean of all runs was considered. The ICC were similar for MSC and SVC measurements. As found in the analysis of patients and healthy controls, the ICC did not increase significantly using mean values over a whole session relative to those from the first run only.

Estimates of power to detect changes in activation

Finally, we asked whether our data provided evidence for a substantially greater power to detect change using a single- vs. a multi-site study design. We used data from the small, healthy volunteer group who were studied at more than one site for this estimation. Measurements using MSC demanded a similar number of subjects to detect changes for within- and between-centre designs (Table 3). The estimated sample sizes were lower with MSC measurements.
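To make concrete how a variance estimate feeds into a sample size estimate, the sketch below uses the standard normal approximation for a paired (pre/post) design. It is an assumption for illustration only, not the calculation behind Table 3; the function name `n_for_paired_change` and the input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_for_paired_change(delta, sd_diff, alpha=0.05, power=0.8):
    """Approximate number of subjects needed to detect a mean within-subject
    change `delta`, given the standard deviation of the paired differences
    `sd_diff`, using a two-sided test and the normal approximation:

        n = ((z_{1 - alpha/2} + z_{power}) * sd_diff / delta) ** 2
    """
    if delta <= 0 or sd_diff <= 0:
        raise ValueError("delta and sd_diff must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)


# Hypothetical inputs: a 20% change in activation, with the standard deviation
# of paired differences expressed in the same relative units (not study data).
n_small_noise = n_for_paired_change(delta=0.20, sd_diff=0.30)
n_large_noise = n_for_paired_change(delta=0.20, sd_diff=0.40)
print(n_small_noise, n_large_noise)
```

Because the required number of subjects scales with the square of `sd_diff / delta`, a modest increase in measurement variance from adding centres translates into only a modest increase in sample size when between-subject variance dominates.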

Discussion

We investigated the reproducibility of fMRI activation in the contralateral SMC in patients with MS and in healthy volunteers performing simple hand movements. A primary goal was to evaluate variability across subjects relative to that across centres in order to assess the feasibility of multi-centre fMRI protocols for clinical intervention trials. A second goal was to assess the relative variability of measures for a patient population in comparison to age-matched healthy controls.

We found that inter-individual activation variation greatly dominated over run-to-run or centre-to-centre variation. Substantial variation in absolute BOLD activations has been observed consistently before (see, e.g., Tjandra et al., 2005). Some elements of this variation (e.g., contributions from subject head movement) can be controlled to some extent. Other elements likely represent physiological differences or interactions of normal physiology with pathological changes.

R. Bosnell et al. / NeuroImage 42 (2008) 603–610

With only modest, clinically practical study constraints, mean absolute activation measures were similar across the sites. While the reproducibility of measures across centres was lower than within a centre for a single individual, the reproducibility of measurements made on the same subject across different centres was higher than the measurements between different subjects. In consequence, the estimated power to detect changes in activation was similar for multi-centre and single-centre fMRI studies. However, this needs the caveat that the total number of subjects evaluated for this part of our study was small. Also, we observed a consistent trend for lower reproducibility between runs for patients relative to that for healthy controls; between-run or -session variance for healthy controls and for patients cannot safely be assumed always to be the same.

Our observation that inter-subject variation dominates variance even in a multi-site study is in general agreement with results from a recent cognitive paradigm study conducted over two sites (Suckling et al., 2007), but appears to contrast with conclusions of the FBIRN group (Friedman and Glover, 2006). The latter group used a task similar to ours to study 5 individuals across 10 MRI centres. Overall, between-site reproducibility was reported to be lower than that within sites. However, their study identified major differences determined by the scanner field strength, which was controlled in our study. When Friedman and colleagues compared data just across the four 3 T scanner sites (leaving out data from one outlier site), controlled for differences in image smoothness and used a more optimised ROI selection, inter-site ICC rose to 0.58. This suggests fair to good inter-site reproducibility, as reported here.

While averaging greater numbers of runs within a session can improve test–retest reliability (Suckling et al., 2007), this has not been reported consistently (Friedman and Glover, 2006) and we did not observe this in our study. Previous studies have found order effects with a decrease in SVC from a first to a second session, for example (Havel et al., 2005), although others have found variously increased or decreased signal, depending on the brain region assessed (Loubinoux et al., 2001). We speculate that the run-to-run variability (and the trend for patients to show greater variability) in our study was a consequence of subject fatigue or lack of sustained attention across the longer protocol. These factors are potentially amenable to some degree of experimental control, e.g., by using the minimum possible paradigm length. However, a practically important conclusion that can be drawn is that, in some instances, shorter paradigms may be used without significant cost to study power.

Our results re-emphasise that the signal measurement method used influences reproducibility. We found higher ICC for an individual subject using MSC relative to SVC, consistent with previous arguments that high variability is inherent in SVC measures (Cohen and DuBois, 1999). These results suggest that exploration of more sophisticated measures of brain responses that simultaneously maximise intra-individual reproducibility (potentially optimising sensitivity to change, e.g., therapeutic response) and sensitivity to individual differences (potentially optimising sensitivity to state, e.g., disease) would be useful to improve future study designs.

There are limitations to our study. Low numbers of subjects were scanned at each site, limiting confidence in estimates of variance and power. This also reduced our sensitivity to detect site-specific differences. However, overall, this is one of the largest studies of this type to be conducted and the only one that we are aware of to directly address the issue of reproducibility in a clinical population. A second limitation is that only one fMRI paradigm was explored. An experimental advantage of using the hand-tapping task was that the associated brain activity has a well-defined functional anatomy (Yousry et al., 1997) and the task was able to be well-controlled across sites, minimising potential effects of differences in paradigm implementation. It is possible that complex paradigms will be more difficult to standardise, as paradigms giving lower activation may show less reproducibility across sites and that could relatively favour single site studies. However, one experience reported recently (Suckling et al., 2007) is encouraging that the conclusions reached here may be generalisable. Finally, additional parameters that could be controlled to optimise reproducibility (e.g., subject-specific GLM models, addition of physiological covariates to account for respiratory and cardiac-related variability (Wise et al., 2004)) were not included in our protocol. If included, these would be expected to reduce intra- and inter-subject variances, rather than inter-site variance.

Our study was designed to address an important, pragmatic issue for potential fMRI-based clinical studies. We have tried to put our results in the context of a potential practical application, e.g., for a small clinical therapeutic trial (Johansen-Berg et al., 2002) in which SMC activation is used as an outcome measure. Our study suggests that fMRI activation during a simple motor task is sufficiently reproducible to detect changes with only modest effect (e.g., 20%) in a reasonable-sized study population studied either within or across multiple centres (Table 3). Finding much lower variability between centres relative to that between individuals supports the notion that fMRI can be employed as an outcome measure in multi-centre patient trials (Suckling et al., 2007). Future work to further limit factors contributing to centre-to-centre variation, either through more refined paradigm standardisation or use of strategies to better standardise image acquisitions (Friedman and Glover, 2006), could additionally enhance the potential statistical power of multi-centre fMRI studies.

Acknowledgments

The authors thank the anonymous reviewers who contributed so generously to improving this manuscript. PMM thanks the MRC (UK) and the MS Society of Great Britain and Northern Ireland for the support. The design and preparation of this review were done under the auspices of the European Community network for Magnetic Resonance research in MS (MAGNIMS). PMM and BW are employees of GlaxoSmithKline.

Appendix A. Coefficients of variation and normalised difference scores

A.1. Coefficients of variation

To more fully characterise variance we performed additional analyses. CV were used for initial characterisation of response variability. We calculated the CV between runs for individual subjects within a single session (Appendix Table 1A). No significant differences in CV were seen between the different measurements (MSC or SVC) or between patients and controls.
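For readers who want to reproduce this kind of summary, the sketch below shows the conventional coefficient of variation (sample standard deviation divided by the mean, expressed as a percentage) computed between runs for each subject and between subjects. The helper names `cv_percent` and `summarise` and the example values are hypothetical and are not study data.

```python
from statistics import mean, median, stdev


def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


def summarise(cvs):
    """Median with (min, max) range, the summary format used in the tables."""
    return median(cvs), min(cvs), max(cvs)


# Hypothetical MSC values for 3 subjects x 4 runs (not study data).
runs_by_subject = {
    "s01": [2.0, 2.2, 1.9, 2.1],
    "s02": [3.1, 2.8, 3.3, 3.0],
    "s03": [1.1, 1.4, 1.2, 1.3],
}

# Between-run CV: one value per subject, then summarised across subjects.
between_run_cvs = [cv_percent(runs) for runs in runs_by_subject.values()]

# Between-subject CV: computed on each subject's mean across the session.
session_means = [mean(runs) for runs in runs_by_subject.values()]
between_subject_cv = cv_percent(session_means)

print(summarise(between_run_cvs), round(between_subject_cv, 1))
```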

A.2. Between-run and -session variances

The CV between runs for individual subjects (Appendix Table 1A) were lower than between subjects (Appendix Table 1B). CV calculated between subjects from across the first run or mean values for the first session or across sessions generally were similar (Table 2B). CV for MSC were consistently larger than were those for SVC.

A.3. Normalised difference scores

We further explored reproducibility of activation averaged across runs and sessions using the normalised difference score (NDS) (Appendix Table 2). NDS was not significantly different between measures (MSC or SVC); however, there was a trend for a higher NDS with higher SVC threshold. No meaningful reduction of NDS was found using the mean across a session relative to the first run alone. There was no significant difference in NDS between subject groups.
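The NDS is defined earlier in the paper. As a rough illustration only, the sketch below assumes one common form, the absolute difference between two measurements divided by their mean, which is bounded between 0 and 2 for non-negative values. This form is an assumption and may differ from the definition used by the authors; the function name `nds` and the example values are hypothetical.

```python
def nds(a, b):
    """Normalised difference score (assumed form): |a - b| / ((a + b) / 2).

    For non-negative inputs the score lies in [0, 2]: 0 when the two
    measurements agree, 2 when one of them is zero and the other is not.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch assumes non-negative measurements")
    if a == 0 and b == 0:
        return 0.0  # two identical (zero) measurements: no difference
    return abs(a - b) / ((a + b) / 2)


# Hypothetical suprathreshold voxel counts from two sessions (not study data).
print(nds(120, 150))
# A measure that drops to zero in one session gives the maximum score.
print(nds(0, 85))
```

Under this assumed form, voxel-count measures can reach the upper bound whenever no voxel survives thresholding in one of the two sessions.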

Table A1
Coefficient of variation of signal change between subjects, centres and runs, using the measurements of maximum peak relative signal change (MSC) or suprathreshold voxel count (SVC)

A. Between-run coefficients of variation

Measure          Controls      Patients
MSC              14 (11–20)    15 (6–43)
SVC (z > 3.1)    18 (10–27)    19 (13–48)
SVC (z > 2.3)    15 (9–26)     15 (11–45)

B. Between-subject coefficients of variation

Measure          Group       Single run    Mean across first session
MSC              Controls    41 (22–62)    47 (26–69)
MSC              Patients    42 (20–49)    36 (9–47)
SVC (z > 3.1)    Controls    22 (17–32)    27 (11–32)
SVC (z > 3.1)    Patients    34 (33–60)    24 (11–65)
SVC (z > 2.3)    Controls    19 (13–31)    21 (9–29)
SVC (z > 2.3)    Patients    28 (27–46)    21 (9–55)

Median values are shown with ranges in parentheses for between-run (A) or between-subject (B) variability. Between-subject variability is greater than that between runs for individual subjects. The median values for CV across centres are given in (C).

Table A2
Average normalised difference scores (NDS) across centres based on mean of four runs or with a single run

Measure          Group       Single run          Mean across first session
MSC              Controls    0.20 (0.00–0.72)    0.21 (0.00–0.68)
MSC              Patients    0.09 (0.00–0.71)    0.08 (0.00–0.66)
SVC (z > 3.1)    Controls    0.12 (0–1.08)       0.15 (0.02–1.37)
SVC (z > 3.1)    Patients    0.14 (0–2)          0.06 (0–0.70)
SVC (z > 2.3)    Controls    0.18 (0–2)          0.19 (0.04–1.42)
SVC (z > 2.3)    Patients    0.21 (0–2)          0.11 (0–0.87)

Data from the first session only was used with the first run chosen for single run comparisons. Median values are shown with ranges in parentheses. Abbreviations: NDS = normalised difference scores (see Results); MSC = maximum peak signal change; SVC = suprathreshold voxel count.

References

Baciu, M.V., Watson, J.M., Maccotta, L., McDermott, K.B., Buckner, R.L., Gilliam, F.G., Ojemann, J.G., 2005. Evaluating functional MRI procedures for assessing hemispheric language dominance in neurosurgical patients. Neuroradiology 47, 835–844.

Bartsch, A.J., Homola, G., Biller, A., Solymosi, L., Bendszus, M., 2006. Diagnostic functional MRI: Illustrated clinical applications and decision-making. J. Magn. Reson. Imaging 23, 921–932.

Behzadi, Y., Liu, T.T., 2006. Caffeine reduces the initial dip in the visual BOLD response at 3 T. Neuroimage 32, 9–15.

Bookheimer, S.Y., Strojwas, M.H., Cohen, M.S., Saunders, A.M., Pericak-Vance, M.A., Mazziotta, J.C., Small, G.W., 2000. Patterns of brain activation in people at risk for Alzheimer's disease. N. Engl. J. Med. 343, 450–456.

Casey, B.J., Cohen, J.D., O'Craven, K., Davidson, R.J., Irwin, W., Nelson, C.A., Noll, D.C., Hu, X., Lowe, M.J., Rosen, B.R., Truwitt, C.L., Turski, P.A., 1998. Reproducibility of fMRI results across four institutions using a spatial working memory task. Neuroimage 8, 249–261.

Chee, M.W., Choo, W.C., 2004. Functional imaging of working memory after 24 hr of total sleep deprivation. J. Neurosci. 24, 4560–4567.

Cohen, M.S., DuBois, R.M., 1999. Stability, repeatability, and the expression of signal magnitude in functional magnetic resonance imaging. J. Magn. Reson. Imaging 10, 33–40.

Corbetta, M., Miezin, F.M., Dobmeyer, S., Shulman, G.L., Petersen, S.E., 1991. Selective and divided attention during visual discriminations of shape, color, and speed: functional anatomy by positron emission tomography. J. Neurosci. 11, 2383–2402.

Donaldson, D.I., Buckner, R.L., 2003. Effective paradigm design. In: Jezzard, P., Matthews, P.M., Smith, S.M. (Eds.), Functional MRI, an Introduction to Methods. Oxford University Press, Oxford, pp. 177–197.

Friedman, L., Glover, G.H., 2006. Reducing interscanner variability of activation in a multicenter fMRI study: controlling for signal-to-fluctuation-noise-ratio (SFNR) differences. Neuroimage 33 (2), 471–481.

Havel, P., Braun, B., Rau, S., Tonn, J.C., Fesl, G., Bruckmann, H., Ilmberger, J., 2005. Reproducibility of activation in four motor paradigms. An fMRI study. J. Neurol. 253, 471–476.

Jenkinson, M., Smith, S., 2001. A global optimisation method for robust affine registration of brain images. Med. Image Anal. 5 (2), 143–156.

Jenkinson, M., Bannister, P., Brady, M., Smith, S., 2002. Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage 17 (2), 825–841.

Jezzard, P., Buxton, R.B., 2006. The clinical potential of functional magnetic resonance imaging. J. Magn. Reson. Imaging 23, 787–793.

Johansen-Berg, H., Christensen, V., Woolrich, M., Matthews, P.M., 2000. Attention to touch modulates activity in both primary and secondary somatosensory areas. Neuroreport 11 (6), 1237–1241.

Johansen-Berg, H., Dawes, H., Guy, C., Smith, S.M., Wade, D.T., Matthews, P.M., 2002. Correlation between motor improvements and altered fMRI activity after rehabilitative therapy. Brain 125 (Pt 12), 2731–2742.

Kurtzke, J.F., 1983. Rating neurologic impairment in multiple sclerosis: an expanded disability status scale (EDSS). Neurology 33, 1444–1452.

Lenth, R.V., 2007. Statistical power calculations. J. Anim. Sci. 85 (13 Suppl), E24–E29.

Liu, T.T., Behzadi, Y., Restom, K., Uludag, K., Lu, K., Buracas, G.T., Dubowitz, D.J., Buxton, R.B., 2004. Caffeine alters the temporal dynamics of the visual BOLD response. Neuroimage 23, 1402–1413.

Loubinoux, I., Carel, C., Alary, F., Boulanouar, K., Viallard, G., Manelfe, C., Rascol, O., Celsis, P., Chollet, F., 2001. Within-session and between-session reproducibility of cerebral sensorimotor activation: a test–retest effect evidenced with functional magnetic resonance imaging. J. Cereb. Blood Flow Metab. 21, 592–607.

Marshall, I., Simonotto, E., Deary, I.J., Maclullich, A., Ebmeier, K.P., Rose, E.J., Wardlaw, J.M., Goddard, N., Chappell, F.M., 2004. Repeatability of motor and working-memory tasks in healthy older volunteers: assessment at functional MR imaging. Radiology 233, 868–877.

Matthews, P.M., 2004. An update on neuroimaging of multiple sclerosis. Curr. Opin. Neurol. 17, 453–458.

Matthews, P.M., Honey, G.D., Bullmore, E.T., 2006. Applications of fMRI in translational medicine and clinical practice. Nat. Rev. Neurosci. 7 (9), 732–744.

McGonigle, D.J., Howseman, A.M., Athwal, B.S., Friston, K.J., Frackowiak, R.S., Holmes, A.P., 2000. Variability in fMRI: an examination of intersession differences. Neuroimage 11, 708–734.

Miki, A., Raz, J., van Erp, T.G., Liu, C.S., Haselgrove, J.C., Liu, G.T., 2000. Reproducibility of visual activation in functional MR imaging and effects of postprocessing. AJNR Am. J. Neuroradiol. 21, 910–915.

Oldfield, R.C., 1971. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113.

Rocca, M.A., Colombo, B., Falini, A., Ghezzi, A., Martinelli, V., Scotti, G., Comi, G., Filippi, M., 2005. Cortical adaptation in patients with MS: a cross-sectional functional MRI study of disease phenotypes. Lancet Neurol. 4, 618–626.

Smith, S.M., Beckmann, C.F., Ramnani, N., Woolrich, M.W., Bannister, P.R., Jenkinson, M., Matthews, P.M., McGonigle, D.J., 2005. Variability in fMRI: a re-examination of inter-session differences. Hum. Brain Mapp. 24, 248–257.

Suckling, J., Ohlssen, D., Andrew, C., Johnson, G., Williams, S.C., Graves, M., Chen, C.H., Spiegelhalter, D., Bullmore, E., 2007. Components of variance in a multicentre functional MRI study and implications for calculation of statistical power. Hum. Brain Mapp. [Epub ahead of print].

Tartaglia, M.C., Narayanan, S., Arnold, D.L., 2008. Mental fatigue alters the pattern and increases the volume of cerebral activation required for a motor task in multiple sclerosis patients with fatigue. Eur. J. Neurol. 15 (4), 413–419.

Tegeler, C., Strother, S.C., Anderson, J.R., Kim, S.G., 1999. Reproducibility of BOLD-based functional MRI obtained at 4 T. Hum. Brain Mapp. 7 (4), 267–283.

Thirion, B., Pinel, P., Meriaux, S., Roche, A., Dehaene, S., Poline, J.B., 2007. Analysis of a large fMRI cohort: statistical and methodological issues for group analyses. Neuroimage 35, 105–120.

Tjandra, T., Brooks, J.C., Figueiredo, P., Wise, R., Matthews, P.M., Tracey, I., 2005. Quantitative assessment of the reproducibility of functional activation measured with BOLD and MR perfusion imaging: implications for clinical trial design. Neuroimage 27, 393–401.

Uylings, H.B., Rajkowska, G., Sanz-Arigita, E., Amunts, K., Zilles, K., 2005. Consequences of large interindividual variability for human brain atlases: converging macroscopical imaging and microscopical neuroanatomy. Anat. Embryol. (Berl) 210 (5-6), 423–431.

Wegner, C., Filippi, M., Korteweg, T., Beckmann, C., Ciccarelli, O., De Stefano, N., Enzinger, C., Fazekas, F., Agosta, F., Gass, A., Hirsch, J., Johansen-Berg, H., Kappos, L., Barkhof, F., Polman, C., Mancini, L., Manfredonia, F., Marino, S., Miller, D.H., Montalban, X., Palace, J., Rocca, M., Ropele, S., Rovira, A., Smith, S., Thompson, A., Thornton, J., Yousry, T., Matthews, P.M., 2008. Relating functional changes during hand movement to clinical parameters in patients with multiple sclerosis in a multi-centre fMRI study. Eur. J. Neurol. 15 (2), 113–122.

Wise, R.G., Ide, K., Poulin, M.J., Tracey, I., 2004. Resting fluctuations in arterial carbon dioxide induce significant low frequency variations in BOLD signal. Neuroimage 21 (4), 1652–1664.

Woolrich, M.W., Ripley, B.D., Brady, M., Smith, S.M., 2001. Temporal autocorrelation in univariate linear modeling of FMRI data. Neuroimage 14 (6), 1370–1386.

Yousry, T.A., Schmid, U.D., Alkadhi, H., Schmidt, D., Peraud, A., Buettner, A., Winkler, P., 1997. Localization of the motor hand area to a knob on the precentral gyrus. A new landmark. Brain 120 (Pt 1), 141–157.

Zou, K.H., Greve, D.N., Wang, M., Pieper, S.D., Warfield, S.K., White, N.S., Manandhar, S., Brown, G.G., Vangel, M.G., Kikinis, R., Wells III, W.M., 2005. Reproducibility of functional MR imaging: preliminary results of prospective multi-institutional study performed by Biomedical Informatics Research Network. Radiology 237, 781–789.