Parallel but not Equivalent (Alden Gross)


This article was downloaded by: [Johns Hopkins University]
On: 07 May 2012, At: 09:55
Publisher: Psychology Press
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Clinical and Experimental Neuropsychology
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/ncen20

Parallel but not equivalent: Challenges and solutions for repeated assessment of cognition over time
Alden L. Gross b c, Sharon K. Inouye b c, George W. Rebok a e, Jason Brandt a e, Paul K. Crane f, Jeanine M. Parisi a, Doug Tommet b, Karen Bandeen-Roche d, Michelle C. Carlson a, Richard N. Jones b c & for the Alzheimer's Disease Neuroimaging Initiative
a Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
b Institute for Aging Research, Hebrew SeniorLife, Boston, MA, USA
c Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
d Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
f Department of Internal Medicine, University of Washington, Seattle, WA, USA

Available online: 30 Apr 2012

To cite this article: Alden L. Gross, Sharon K. Inouye, George W. Rebok, Jason Brandt, Paul K. Crane, Jeanine M. Parisi, Doug Tommet, Karen Bandeen-Roche, Michelle C. Carlson, Richard N. Jones & for the Alzheimer's Disease Neuroimaging Initiative (2012): Parallel but not equivalent: Challenges and solutions for repeated assessment of cognition over time, Journal of Clinical and Experimental Neuropsychology, DOI:10.1080/13803395.2012.681628

To link to this article: http://dx.doi.org/10.1080/13803395.2012.681628


JOURNAL OF CLINICAL AND EXPERIMENTAL NEUROPSYCHOLOGY
2012, iFirst, 1–15

Parallel but not equivalent: Challenges and solutions for repeated assessment of cognition over time

Alden L. Gross²,³, Sharon K. Inouye²,³, George W. Rebok¹,⁵, Jason Brandt¹,⁵, Paul K. Crane⁶, Jeanine M. Parisi¹, Doug Tommet², Karen Bandeen-Roche⁴, Michelle C. Carlson¹, Richard N. Jones²,³, for the Alzheimer’s Disease Neuroimaging Initiative∗

¹Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
²Institute for Aging Research, Hebrew SeniorLife, Boston, MA, USA
³Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
⁴Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
⁵Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA
⁶Department of Internal Medicine, University of Washington, Seattle, WA, USA

Objective. Analyses of individual differences in change may be unintentionally biased when versions of a neuropsychological test used at different follow-ups are not of equivalent difficulty. This study’s objective was to compare mean, linear, and equipercentile equating methods and demonstrate their utility in longitudinal research. Study design and setting. The Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE, N = 1,401) study is a longitudinal randomized trial of cognitive training. The Alzheimer’s Disease Neuroimaging Initiative (ADNI, n = 819) is an observational cohort study. Nonequivalent alternate versions of the Auditory Verbal Learning Test (AVLT) were administered in both studies. Results. Using visual displays, raw and mean-equated AVLT scores in both studies showed obvious nonlinear trajectories in reference groups that should show minimal change, and poor equivalence over time (ps ≤ .001), and raw scores demonstrated poor fits in models of within-person change (root mean square errors of approximation, RMSEAs > 0.12). Linear and equipercentile equating produced more similar means in reference groups (ps ≥ .09) and performed better in growth models (RMSEAs < 0.05). Conclusion. Equipercentile equating is the preferred equating method because it accommodates tests more difficult than a reference test at different percentiles of performance and performs well in models of within-person trajectory. The method has broad applications in both clinical and research settings to enhance the ability to use nonequivalent test forms.

Keywords: Equating; Equipercentile; Neuropsychology; Longitudinal analysis; Alternate forms; Parallel forms.

∗Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wpcontent/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

Alden L. Gross was supported by a National Institutes of Health Translational Research in Aging fellowship (T32AG023480-07). Sharon K. Inouye holds the Milton and Shirley F. Levy Family Chair in Alzheimer’s Disease. This work was supported in part by Grant P01AG031720 (S.K.I.) from the National Institute on Aging. George W. Rebok is an investigator with Compact Disc Incorporated for the development of an electronic version of the ACTIVE memory intervention. He has received no financial support from them for ACTIVE. Jason Brandt receives royalty income from Psychological Assessment Resources, Inc., on sales of the Hopkins Verbal Learning Test–Revised. George W. Rebok’s and Jason Brandt’s relationships are managed by the Johns Hopkins University according to its established conflict of interest policies. The ACTIVE intervention trials are supported by grants from the National Institute on Aging and the National Institute of Nursing Research to Hebrew Senior Life (U01NR04507), Indiana University School of Medicine (U01NR04508), Johns Hopkins University (U01AG14260), New England Research Institutes (U01AG14282), Pennsylvania State University (U01AG14263), the University of Alabama at Birmingham (U01 AG14289), and the University of Florida (U01AG14276). Data collection and sharing for this project were funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI; National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Amorfix Life Sciences Ltd.; AstraZeneca; Bayer HealthCare; BioClinica, Inc.; Biogen Idec Inc.; Bristol-Myers Squibb Company; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; GE Healthcare; Innogenetics, N.V.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Medpace, Inc.; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Servier; Synarc Inc.; and Takeda Pharmaceutical Company. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by National Institutes of Health (NIH) Grants P30 AG010129 and K01 AG030514, and by the Dana Foundation.

Address correspondence to Alden L. Gross, Institute for Aging Research, Hebrew SeniorLife, 1200 Center Street, Boston, MA 02131, USA. (E-mail: [email protected]).

© 2012 Psychology Press, an imprint of the Taylor & Francis Group, an Informa business

http://www.psypress.com/jcen http://dx.doi.org/10.1080/13803395.2012.681628

INTRODUCTION

The identification of cognitive decline requires repeated assessment of the same individual over time. Alternate forms of a test are often used to minimize practice effects under the assumption that the forms are equivalent. Thus, the ability to equate alternate forms of neuropsychological tests is highly important in both clinical and research settings. Importantly, inferences about differences between groups, or change within persons over time, may be erroneous or biased when versions of a test are not equivalent.

Methods for test equating are particularly important for neuropsychological assessments in population-based research and clinical practice. Alternate versions of a test are considered equivalent if they produce the same mean and variance in a sample (Borsboom, 2005). Tests that are parallel but nonequivalent could be similar in content and length, but yield different scores for the same individual because they differ in item difficulty or administration characteristics. A relevant example is provided by word list-learning tests, for which use of alternate forms is common. Word lists in particular have an extensive history in Western psychology (Ebbinghaus, 1895/1964; Underwood, 1963). Different versions are intended to be equivalent, but often are not. One reason for nonequivalence is test length. For example, tests are often shortened due to time limitations or to reduce participant burden, as was done for the California Verbal Learning Test (CVLT; Mitrushina, Boone, Razani, & D’Elia, 2005), and equipercentile equating methods were used to equate a nine-item version with the original 16-item version. Other reasons for nonequivalence of forms are more complicated. In word list-learning tests, the word’s frequency of use in the English language (Anderson & Bower, 1972; Battig & Montague, 1969; Fuller, Gouvier, & Savage, 1997), number of syllables, serial position in the list, and imagery value (Paivio, 1968) all contribute to its difficulty. Fuller and colleagues (1997), for example, reported that two forms of the Auditory Verbal Learning Test (AVLT, described in the Method section), Lists B and C (see Lezak, Howieson, & Loring, 2004), are not of equivalent difficulty. In a systematic review, Hawkins and colleagues (Hawkins, Dean, & Pearlson, 2004) showed recall differences of about three words (8% difference) across alternate forms that were studied.

Test equating is an analytical approach to adjust nonequivalent tests. The goal of test equating is to define a transformation of a variable that returns the same cumulative probability plot as the other variable being compared. There are many ways to equate tests. Given relevant characteristics of two distributions, equating methods can be applied in almost any setting in which multiple scores for each person are on different metrics. Two tests that measure the same outcome are equivalent if they place individuals in the same relative position in a group (Livingston, 2004). Widely used methods include mean, linear, and equipercentile equating. These methods differ by how relative position is defined. In mean equating, relative position is defined by the absolute difference from the sample mean of a test, and each individual’s score is changed by the same amount to equate the sample mean to that of a reference test (Kolen & Brennan, 1995). In linear equating, relative position is defined in terms of standard deviations from the group mean. Linear equating is accomplished by adjusting scores from the new form to be within the same number of standard deviations of the mean of the original form. A formula is provided in the Method section. Equipercentile equating defines relative position by a score’s percentile rank in the group. It is accomplished by identifying scores on two measures with the same percentile rank and transforming the score on a new test to the corresponding score on the reference form with the same percentile rank.


The reason for administering alternate versions of a word list is to reduce retest or practice effects, but alternate forms do not completely correct for retest. Practice effects are attributable to general testing factors that arise from repeated exposure to the same task (e.g., learning to take tests) in addition to retention of particular test content (e.g., recall for specific words; Crawford, Stewart, & Moore, 1989). Verbal learning tests are a mainstay in memory research and assessment, but are particularly susceptible to practice effects under repeated administration, such as in longitudinal research and in clinical settings when a patient must be reevaluated over time. Alternate forms cannot account for general testing factors that contribute to practice effects and may actually introduce greater problems in examining within-person trajectories in longitudinal studies if forms are not equivalent.

The objectives of the present study were to compare three equating methods (mean, linear, and equipercentile equating) and demonstrate their utility in longitudinal research. To demonstrate the generalizability of the approaches, examples are presented from two large studies of older adults: a cognitive intervention study and an observational study of predictors of conversion to Alzheimer’s disease. Equating methods were contrasted using visual displays, tests of mean equivalence over time in reference groups, and with models of person-level growth.

METHOD

Study samples

Participants were drawn from two large-scale, multi-site cohorts of older adults: the Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) study and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). These studies were selected because both provide longitudinal data using word list-learning measures collected using similar methods. Although the data are similar, and nonequivalent forms pose similar challenges, the objectives of these studies differ considerably, which highlights the generalizability of equating approaches.

ACTIVE is a longitudinal randomized trial of cognitive training in cognitively intact, community-dwelling adults age 65 and older (Ball et al., 2002; Jobe et al., 2001; Willis et al., 2006). Participants (n = 2,802) were randomized to one of four intervention groups (memory, reasoning, speed of cognitive processing, and no-contact control) after a baseline assessment and were followed up immediately after training and after 1, 2, 3, 5, and 10 years. For parsimony, the present study used data from two of these groups, the memory-trained (n = 703) and no-contact control (n = 698) groups, which were collected at baseline before the intervention, immediately following training, and at the 1-, 2-, 3-, and 5-year follow-up assessments.

ADNI began in 2003 as a five-year observational cohort study of Alzheimer’s disease (AD), with the primary goal of assessing the extent to which serial magnetic resonance imaging, positron emission tomography, other biological markers, and cognitive tests can be used to predict progression to mild cognitive impairment (MCI) and AD. Further information is available at http://www.loni.ucla.edu/ADNI. The present study used baseline and 6-, 12-, 18-, 24-, and 36-month follow-up data for normal subjects (n = 229) and MCI (n = 397) and AD (n = 193) patients. MCI patients were assessed at all waves. Normal healthy controls were not followed at 18 months, and AD patients were not followed at the 18- or 36-month waves. Data from a 48-month wave were not included in the present study because data collection was still underway. Data, which are continuously updated, were downloaded for the present study on March 15, 2011.

Measures

The AVLT (Rey, 1964; Schmidt, 2004) was administered in both the ACTIVE and the ADNI studies. During administration of the AVLT, participants are read a list of 15 unrelated words and are asked to recall as many words as they can remember. The same list is repeated over five trials, followed by an interference trial with a new 15-word list, a short-delay free recall trial, and a long-delay free recall trial 30 min later. The present study used the sum of recall from the five immediate AVLT recall trials. The administration of the AVLT in ACTIVE differed from that in ADNI in two ways. First, the long-delay free recall trial was dropped due to time constraints. Second, the ACTIVE protocol modified the test for group administration by having participants write down responses instead of speaking them.

In ACTIVE, the original AVLT List A and the interference list B (Taylor, 1959) were used at the baseline and third annual visits, lists by Geffen and colleagues (Geffen, Butterworth, & Geffen, 1994) were used in the immediate posttraining and fifth annual visits, lists by Crawford and colleagues (1989) were used at the first annual visit, and lists described by Jones-Gotman and colleagues were used for the second annual visit (Lezak et al., 2004, p. 423). In ADNI, lists from Taylor (1959) were used at the baseline and 12- and 24-month visits, and lists by Crawford and colleagues (1989) were used during 6-, 18-, and 36-month visits. These visits are delineated in tables and figures with letters.

Statistical analyses

To account for differences in test difficulty, we conducted mean, linear, and equipercentile equating in ACTIVE and ADNI separately following similar procedures (Kolen & Brennan, 1995). We adapted weighted versions of each equating procedure to preserve aging, cohort, and group differences using a two-stage approach. In the first stage, we selected an equating sample from which to collect necessary characteristics of test distributions and derive the equating algorithm. The goal of this stage was to define a sample of participants at each follow-up visit whose memory ability was equivalent, such that any differences between visits could be attributed to form differences and not aging or group differences. In the second stage, we applied the equating algorithm to the full study sample in a way that preserved attrition, aging, cohort, and group differences but eliminated form differences. Equated scores were then compared visually using plots of mean recall over time and cumulative probability plots and statistically using tests of equivalence of means in reference groups as well as estimates of within-person change using latent growth models.

Stage 1: Defining the equating sample

An important assumption underlying any application of equating is that the populations producing responses on differently scaled tests at each time point must have the same underlying ability. To preserve differences in memory performance attributable to attrition, normal aging, and training status or diagnostic group, the equating sample was restricted in ACTIVE to control participants and in ADNI to MCI patients, respectively. Although the healthy control group in ADNI was the preferred reference group, the MCI group was used as the equating sample because by design ADNI assessed them at all waves, and they provided a better coverage of AVLT scores observed across both healthy control and AD groups. Although the ADNI MCI group served as the equating sample, we subsequently restandardized the entire ADNI sample to make the healthy control group the reference sample by subtracting a model-implied mean difference in equated performance at each study visit between healthy controls and other participants. Equated performance at each time point was estimated from a weighted mixed-effects model of test scores on indicators for time. Annotated code is available from the authors upon request.

To adjust for attrition over time, we used inverse propensity score weights in the equating sample to model the probability of dropout (Rosenbaum & Rubin, 1983). To preserve normal aging effects when the equating algorithm was applied to the full sample, we further restricted the equating sample to participant visits with ages that were common across all visits (ages 72 to 90 years in ACTIVE and 63 to 90 years in ADNI). Because ACTIVE and ADNI included long follow-up periods, we estimated analytic weights using a direct adjustment procedure for age to ensure the same age distribution at each study visit. This removes aging effects in the equating sample, but preserves aging in the full sample for individual differences analyses using equated test scores.
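The direct age adjustment described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `direct_age_weights`, the five-year age bands, and the toy ages are hypothetical, and the inverse propensity score weights for dropout are not covered.

```python
from collections import Counter


def direct_age_weights(ages_by_visit, standard_visit, band_width=5):
    """Weight each observation so every visit matches the age-band mix of one standard visit.

    Assumes (as in the text) that the sample was already restricted to ages
    observed at all visits, so each band of the standard visit occurs at every visit.
    """
    def band(age):
        return int(age // band_width) * band_width

    standard = Counter(band(a) for a in ages_by_visit[standard_visit])
    standard_n = sum(standard.values())

    weights = {}
    for visit, ages in ages_by_visit.items():
        counts = Counter(band(a) for a in ages)
        n = len(ages)
        # weight = (share of the band at the standard visit) / (share at this visit)
        weights[visit] = [
            (standard[band(a)] / standard_n) / (counts[band(a)] / n) for a in ages
        ]
    return weights


# Hypothetical ages: the follow-up sample is older than the baseline sample.
ages = {"baseline": [72, 73, 76, 78], "followup": [73, 76, 77, 79]}
weights = direct_age_weights(ages, standard_visit="baseline")
```

In this toy example the follow-up visit over-represents the 75 to 79 band, so those observations are down-weighted and the single younger observation is up-weighted until both visits share the baseline age mix.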

After procedures in this first stage, the only differences in memory performance between groups should be those attributable to form differences. These form differences are addressed to varying degrees with equating algorithms, key elements of which were carried over to be used on the full sample in the second stage.

Stage 2: Apply equating algorithms

Once equating samples were selected in ACTIVE and ADNI, equating algorithms were derived and applied to the full samples. These algorithms involve means at each study wave for mean equating (described earlier), means and standard deviations for linear equating, and test score percentiles for equipercentile equating. Linear equating was accomplished using the following formula:

$$\text{Test2}_{\text{adjusted}} = \frac{SD_{\text{Test1}}}{SD_{\text{Test2}}} \times \left(\text{Test2}_{i} - \overline{\text{Test2}}\right) + \overline{\text{Test1}}$$

Here, $\text{Test2}_{\text{adjusted}}$ is the linear-equated AVLT score for a follow-up Test2 visit for person $i$. The raw Test2 distribution has mean $\overline{\text{Test2}}$ and standard deviation $SD_{\text{Test2}}$. The baseline test has mean $\overline{\text{Test1}}$ and standard deviation $SD_{\text{Test1}}$. The means and standard deviations were identified in the equating sample but, as for all equating methods, were applied to all participants in the second stage of equating.
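As a worked illustration of the linear equating formula above, alongside the simpler mean equating shift, the following sketch may help. The function names are hypothetical, and the inputs are the raw ACTIVE control-group summaries from Table 1, used only as convenient numbers; the study derived its means and standard deviations from a weighted equating sample, so these results are not the published equated scores.

```python
def mean_equate(score, ref_mean, new_mean):
    """Shift a follow-up score by the difference between reference and follow-up means."""
    return score + (ref_mean - new_mean)


def linear_equate(score, ref_mean, ref_sd, new_mean, new_sd):
    """Place a follow-up score the same number of SDs from the reference mean."""
    return (ref_sd / new_sd) * (score - new_mean) + ref_mean


# Illustrative inputs only (raw Table 1 values for the ACTIVE control group):
# baseline mean 47.9, SD 11.0; immediate posttraining mean 46.1, SD 11.5.
REF_MEAN, REF_SD = 47.9, 11.0
NEW_MEAN, NEW_SD = 46.1, 11.5

at_mean = linear_equate(46.1, REF_MEAN, REF_SD, NEW_MEAN, NEW_SD)       # about 47.9
one_sd_above = linear_equate(57.6, REF_MEAN, REF_SD, NEW_MEAN, NEW_SD)  # about 58.9
shifted = mean_equate(44.0, REF_MEAN, NEW_MEAN)                         # about 45.8
```

A follow-up score at the follow-up mean maps onto the baseline mean, and a score one follow-up standard deviation above the mean maps to one baseline standard deviation above the baseline mean. Mean equating moves every score by the same 1.8 points regardless of where it sits in the distribution.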

The goal of equipercentile equating is to define a nonparametric transformation of one variable, in the present study a follow-up AVLT recall score, that returns the same cumulative probability plot as that in the baseline test. Test scores at follow-up visits were scaled to baseline scores with the same percentile rank. For example, a score of 47 on the baseline AVLT in ACTIVE had a percentile rank of 43.7%. The first annual AVLT score with that percentile ranking was 44, demonstrating that for this score range, the first annual test was more difficult than the baseline test by three words, assuming individuals did not truly decline after only 10 weeks. This example is oversimplified because the adaptation used for the present study does accommodate normal age-related decline. Additionally, our equipercentile equating algorithm used a log–linear function to smooth out equated score distributions (Albano, 2011).
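The percentile-rank matching described above can be sketched as follows. This is a simplified, unsmoothed version under stated assumptions: it uses a mid-percentile rank and linear interpolation, and it omits the log-linear smoothing, the weighting, and the age adaptation used in the study. The function names and the toy score lists are hypothetical; the toy data are constructed so that the follow-up form is three words harder, mirroring the 44-to-47 example.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sorted_scores, x):
    """Mid-percentile rank: percent of scores below x plus half the percent equal to x."""
    below = bisect_left(sorted_scores, x)
    at_or_below = bisect_right(sorted_scores, x)
    return 100.0 * (below + 0.5 * (at_or_below - below)) / len(sorted_scores)


def equipercentile_equate(x, followup_scores, baseline_scores):
    """Map follow-up score x to the baseline score with the same percentile rank."""
    followup_sorted = sorted(followup_scores)
    baseline_sorted = sorted(baseline_scores)
    target = percentile_rank(followup_sorted, x)

    # Percentile rank of every distinct baseline score (strictly increasing).
    points = sorted(set(baseline_sorted))
    ranks = [percentile_rank(baseline_sorted, s) for s in points]

    # Clamp outside the observed baseline range.
    if target <= ranks[0]:
        return float(points[0])
    if target >= ranks[-1]:
        return float(points[-1])

    # Interpolate linearly between the two bracketing baseline scores.
    hi = bisect_left(ranks, target)
    lo = hi - 1
    frac = (target - ranks[lo]) / (ranks[hi] - ranks[lo])
    return points[lo] + frac * (points[hi] - points[lo])


# Hypothetical toy data: every follow-up score is three words below baseline.
baseline = [40, 44, 47, 50, 53]
followup = [37, 41, 44, 47, 50]
equated = equipercentile_equate(44, followup, baseline)  # 47.0
```

Because the transformation is driven by ranks rather than by a single mean or standard deviation, it can apply a different adjustment at different points of the score distribution, which is the property the linear formula lacks.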

Evaluation of equating methods with visual displays

Line graphs showing mean AVLT scores over time, or time trend plots, and cumulative probability plots were constructed to compare equated scores. Equivalent tests should yield identical cumulative probability plots in the reference sample. To account for data missing at random conditional on indicators for time and group, estimated means from random-effects models were used in time trend plots. Cumulative probability plots show the cumulative proportion of the sample (y-axis) who recalled up to a given number of words on the AVLT (x-axis).

Evaluation of equating methods with tests of mean equivalence over time

The equivalence of test score means in reference groups (ACTIVE control, ADNI healthy controls) was tested for raw, mean, linear, and equipercentile equated scores using χ² tests for nested confirmatory factor analysis models. In the first of two models for each equating set, trial recall means at each study visit were constrained to be equal. In the second model, means were freely estimated. Twice the difference in log likelihood between the two models follows a chi-squared distribution. These tests are similar to repeated measures analysis of variance (ANOVA) that tests for differences in means over time, but they are less stringent because they do not make the assumption that variances around the means are equal at each visit.
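The nested-model comparison described above amounts to a likelihood ratio test, which can be sketched with the standard library alone. The log-likelihood values and the choice of 5 degrees of freedom (six visit means collapsed to one) are hypothetical illustrations, not the authors' model specification, and `chi2_sf` is a helper written here using the textbook closed forms for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df (closed form)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total


def likelihood_ratio_test(loglik_constrained, loglik_free, df):
    """Return the test statistic (twice the log-likelihood difference) and its p value."""
    stat = 2.0 * (loglik_free - loglik_constrained)
    return stat, chi2_sf(stat, df)


# Hypothetical log likelihoods for the equal-means and free-means models.
stat, p_value = likelihood_ratio_test(-10250.0, -10240.0, df=5)
```

A small p value indicates that constraining the visit means to be equal worsens fit, that is, the forms do not yield equivalent means in the reference group.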

Evaluation of equating methods with models of within-person longitudinal trajectories

We used multiple group latent growth models to model person-level changes in recall over time (McArdle & Bell, 2000; Muthén, 1997; Muthén & Curran, 1997). Latent factors represent initial or baseline status and trajectories of change over time. These parameters are formed from observed scores at each study visit. We fixed factor loading paths from the intercept to observed recall sum scores for each assessment at 1, and factor loadings from the latent slope to values corresponding to a linear trajectory in time. In ACTIVE, a second intercept factor was also included to accommodate immediate training gains between baseline and posttraining for trained participants (Bollen & Curran, 2006). We constrained its factor loadings to 0 at the baseline visit and to 1 at follow-up visits and its variance to 0.
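The loading pattern described above can be written out explicitly. In this sketch the visit times (in years, with 0.2 standing in for the immediate posttraining visit) and the factor means are hypothetical values chosen for illustration; the source states only that slope loadings follow a linear trajectory in time and that the second intercept is loaded 0 at baseline and 1 afterwards.

```python
def growth_loadings(times, second_intercept=False):
    """Build one row of fixed loadings per visit: [intercept, slope(, second intercept)]."""
    rows = []
    for k, t in enumerate(times):
        row = [1.0, float(t)]  # intercept loading fixed at 1, slope loading linear in time
        if second_intercept:
            row.append(0.0 if k == 0 else 1.0)  # 0 at baseline, 1 at every follow-up
        rows.append(row)
    return rows


def implied_means(loadings, factor_means):
    """Model-implied mean at each visit: loadings multiplied by the latent factor means."""
    return [sum(l * m for l, m in zip(row, factor_means)) for row in loadings]


# Hypothetical visit times in years and hypothetical factor means:
# baseline level 48 words, change of -0.5 words per year, training gain of 2 words.
visit_times = [0, 0.2, 1, 2, 3, 5]
loadings = growth_loadings(visit_times, second_intercept=True)
means = implied_means(loadings, [48.0, -0.5, 2.0])
```

With its variance fixed at 0, the second intercept acts as a constant shift applied to every trained participant after baseline rather than as a person-specific gain.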

Latent growth curve models and factor analyses were conducted using the Mplus (Version 6.11) software package (Muthén & Muthén, 1998–2010). The models accommodate data missing at random conditional on observed covariates (Donders, van der Heijden, Stijnen, & Moons, 2006). Models using different equating methods were compared using standard model fit statistics including the root mean square error of approximation (RMSEA; Steiger, 1989) and comparative fit index (CFI; Hu & Bentler, 1999). These fit statistics were of key importance because they provide a measure of how much the model-estimated baseline levels and trajectories fit to observed trajectories using each method of equating. An RMSEA less than 0.05 and CFI greater than 0.95 are considered indicators of excellent model fit (Hu & Bentler, 1999). Graphical displays and equating algorithms were generated using Stata 12.0 (StataCorp, 2011) and R software packages (R Development Core Team, 2009).
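For readers unfamiliar with the two fit indices, the sketch below uses their common single-group textbook definitions. These are assumptions for illustration: software packages differ in details (for example, using N rather than N − 1, or applying a multiple-group correction), so this is not necessarily the exact computation Mplus performed in the study, and all input values are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation, single-group textbook form."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    return 1.0 if d_baseline == 0 else 1.0 - d_model / d_baseline


# Hypothetical chi-squared values, degrees of freedom, and sample size.
fit_rmsea = rmsea(chi2=30.0, df=20, n=698)
fit_cfi = cfi(chi2_model=30.0, df_model=20, chi2_baseline=1500.0, df_baseline=15)
meets_cutoffs = fit_rmsea < 0.05 and fit_cfi > 0.95
```

Both indices are functions of how far the model chi-squared exceeds its degrees of freedom, so a form-related zigzag in means that a linear growth model cannot reproduce inflates the chi-squared and pushes RMSEA up and CFI down.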

RESULTS

Table 1 shows baseline characteristics and AVLTtest scores at each follow-up time for ACTIVE andADNI samples. ACTIVE participants were mostlywhite females aged 65–94 years and cognitivelyintact at baseline. On average, ADNI participantswere younger and more highly educated, and ahigher proportion of them were males than in theACTIVE sample.

Evaluation of equating methods with visualdisplays

Mean recall over time for ACTIVE control andmemory-trained participants under different equat-ing methods are plotted in each panel of Figure 1.Figure 2 provides similar information using theADNI MCI diagnostic group. In ACTIVE, plots

Dow

nloa

ded

by [

John

s H

opki

ns U

nive

rsity

] at

09:

55 0

7 M

ay 2

012

6 GROSS ET AL.

TABLE 1Descriptive characteristics of cohorts used in the present study

ACTIVE ADNI

Memory-trained Control Healthy controlMild cognitive

impairmentAlzheimer’s

diseaseVariable (n = 703) (n = 698) (n = 229) (n = 397) (n = 193)

Age [mean (SD)] 73.5 (6.0) 74.1 (6.1) 67.8 (11.8) 67.6 (11.7) 67.9 (12.6)Years of education [mean (SD)] 13.6 (2.7) 13.4 (2.7) 16.0 (2.9) 15.7 (3.0) 14.7 (3.1)Sex [n (% female)] 537 (76.0) 514 (74.1) 110 (48.0) 141 (35.5) 91 (47.2)Ethnicity [n (% white)] 521 (74.3) 500 (72.2) 210 (91.7) 371 (93.5) 181 (93.8)MMSE score [mean (SD)] 27.3 (2.1) 27.3 (2.0) 29.1 (1.0) 27.0 (1.8) 23.3 (2.1)

AVLT scoresFirst follow-up (baseline) 48.8 (10.6) 47.9 (11.0) 43.2 (9.3) 30.7 (9.0) 23.0 (7.8)Second follow-up 48.1 (10.9) 46.1 (11.5) 41.3 (10.2) 28.1 (9.3) 20.2 (7.9)Third follow-up 47.1 (10.7) 44.6 (10.8) 43.8 (10.4) 29.6 (10.3) 20.0 (8.1)Fourth follow-up 50.3 (10.3) 47.9 (11.2) — 27.7 (10.1) —Fifth follow-up 51.5 (11.0) 49.4 (11.5) 44.6 (10.6) 28.4 (11.6) 16.8 (8.9)Sixth follow-up 47.6 (11.4) 45.2 (12.0) 40.4 (10.3) 26.3 (11.6) —

Note. ACTIVE = Advanced Cognitive Training for Independent and Vital Elderly study. ADNI = Alzheimer’s Disease NeuroimagingInitiative. AVLT = Auditory Verbal Learning Test. MMSE = Mini-Mental State Examination. The first follow-up in ACTIVE andADNI was the baseline visit. The second follow-up visit was at immediate posttraining in ACTIVE and at six months in ADNI. Thethird, fourth, and fifth follow-up visits were at one, two, and three years after baseline in both ACTIVE and ADNI. The sixth follow-upvisit in ACTIVE was five years after initial training.

Figure 1. Parallel but nonequivalent forms: Plots of raw and equated Auditory Verbal Learning Test (AVLT) scores over time in Advanced Cognitive Training for Independent and Vital Elderly study (ACTIVE; N = 1,401). Time trend plots present means of AVLT scores by study visit in the ACTIVE control and memory-trained groups. Means in the time trend plots are adjusted for selective attrition using random-effects models that assume data are missing at random conditional on indicators for time and group. Letters correspond to AVLT list versions administered at a visit: In ACTIVE, the baseline and Year 3, and posttraining and Year 5, study visits used the same AVLT form.

Figure 2. Parallel but nonequivalent forms: Plots of raw and equated Auditory Verbal Learning Test (AVLT) scores over time in Alzheimer’s Disease Neuroimaging Initiative (ADNI) mild cognitive impairment (MCI) participants (N = 397). Time trend plots present means of AVLT scores by study visit in the ADNI MCI group. Means in the time trend plots are adjusted for selective attrition using random-effects models that assume data are missing at random conditional on indicators for time and group. Letters correspond to AVLT list versions administered at a visit: In ADNI, the baseline, 12-month, and 24-month visits used the same form, and the 6-month, 18-month, and 36-month visits used a different form.

of raw scores give the impression that both groups start at about the same level at the baseline visit, decline up to two years after training, and then recover inexplicably (Figure 1). In ADNI, the MCI group zigzags in performance at every other visit by approximately 0.3 standard deviations. AVLT test scores in Table 1 show that raw score trends in Figure 1 generalize to all intervention and diagnostic groups. The effect of nonequivalent forms is demonstrated more rigorously using cumulative probability plots in the Appendix (see legends for a detailed interpretation). They reveal different difficulty levels across the score distribution for different waves: Participants in both ACTIVE and ADNI performing at the 50th percentile on follow-up tests had systematically lower scores than participants at the 50th percentile of the baseline test. Scores at the lowest and highest performance percentiles were more comparable across visits.

Mean equating in ACTIVE produced more similar means, but overcompensated for differences in forms at the third and fifth annual visits (Figure 1). Mean equating in ADNI produced a plot of a declining trajectory, although some residual form differences remained at the 24-month visit (Figure 2). Cumulative probability plots for mean equating did not overlap as well as for other equating methods (Appendix).

Linear equating, like mean equating, revealed boosted performance at one year followed by decline in ACTIVE but also suggested improvement at the immediate posttraining visit and a plateau in performance after the third annual visit (Figure 1). Linear equating in ADNI nearly eliminated any indication of decline in the mean level of performance over time, even though such decline should be expected in a sample of MCI patients.

Equipercentile equating produced the smoothest trajectories in ACTIVE, with an expected pre–post training gain and age-related cognitive decline (Figure 1). In ADNI, the wave pattern over time was successfully removed while an average decline of 1.8 AVLT words through three years was still apparent (Figure 2).

Although the graphical displays demonstrate the superiority of equating methods in relation to raw scores, they also indicate residual imprecision of these methods: In ACTIVE, the second visit was a posttest assessment only 10 weeks after the first. The equating methods should theoretically produce equal means between the baseline and immediate posttest assessments because there was almost no attrition or aging in the control group. Mean equating does not because of small residual form differences that persist after equating. Linear and equipercentile equating produce means at baseline and immediate posttraining in ACTIVE very close to each other (Figure 1).

Cumulative probability plots are shown in the Appendix. Plots for equated scores demonstrate excellent overlap, especially for equipercentile equating in ACTIVE (Figure A1) and linear and equipercentile equating in ADNI (Figure A2). Thus, equipercentile equating visually demonstrates the best adjustment for learning effects and smooths out mean trajectories.
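The idea of judging overlap between cumulative probability curves can be illustrated numerically. The following is a minimal sketch, not the procedure used to draw the Appendix figures: the scores are hypothetical, the helper names `ecdf` and `max_cdf_gap` are introduced here for illustration, and summarizing overlap by the largest vertical gap between two empirical curves is an assumption of this sketch.

```python
from bisect import bisect_right


def ecdf(scores):
    """Return a function giving the share of scores at or below a value."""
    ordered = sorted(scores)
    return lambda value: bisect_right(ordered, value) / len(ordered)


def max_cdf_gap(scores_a, scores_b):
    """Largest vertical distance between two empirical cumulative curves."""
    cdf_a, cdf_b = ecdf(scores_a), ecdf(scores_b)
    grid = sorted(set(scores_a) | set(scores_b))
    return max(abs(cdf_a(v) - cdf_b(v)) for v in grid)


# Hypothetical scores: the follow-up form is harder in the middle of the range
# but comparable at the lowest and highest scores.
baseline = [20, 35, 45, 50, 55, 65, 75]
follow_up_raw = [20, 30, 38, 43, 48, 60, 75]
follow_up_equated = [20, 35, 45, 50, 55, 65, 75]  # idealized result of equating

gap_raw = max_cdf_gap(baseline, follow_up_raw)
gap_equated = max_cdf_gap(baseline, follow_up_equated)
```

A gap near zero corresponds to curves that lie on top of each other; in this toy example the raw follow-up scores leave a visible gap and the idealized equated scores leave none.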

Evaluation of equating methods with tests of mean equivalence over time

Negligible change over time was assumed in the ADNI healthy control and ACTIVE control groups. In ADNI, raw, χ²(4) = 14.8, p = .001, and mean equating, χ²(4) = 16.7, p < .001, produced significantly different means over time, but linear, χ²(5) = 6.8, p = .18, and equipercentile equating, χ²(4) = 7.6, p = .09, produced statistically equivalent means over time in the healthy control group. In the ACTIVE control group, there were significant differences in raw AVLT recall sum score means, χ²(5) = 45.3, p < .001, mean-equated means, χ²(5) = 58.2, p < .001, linear-equated means, χ²(5) = 63.6, p < .001, and equipercentile-equated means, χ²(5) = 31.9, p < .001.

By the fifth year visit, n = 749 of 1,401 (53%) participants were still in the study sample, and attrition did not differ by intervention group. Follow-up in ADNI after three years was higher (n = 591/819, 72%). Our propensity adjustment for sample attrition, which had a larger effect in the ACTIVE study, is likely responsible for the lack of equivalence of equipercentile-equated means, which, as shown in Figure 1, demonstrate a smooth declining trajectory.

Evaluation of equating methods with models of within-person longitudinal trajectories

Results from ACTIVE and ADNI are shown in Tables 2 and 3, respectively, for raw, mean, linear, and equipercentile-equated scores. Means and variances of growth parameters in the tables characterize level and variability in within-person trajectory of AVLT performance over time. An immediate contrast is in poor RMSEA and CFI model fits using raw AVLT scores, shown in Tables 2 and 3, that improve dramatically given any equating method but are best after equipercentile equating. The model with equipercentile-equated AVLT scores in ADNI fit perfectly with the data (RMSEA = 0.0; CFI = 1.0).

Besides fit statistics, three key substantive inferences change depending on the type of equating used. First, using raw scores in ACTIVE, the immediate training “boost” appears to be in the negative direction for memory-trained participants (pre–post change: −5.0 words) but flips to a positive direction after equating (Table 2). Second, in both ACTIVE and ADNI, annual memory decline in all groups, as indicated by slope means, is overestimated or underestimated to varying degrees using raw scores, while equated scores show less annual decline in ACTIVE control participants (Table 2) and ADNI MCI and AD participants (Table 3). A third substantive change in inferences is that the correlation between initial recall and aging trajectory is overestimated using raw scores relative to equated scores. Based on model fits and substantive knowledge of trajectories of cognitive aging, latent growth models suggest that differences in test difficulty are handled best with either linear or equipercentile equating.

DISCUSSION

The present study investigated different methods of equating AVLT word list versions in longitudinal aging research. We adapted accepted test equating methods using a novel approach to the study of longitudinal cognitive aging. These methods are broadly applicable to within- and between-group comparisons of test performance data in both research and clinical settings. Equipercentile equating uses observed percentiles of a distribution and is a more generalizable nonparametric transformation than linear equating, which assumes normally distributed variables whose distributions are fully characterized by a mean and standard deviation. Graphical displays clearly show that equipercentile equating accommodates tests that are more difficult than the reference test at different percentiles of performance, and models of within-person change show that it also satisfactorily adjusts for practice, or retest, effects. Importantly, an implicit assumption of mean, linear, or equipercentile equating is that the populations producing two sets of scores,

TABLE 2
AVLT growth parameters using different equating methods: Results from ACTIVE

| Parameter | Raw scores: Estimate (SE) | Mean equating: Estimate (SE) | Linear equating: Estimate (SE) | Equipercentile equating: Estimate (SE) |
|---|---|---|---|---|
| Means | | | | |
| Control group | | | | |
| Baseline (initial level) | 47.0 (0.4) | 48.1 (0.4) | 48.1 (0.4) | 47.9 (0.4) |
| Immediate pre–post training change | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Slope | −0.5 (0.1) | −0.9 (0.1) | −0.9 (0.1) | −0.7 (0.1) |
| Memory-trained group | | | | |
| Baseline (initial level) | 48.9 (0.4) | 48.9 (0.4) | 48.9 (0.4) | 48.9 (0.4) |
| Immediate pre–post training change | −5.0 (1.3) | 5.7 (1.3) | 5.9 (1.3) | 4.1 (1.2) |
| Slope | 0.0 (0.1) | −0.8 (0.1) | −0.8 (0.1) | −0.6 (0.1) |
| Group differences (memory trained – control) | | | | |
| Baseline (initial level) | 1.9 (0.6) | 0.8 (0.6) | 0.8 (0.6) | 0.9 (0.5) |
| Immediate pre–post training change | −5.0 (1.3) | 5.7 (1.3) | 5.9 (1.3) | 4.1 (1.2) |
| Slope | 0.4 (0.1) | 0.1 (0.1) | 0.1 (0.1) | 0.1 (0.1) |
| Variances | | | | |
| Baseline (initial level) | 91.5 (5.5) | 91.6 (5.4) | 93.3 (5.4) | 89.0 (5.3) |
| Immediate pre–post training change | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| Slope | 0.4 (0.2) | 0.6 (0.2) | 0.6 (0.2) | 0.5 (0.2) |
| Correlation (baseline, slope) | .2 (0.1) | .1 (0.1) | .1 (0.1) | .0 (0.1) |
| Model fit statistics | | | | |
| RMSEA | 0.148 | 0.051 | 0.058 | 0.023 |
| CFI | 0.889 | 0.987 | 0.983 | 0.997 |

Note. N = 1,401. ACTIVE = Advanced Cognitive Training for Independent and Vital Elderly study. AVLT = Auditory Verbal Learning Test. RMSEA = root mean square error of approximation. CFI = comparative fit index. Multiple group latent growth models in ACTIVE of raw AVLT recall and mean, linear, and equipercentile equated scores. Results suggest that equipercentile equated scores fit the hypothesized model best and that estimates of pre–post training change, long-term change, and correlation between initial level and slope are sensitive to the equating method used. Models assumed a linear function for time and accommodated nonlinear change in trajectory between baseline and posttraining in the memory-trained group. The slope terms reflect annual change in AVLT recall, in units of total words recalled. The pre–post change parameter’s mean in the control group and variance in both groups were fixed to 0 to identify the model.

whether they are the same people followed over time or two different groups, have the same underlying ability. Because this may not be a valid assumption for older adults followed for years, the present study described equating procedures that used age standardization to preserve aging effects, propensity weighting to adjust for attrition, and restriction to preserve group differences due to diagnostic and intervention group membership.
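For reference, the three transformations compared in this study can be written compactly in their standard textbook forms. The notation below is an assumption introduced for illustration rather than taken from the article: X denotes the form being equated, Y the reference form, and all moments and distribution functions are those of the equating sample. The display assumes the `amsmath` package.

```latex
% Assumed notation (not from the article):
%   x          score on the new form X
%   y^{*}      equated score on the scale of the reference form Y
%   \mu, \sigma   mean and standard deviation in the equating sample
%   F_X, F_Y   cumulative distribution functions of the two forms
\begin{align}
  \text{Mean equating:}           \quad & y^{*} = x - \mu_X + \mu_Y \\
  \text{Linear equating:}         \quad & y^{*} = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) \\
  \text{Equipercentile equating:} \quad & y^{*} = F_Y^{-1}\bigl(F_X(x)\bigr)
\end{align}
```

Written this way, mean equating is linear equating with the ratio of standard deviations fixed at 1, and linear equating coincides with equipercentile equating when the two score distributions have the same shape and differ only in mean and standard deviation, as is the case for two normal distributions.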

ACTIVE is the largest study of cognitive training among older adults to date, and ADNI is a $60 million public–private partnership that is being used to stimulate innovative methods for evaluating progression of AD in clinical trials. The roller coaster trajectory in ACTIVE and waves in ADNI are attributable to nonequivalent AVLT forms used at different study visits. The ACTIVE study cycled through four versions of the AVLT until repeating the baseline list at the third annual follow-up. ADNI cycled between two AVLT lists, which explains the wave-like pattern. These method artifacts may be present in other settings. Indeed, important form differences are seen for the Hopkins Verbal Learning Test (Brandt & Benedict, 2001) in the ACTIVE study (data not shown) and for the Alzheimer’s Disease Assessment Scale–Cognitive subscale (ADAS–Cog) word list-learning task in ADNI (Crane et al., 2012). Plots and statistics similar to those presented in this study can be produced using these measures. The reason these studies used alternate word lists was to reduce practice effects, but in doing so they introduced complications for making inferences about cognitive performance. All ACTIVE publications involving comparisons of within-person memory performance over time use equipercentile-equated scores (e.g., Gross & Rebok, 2011; Gross, Rebok, Unverzagt, Willis, & Brandt, 2011; Parisi et al., 2011). Aside from work in ACTIVE, we are not aware of equipercentile equating being used in longitudinal settings with cognitive performance data. We believe the field can benefit by being aware of and adopting these equating methods. To date, most published studies that have used longitudinal neuropsychological data from ADNI have not examined the AVLT from visits in which different AVLT forms were administered

TABLE 3
AVLT growth parameters using different equating methods: Results from ADNI

| Parameter | Raw scores: Estimate (SE) | Mean equating: Estimate (SE) | Linear equating: Estimate (SE) | Equipercentile equating: Estimate (SE) |
|---|---|---|---|---|
| Means | | | | |
| Healthy control group | | | | |
| Baseline (initial level) | 42.4 (0.6) | 44.5 (0.6) | 43.2 (0.6) | 43.3 (0.6) |
| Slope | 0.8 (0.3) | 0.0 (0.3) | 0.2 (0.2) | −0.2 (0.2) |
| Mild cognitive impairment (MCI) group | | | | |
| Baseline (initial level) | 30.2 (0.4) | 32.3 (0.4) | 30.7 (0.4) | 30.8 (0.4) |
| Slope | −1.4 (0.2) | −2.2 (0.2) | −0.6 (0.2) | −0.9 (0.2) |
| Alzheimer’s disease (AD) group | | | | |
| Baseline (initial level) | 22.6 (0.5) | 24.7 (0.5) | 23.0 (0.5) | 23.2 (0.5) |
| Slope | −3.2 (0.3) | −4.2 (0.3) | −1.4 (0.3) | −1.7 (0.3) |
| Group differences (MCI – AD) | | | | |
| Baseline (initial level) | 7.6 (0.7) | 7.6 (0.7) | 7.7 (0.7) | 7.7 (0.7) |
| Slope | 1.8 (0.4) | 1.9 (0.4) | 0.9 (0.3) | 0.8 (0.3) |
| Variances | | | | |
| Initial level | 59.4 (8.1) | 61.9 (7.6) | 65.9 (7.7) | 65.5 (8.0) |
| Slope | −0.7 (2.7) | 4.6 (1.4) | 2.9 (1.1) | 3.5 (1.1) |
| Correlation (baseline, slope) | .3 (.1) | .3 (.1) | −.1 (.1) | .0 (.1) |
| Model fit statistics | | | | |
| RMSEA | 0.12 | 0.07 | 0.06 | 0.00 |
| CFI | 0.95 | 0.98 | 0.99 | 1.00 |

Note. N = 819. ADNI = Alzheimer’s Disease Neuroimaging Initiative. AVLT = Auditory Verbal Learning Test. RMSEA = root mean square error of approximation. CFI = comparative fit index. Multiple group latent growth models in ADNI of raw AVLT recall and mean, linear, and equipercentile equated scores. Results suggest that equipercentile equated scores fit the hypothesized model best and that estimates of pre–post training change, long-term change, and correlation between initial level and slope are sensitive to the equating method used. Models assumed a linear function for time. The slope terms reflect annual change in AVLT recall, in units of total words recalled. Latent growth parameter variances were held constant over diagnostic group.

(e.g., Hinrichs, Singh, Xu, Johnson, & Alzheimers Disease Neuroimaging Initiative, 2011; Murphy et al., 2010; Petersen et al., 2010). In other studies using ADNI data, word lists are treated as components in composite measures (e.g., Beckett et al., 2010), but results of some studies are potentially susceptible to nonequivalent form differences (e.g., Carmichael et al., 2010; Okonkwo et al., 2011). Future work in ADNI should pay close attention to form differences on the AVLT and ADAS–Cog.

Equating methods are powerful tools, but their use comes with several caveats. First, measures that have different meanings should not be equated. For example, it is statistically possible to equate short-delay and long-delay recall trials, but the trials measure qualitatively different constructs. Relatedly, equating methods can equate test scores but do not address qualitative differences in behaviors, such as different strategies used on more difficult tests at different measurement occasions (Crawford et al., 1989; Light, 1991). A second limitation of equating is that populations that produce two sets of test scores must have the same underlying ability to be validly equated. This is an easy assumption to make when the same cognitively normal persons are being retested over time, but may not be achievable (or measurable) in all situations. The application of equating methods in the present study would have been fairly straightforward if we had assumed this. However, in studies with several years of longitudinal follow-up such as those in the present study, one can divide the equating task into two stages as we have done: identify a subset of observations as an equating sample in which underlying abilities can be assumed to be the same over time, then apply the equating algorithm derived in that sample to the full sample. A third limitation is that, in longitudinal settings, equating procedures assume that the magnitude of retest effects is exchangeable across groups. This assumption may be unreasonable when comparing patients with different clinical syndromes or diseases, such as delirium or amnesia. Fourth, a limitation specific to equipercentile equating is that the outcome should be continuously distributed and have enough range to reliably distinguish different quantiles. Applying equipercentile equating to individual AVLT trial recall scores, for example, would be more challenging. This is not a concern in linear equating, which presumes a normally distributed outcome. Another limitation of this study is that we assumed that the underlying trajectory of change in AVLT performance is in fact linear over time. We used this assumption in growth models to assess the different equating methods. Previous work in ACTIVE has demonstrated that memory follows a linear pace of change following the immediate posttraining visit (Gross & Rebok, 2011; Parisi et al., 2011). The assumption of linear change in cognitive function among older adults is commonly adopted in other studies of older adults (e.g., Proust, Jacqmin-Gadda, Taylor, Ganiayre, & Commenges, 2006). Nevertheless, because true change is a latent and unobserved phenomenon, whether the AVLT in ACTIVE and ADNI in fact shows linear decline over time is uncertain. A final potential limitation specific to the ACTIVE study is that modifications in test administration of the AVLT from standard clinical administration limit the generalizability of findings from these data to clinical settings. However, our purpose in the present study was to illustrate equating methods and not to make inferences about training effects on memory function in ACTIVE, which have been reported elsewhere (e.g., Gross & Rebok, 2011; Parisi et al., 2011; Willis et al., 2006).
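The two-stage strategy described above (derive the equating in a subsample whose ability is assumed stable, then apply it to everyone) can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's procedure: it uses linear equating for brevity, the records and the `stable` flag are hypothetical, the function name `fit_linear_equating` is introduced here, and the age standardization and propensity weighting used in the study are omitted.

```python
from statistics import mean, pstdev


def fit_linear_equating(new_scores, ref_scores):
    """Return (slope, intercept) mapping new-form scores onto the reference scale."""
    slope = pstdev(ref_scores) / pstdev(new_scores)
    intercept = mean(ref_scores) - slope * mean(new_scores)
    return slope, intercept


# Hypothetical records:
# (participant id, stable flag, reference-form score, new-form score).
records = [
    ("p1", True, 50, 44),
    ("p2", True, 40, 36),
    ("p3", True, 60, 52),
    ("p4", False, 45, 30),  # assumed to have truly declined between visits
    ("p5", False, 55, 35),  # assumed to have truly declined between visits
]

# Stage 1: derive the equating only in the subsample assumed to have
# unchanged underlying ability, so form difficulty is the only difference.
stable = [r for r in records if r[1]]
slope, intercept = fit_linear_equating(
    [r[3] for r in stable], [r[2] for r in stable]
)

# Stage 2: apply the same transformation to every participant's new-form score.
equated = {r[0]: slope * r[3] + intercept for r in records}
```

In this toy example the stable participants are mapped back to their reference-form scores, while the two participants assumed to have declined still score below their earlier level after equating. That is the purpose of the two-stage design: form differences are removed without erasing real change.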

Mean, linear, and equipercentile equating, based on classical test theory, are not the only equating methods. Item response theory (IRT) methods can be used if populations producing two sets of scores differ in the underlying ability being measured, but require some items in common between the tests to anchor the two groups with respect to each other (Livingston, 2004). Counterbalancing is a method of adjusting for form differences in the study design before analysis, but is useful only for making inferences about group differences and not within-person change (Cozby, 2009).

Although equipercentile equating proved to be ideal for the applications of the present study, the same procedure may not apply in all cases. Mean equating is intuitive and produces the same grand mean on two tests, but it does not change an individual’s absolute difference from the mean. Thus, mean equating can lead to impossible or improbable scores among some individuals; for example, if the means of the reference test and a second test are 60 and 50, respectively, and the maximum possible value is 100, then an individual scoring 95 on the second test will have a mean-equated score of 105. Similar to mean equating, a limitation of linear equating is that extreme scores on a new test may yield equated scores outside the possible range of values in the original test. This is not a concern in equipercentile equating. The principal advantage of equipercentile equating over linear equating is that it does not assume that the reference test is normally distributed, but there are cases in which that assumption is viable. The AVLT in ACTIVE and ADNI was approximately normally distributed, which explains the similarities in findings between linear and equipercentile equating.
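The out-of-range example above can be checked with a minimal sketch. The score lists are hypothetical and chosen so that the reference form has a mean of 60 and the second form a mean of 50; the function names are introduced here for illustration, and the equipercentile function is a crude version without the smoothing or interpolation that dedicated software would normally apply.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def mean_equate(x, new_scores, ref_scores):
    """Shift x by the difference between the reference and new-form means."""
    return x + (mean(ref_scores) - mean(new_scores))


def linear_equate(x, new_scores, ref_scores):
    """Match both the mean and the standard deviation of the reference form."""
    slope = pstdev(ref_scores) / pstdev(new_scores)
    return mean(ref_scores) + slope * (x - mean(new_scores))


def equipercentile_equate(x, new_scores, ref_scores):
    """Map x to the reference score that has the same percentile rank."""
    new_sorted = sorted(new_scores)
    ref_sorted = sorted(ref_scores)
    # Percentile rank of x on the new form (share of scores <= x).
    rank = bisect_right(new_sorted, x) / len(new_sorted)
    # Reference score at that rank; always an observed reference score.
    index = min(len(ref_sorted) - 1, max(0, round(rank * len(ref_sorted)) - 1))
    return ref_sorted[index]


# Hypothetical scores: reference form mean 60, new form mean 50, maximum 100.
ref_scores = [20, 40, 60, 80, 100]
new_scores = [5, 30, 50, 70, 95]

mean_equated = mean_equate(95, new_scores, ref_scores)
linear_equated = linear_equate(95, new_scores, ref_scores)
equipercentile_equated = equipercentile_equate(95, new_scores, ref_scores)
```

With these toy data, the mean-equated score is 105 and the linear-equated score is about 100.9, both above the maximum of 100, whereas the equipercentile-equated score is 100 because it can only take a value observed on the reference form.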

Clinically, a patient’s test scores can only be interpreted using appropriate reference norms, but normative values are unhelpful if the normative test scores come from a population different from the one from which the patient came. Tests shown to be equivalent in certain groups defined by education, sex, or age may not be equivalent in other subpopulations (Ivnik et al., 1990). For this reason, Schmidt (2004) reports AVLT word lists that produce similar scores for older adults, in addition to which lists produce similar scores to other lists. Equating techniques require data from cohorts of individuals, so it would not be possible to perform similar analyses for any particular person being evaluated clinically. Nevertheless, important differences in form difficulty should be kept in mind, and if different forms are used across time, this should be documented. Data from studies similar to the one presented here may be useful to assist the practitioner in understanding whether change has occurred and, if so, its likely direction and magnitude. Ignoring differences in difficulty across forms in clinical settings could lead to unnecessary confusion at least and incorrect conclusions or diagnoses at worst. Finally, it is important to acknowledge that equated data contribute to only a small part of the clinical picture. A clinician’s judgment of change will depend on multiple test results and findings, the clinical history, nonquantitative observations about the patient’s abilities (Lezak et al., 2004), and his or her expert judgment and prior experience (Mitrushina et al., 2005).

In conclusion, equating challenges are pervasive but often unrecognized in research studies and clinical practice. When prior knowledge about form equivalence is unavailable or unclear when planning a study, we recommend that researchers use the same form and apply established methods to control for practice effects (e.g., Ferrer, Salthouse, McArdle, Stewart, & Schwartz, 2005; Ferrer, Salthouse, Stewart, & Schwartz, 2004; Rabbitt, Diggle, Holland, & McInnes, 2004; Salthouse, 2010; Salthouse, Schroeder, & Ferrer, 2004; Salthouse & Tucker-Drob, 2008). Thorough data exploration is necessary both to recognize the need for equating and to understand the relative merits of different equating procedures. The replication of findings across two cohorts, utilizing special weighting adaptations, highlights the versatility and generalizability of the equating methods used in the present study.

The method of equipercentile equating may have broad applications in both clinical and research settings to enhance the ability to use nonequivalent test forms, to evaluate change over time, to quantify retest effects, and to align scores on different tests of the same construct (such as identifying cutpoints for dementia on cognitive screening tests). Equipercentile equating is a well-accepted tool for comparing psychiatric diagnostic instruments (Furukawa et al., 2009; Leucht et al., 2005; Montoya et al., 2011; Noonan et al., in press; Schennach-Wolff et al., 2010) and for identifying clinically relevant benchmarks and crosswalks on neuropsychological tests (Fong et al., 2009, 2011). The procedure represents a robust and innovative approach to better understanding of longitudinal changes over time. The present study demonstrated an innovative application of equating methods for longitudinal settings in which participants or patients are followed over long periods of time.

Original manuscript received 30 October 2011
Revised manuscript accepted 17 March 2012

First published online day month year

REFERENCES

Albano, A. (2011). equate: Statistical methods for test score equating. R package Version 1.1–2. [computer software] Retrieved December 19, 2011, from http://CRAN.R-project.org/package=equate

Alzheimer’s Disease Neuroimaging Initiative. Welcome to ADNI. Retrieved April 12, 2012 from http://adni.loni.ucla.edu/.

Anderson, J. R., & Bower, G. H. (1972). Recognition and retrieval processes in free recall. Psychological Review, 79, 97–123.

Ball, K., Berch, D. B., Helmers, K. F., Jobe, J. B., Leveck, M. D., Marsiske, M., et al. (2002). Effects of cognitive training interventions with older adults: A randomized controlled trial. Journal of the American Medical Association, 288, 2271–2281.

Battig, W. F., & Montague, W. E. (1969). Category norms for verbal items in 56 categories: A replication and extension of the Connecticut category norms [Monograph]. Journal of Experimental Psychology, 80, 1–46.

Beckett, L. A., Harvey, D. J., Gamst, A., Donohue, M., Kornak, J., Zhang, H., et al. (2010). The Alzheimer’s Disease Neuroimaging Initiative: Annual change in biomarkers and clinical outcomes. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association, 6, 257–264.

Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation approach. Hoboken, NJ: Wiley.

Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge, UK: Cambridge University Press.

Brandt, J., & Benedict, R. H. B. (2001). Hopkins Verbal Learning Test–Revised: Professional manual. Odessa, FL: Psychological Assessment Resources.

Carmichael, O., Schwarz, C., Drucker, D., Fletcher, E., Harvey, D., Beckett, L., et al. (2010). Longitudinal changes in white matter disease and cognition in the first year of ADNI. Archives of Neurology, 67, 1370–1378.

Cozby, P. C. (2009). Methods in behavioral research (10th ed.). New York, NY: McGraw-Hill.

Crane, P. K., Carle, A., Gibbons, L. E., Insel, P., Mackin, S., Gross, A. L., et al. (for the Alzheimer’s Disease Neuroimaging Initiative). (2012). Development and assessment of a psychometrically sophisticated composite score for memory in the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Manuscript submitted for publication.

Crawford, J. R., Stewart, L. E., & Moore, J. W. (1989). Demonstration of savings on the AVLT and development of a parallel form. Journal of Clinical and Experimental Neuropsychology, 11, 975–981.

Donders, A. R., van der Heijden, G. J., Stijnen, T., & Moons, K. G. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59, 1087–1091.

Ebbinghaus, H. (1964). Memory: A contribution to experimental psychology. New York, NY: Dover. (Original work published 1885)

Ferrer, E., Salthouse, T., McArdle, J., Stewart, W., & Schwartz, B. (2005). Multivariate modeling of age and retest in longitudinal studies of cognitive abilities. Psychology & Aging, 20(3), 412–422.

Ferrer, E., Salthouse, T. A., Stewart, W. F., & Schwartz, B. S. (2004). Modeling age and retest processes in longitudinal studies of cognitive abilities. Psychology & Aging, 19(2), 243–259.

Fong, T. G., Fearing, M. A., Jones, R. N., Shi, P., Marcantonio, E. R., Rudolph, J. L., et al. (2009). Telephone interview for cognitive status: Creating a crosswalk with the Mini-Mental State Examination. Alzheimers & Dementia, 5(6), 492–497.

Fong, T. G., Jones, R. N., Rudolph, J. L., Yang, F. M., Tommet, D., Habtemariam, D., et al. (2011). Development and validation of a brief cognitive assessment tool: The sweet 16. Archives of Internal Medicine, 171(5), 432–437.

Fuller, K. H., Gouvier, W. D., & Savage, R. M. (1997). Comparison of List B and List C of the Rey Auditory Verbal Learning Test. The Clinical Neuropsychologist, 11, 201–204.

Furukawa, T. A., Shear, M. K., Barlow, D. H., Gorman, J. M., Woods, S. W., Money, R., et al. (2009). Evidence-based guidelines for interpretation of the Panic Disorder Severity Scale. Depression and Anxiety, 26(10), 922–929.

Geffen, G. M., Butterworth, P., & Geffen, L. B. (1994). Test–retest reliability of a new form of the Auditory Verbal Learning Test (AVLT). Archives of Clinical Neuropsychology, 9, 303–316.

Gross, A. L., & Rebok, G. W. (2011). Memory training and strategy use among older adults: Results from the ACTIVE cognitive intervention trial. Psychology & Aging, 26, 503–517.

Gross, A. L., Rebok, G. W., Unverzagt, F. W., Willis, S. L., & Brandt, J. (2011). Cognitive predictors of everyday functioning in the elderly: Results from the ACTIVE cognitive intervention trial. Journal of Gerontology: Psychological Sciences, 66, 557–566.

Hawkins, K. A., Dean, D., & Pearlson, G. D. (2004). Alternative forms of the Rey Auditory Verbal Learning Test: A review. Behavioural Neurology, 15, 99–107.

Hinrichs, C., Singh, V., Xu, G., Johnson, S. C., & Alzheimers Disease Neuroimaging Initiative. (2011). Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage, 55, 574–589.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure analysis: Conventional versus new alternatives. Structural Equation Modeling, 6, 1–55.

Ivnik, R. J., Malec, J. F., Tangalos, E. G., Petersen, R. C., Kokmen, E., & Kurland, L. T. (1990). The Auditory Verbal Learning Test (AVLT): Norms for ages 55 and older. Psychological Assessment: A Journal of Consulting and Clinical Psychology, 2(3), 304–312.

Jobe, J. B., Smith, D. M., Ball, K., Tennstedt, S. L., Marsiske, M., Willis, S. L., et al. (2001). ACTIVE: A cognitive intervention trial to promote independence in older adults. Controlled Clinical Trials, 22, 453–479.

Kolen, M., & Brennan, R. (1995). Test equating: Methods and practices. New York, NY: Springer.

Leucht, S., Kane, J. M., Kissling, W., Hamann, J., Etschel, E., & Engel, R. R. (2005). What does the PANSS mean? Schizophrenia Research, 79(2–3), 231–238.

Lezak, M. D., Howieson, D. B., & Loring, D. W. (2004). Neuropsychological assessment (4th ed.). New York, NY: Oxford University Press.

Light, L. (1991). Memory and aging: Four hypothesesin search of data. Annual Review of Psychology, 42,333–376.

Livingston, S. A. (2004). Equating test scores(without IRT). Educational Testing Service.Retrieved from http://www.ets.org/Media/Research/pdf/LIVINGSTON.pdf

McArdle, J. J., & Bell, R. Q. (2000). Recent trends inmodeling longitudinal data by latent growth curvemethods. In T. D. Little, K. U. Schnabel, & J. Baumert(Eds.), Modeling longitudinal and multiple-group data:Practical issues, applied approaches, and scientificexamples (pp. 69–108). Mahwah, NJ: LawrenceErlbaum Associates.

Mitrushina, M., Boone, K. B., Razani, J., & D’Elia,L. F. (2005). Handbook of normative data forneuropsychological assessment (2nd ed.). New York,NY: Oxford University Press.

Montoya, A., Valladares, A., Lizán, L., San, L.,Escobar, R., & Paz, S. (2011). Validation of theExcited Component of the Positive and NegativeSyndrome Scale (PANSS-EC) in a naturalistic sampleof 278 patients with acute psychosis and agitation ina psychiatric emergency room. Health and Quality ofLife Outcomes, 9, 18.

Murphy, E. A., Holland, D., Donohue, M., McEvoy,L. K., Hagler, D. J. Jr., Dale, A. M., et al. (2010).Six-month atrophy in MTL structures is associatedwith subsequent memory decline in elderly controls.NeuroImage, 53, 1310–1317.

Muthén, B. O. (1997). Latent variable modeling with lon-gitudinal and multilevel data. In A. Raftery (Ed.),Sociological methodology (pp. 453–480). Boston, MA:Blackwell Publishers.

Muthén, B. O., & Curran, P. J. (1997). General longitudi-nal modeling of individual differences in experimentaldesigns: A latent variable framework for analysisand power estimation. Psychological Methods, 2,371–402.

Muthén, L. K., & Muthén, B. O. (1998–2010). Mplus user’s guide (6th ed., computer software). Los Angeles, CA: Muthén & Muthén.

Noonan, V. K., Cook, K. F., Bamer, A. M., Choi, S. W., Kim, J., & Amtmann, D. (in press). Measuring fatigue in persons with multiple sclerosis: Creating a crosswalk between the Modified Fatigue Impact Scale and the PROMIS Fatigue Short Form. Quality of Life Research. Advance online publication. doi:10.1007/s11136-011-0040-3

Okonkwo, O. C., Mielke, M. M., Griffith, H. R., Moghekar, A. R., O’Brien, R. J., Shaw, L. M., et al. (2011). Cerebrospinal fluid profiles and prospective course and outcome in patients with amnestic mild cognitive impairment. Archives of Neurology, 68, 113–119.

Paivio, A. (1968). A factor-analytic study of word attributes and verbal learning. Journal of Verbal Learning and Verbal Behavior, 7, 41–49.

Parisi, J. M., Gross, A. L., Rebok, G. W., Saczynski, J. S., Crowe, M., Cook, S. E., et al. (2011). Modeling change in memory performance and perceptions: Findings from the ACTIVE study. Psychology & Aging, 26, 518–524.

Petersen, R. C., Aisen, P. S., Beckett, L. A., Donohue, M. C., Gamst, A. C., Harvey, D. J., et al. (2010). Alzheimer’s Disease Neuroimaging Initiative (ADNI): Clinical characterization. Neurology, 74, 201–209.

Proust, C., Jacqmin-Gadda, H., Taylor, J. M., Ganiayre, J., & Commenges, D. (2006). A nonlinear model with latent process for cognitive evolution using multivariate longitudinal data. Biometrics, 62, 1014–1024.

R Development Core Team. (2009). R: A language and environment for statistical computing [computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org

Rabbitt, P., Diggle, P., Holland, F., & McInnes, L. (2004). Practice and drop-out effects during a 17-year longitudinal study of cognitive aging. Journal of Gerontology, Series B: Psychological & Social Sciences, 59(2), 84–97.

Rey, A. (1964). L’examen clinique en psychologie [The clinical examination in psychology]. Paris, France: Presses Universitaires de France.

Rosenbaum, P. R., & Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

Salthouse, T. A. (2010). Influence of age on practice effects in longitudinal neurocognitive change. Neuropsychology, 24, 563–572.

Salthouse, T., Schroeder, D., & Ferrer, E. (2004). Estimating retest effects in longitudinal assessments of cognitive functioning in adults between 18 and 60 years of age. Developmental Psychology, 40(5), 813–822.

Salthouse, T. A., & Tucker-Drob, E. M. (2008). Implications of short-term retest effects for the interpretation of longitudinal change. Neuropsychology, 22, 800–811.

Schennach-Wolff, R., Obermeier, M., Seemüller, F., Jäger, M., Schmauss, M., Laux, G., et al. (2010). Does clinical judgment of baseline severity and changes in psychopathology depend on the patient population? Results of a CGI and PANSS linking analysis in a naturalistic study. Journal of Clinical Psychopharmacology, 30, 726–731.

Schmidt, M. (2004). Rey Auditory and Verbal Learning Test: A handbook. Los Angeles, CA: Western Psychological Services.

StataCorp. (2011). Stata Statistical Software: Release 1 [Computer software]. College Station, TX: StataCorp LP.

Steiger, J. H. (1989). EZPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: Systat.

Taylor, E. M. (1959). The appraisal of children with cerebral deficits. Cambridge, MA: Harvard University Press.

Underwood, B. J. (1963). Coding processes in verbal learning. Journal of Verbal Learning and Verbal Behavior, 1, 250–257.

Willis, S. L., Tennstedt, S. L., Marsiske, M., Ball, K., Elias, J., Koepke, K. M., et al. (2006). Long-term effects of cognitive training on everyday functional outcomes in older adults. Journal of the American Medical Association, 296, 2805–2814.

APPENDIX

Cumulative probability plots

[Figure A1 image: four panels (Raw scores, Linear equating, Mean equating, Equipercentile equating), each plotting cumulative proportion (0 to 1) against recall score (0 to 80).]

Figure A1. Parallel but nonequivalent forms: Cumulative probability plots of raw and equated Auditory Verbal Learning Test (AVLT) scores in Advanced Cognitive Training for Independent and Vital Elderly (ACTIVE) study. Cumulative probability plots overlay distributions of raw or equated AVLT recall sum scores among control participants from each ACTIVE study visit (n = 698). Plots for each visit are not labeled clearly because the purpose of this diagnostic plot is to assess degree of overlap of each visit’s cumulative distribution. Results suggest that linear and equipercentile equating produce more overlap than other methods. As an example of how to interpret this plot, in the raw scores panel, the blue line shows that about 30% of ACTIVE control participants recalled up to 39 words in Year 1, and 100% of participants at all waves recalled 75 words or less (the test’s ceiling). To view a color version of this figure, please see the online issue of the Journal.
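The three equating methods named in the panel titles can be illustrated with a minimal sketch. The function names and the toy scores below are hypothetical, and the equipercentile mapping is a simple unsmoothed version chosen for brevity; it is not necessarily the procedure used to produce the figure.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def mean_equate(scores, reference):
    """Shift scores so that their mean matches the reference mean."""
    shift = mean(reference) - mean(scores)
    return [x + shift for x in scores]


def linear_equate(scores, reference):
    """Rescale scores so that their mean and SD match the reference."""
    m_x, s_x = mean(scores), pstdev(scores)
    m_r, s_r = mean(reference), pstdev(reference)
    if s_x == 0:
        raise ValueError("scores have zero variance; cannot rescale")
    return [m_r + (x - m_x) * s_r / s_x for x in scores]


def equipercentile_equate(scores, reference):
    """Map each score to the reference score at the same cumulative rank.

    Assumption: unsmoothed empirical distributions, no interpolation.
    """
    xs = sorted(scores)
    ref = sorted(reference)
    equated = []
    for x in scores:
        k = bisect_right(xs, x)                    # number of scores <= x
        idx = -(-k * len(ref) // len(xs)) - 1      # ceil(k * n_ref / n) - 1
        equated.append(ref[idx])
    return equated


# Hypothetical recall scores from a reference visit and a later visit.
baseline = [10, 20, 30, 40]
followup = [1, 2, 3, 4]
print(mean_equate(followup, baseline))
print(linear_equate(followup, baseline))
print(equipercentile_equate(followup, baseline))
```

Mean equating only shifts the distribution, linear equating also matches its spread, and equipercentile equating matches the whole cumulative distribution, which is why the latter two would be expected to show more overlap in a plot of this kind.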


[Figure A2 image: four panels (Raw scores, Linear equating, Mean equating, Equipercentile equating), each plotting cumulative proportion (0 to 1) against recall score (0 to 80).]

Figure A2. Parallel but nonequivalent forms: Cumulative probability plots of raw and equated Auditory Verbal Learning Test (AVLT) scores in Alzheimer’s Disease Neuroimaging Initiative (ADNI). Cumulative probability plots overlay distributions of raw or equated AVLT recall sum scores among mild cognitive impairment (MCI) patients from each ADNI study visit (n = 397). Plots for each visit are not labeled clearly because the purpose of this diagnostic plot is to assess degree of overlap of each visit’s cumulative distribution. Results suggest that equipercentile equating produces more overlap than other methods. As an example of how to interpret this plot, in the raw scores panel, the gray line shows that about 10% of ADNI MCI participants recalled up to 12 words at 36 months, and 100% of participants at all waves recalled 75 words or less (the test’s ceiling). To view a color version of this figure, please see the online issue of the Journal.
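Reading a point off a cumulative probability plot amounts to evaluating an empirical cumulative distribution function, and the degree of overlap between two visits can be summarized numerically. The sketch below is illustrative only: the scores are made up, the function names are hypothetical, and the maximum vertical gap between two curves is one possible summary of overlap, assumed here rather than taken from the article.

```python
from bisect import bisect_right


def ecdf(scores, x):
    """Proportion of scores less than or equal to x."""
    return bisect_right(sorted(scores), x) / len(scores)


def max_ecdf_gap(visit_a, visit_b):
    """Largest vertical distance between two visits' cumulative curves.

    0.0 means the curves coincide; 1.0 means they do not overlap at all.
    """
    points = sorted(set(visit_a) | set(visit_b))
    return max(abs(ecdf(visit_a, x) - ecdf(visit_b, x)) for x in points)


# Hypothetical recall sum scores at two visits.
visit_1 = [10, 12, 20, 30, 40, 41, 45, 50, 55, 60]
visit_2 = [14, 18, 25, 33, 42, 44, 49, 52, 58, 61]

# Proportion of visit 1 participants recalling up to 12 words.
print(ecdf(visit_1, 12))
# Smaller values indicate more overlap between the two visits' curves.
print(max_ecdf_gap(visit_1, visit_2))
```

Applied to each pair of visits before and after equating, a summary of this kind would shrink toward zero as the equated curves come to lie on top of one another.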
