
An Analysis of Early Numeracy Curriculum-Based Measurement: Examining the Role of Growth in Student Outcomes

Ben Clarke and Scott Baker, Pacific Institutes for Research, Eugene, Oregon

Keith Smolkowski, Oregon Research Institute, and Abacus Research, LLC, Eugene, Oregon

David J. Chard, Southern Methodist University, Dallas, Texas

Three important features to examine for measures used to systematically monitor progress over time are (a) technical features of static scores, (b) technical features of slope, and (c) instructional utility. The purpose of this study was to investigate the technical features of slope of four early numeracy curriculum-based measures administered to kindergarten students. Approximately 200 students were assessed at the beginning and end of the school year, and a subset of students (n = 55) was assessed three additional times during the year. Growth curve analysis was used to model growth over time. The contribution of slope to explaining variance on a criterion measure was examined for the curriculum-based measures that fit a linear growth pattern. Implications are discussed regarding monitoring progress in early mathematics.

Keywords: mathematics—assessment and instruction; screening; early numeracy; progress monitoring

Educators and the general public have been concerned about the low level of mathematics performance of American students in terms of national standards as well as in international comparisons (National Research Council, 2001) for a number of years. Problems in mathematics achievement are particularly troubling given the pervasive achievement gap faced by students from low-income and minority backgrounds (National Assessment of Educational Progress [NAEP], 2005). Concern about student achievement in mathematics and an increased recognition that mathematics skill will play an increasing role in life opportunities and outcomes led the National Research Council (2001) to proclaim, “All young Americans must learn to think mathematically, and they must think mathematically to learn” (p. 16).

One potential approach to improving math achievement is the delivery of early intervention services to students who are at risk in mathematics. Preventing academic difficulties through focused early instruction and responsive interventions geared to the needs of learners who struggle early is garnering increased attention in both general education and special education circles. Consistent findings demonstrating that remediating academic problems once they have emerged becomes increasingly difficult the longer the problems are unresolved have led to research efforts to identify critical variables that predict which students may be at risk for later academic difficulties (Lyon et al., 2001).

Key Principles of Early Intervention

There are a number of key principles that form the foundation of early intervention.


Authors’ Note: This research was supported by Grant No. R305K0420081, “Early Learning in Mathematics: A Prevention Approach,” funded by the U.S. Department of Education Institute of Education Sciences. The views expressed within this article are not necessarily those of the USDE. The authors wish to acknowledge the Springfield, Oregon, Public School District for their assistance with this research. Please address all correspondence regarding this article to Dr. Ben Clarke, Pacific Institutes for Research, 1600 Millrace Dr., Suite 111, Eugene, OR 97403; e-mail: [email protected].

First, early intervention requires tools that enable the effective screening of students. In early intervention, screening is designed to determine the level of risk a student faces for developing a problem in the future. Second, once the student is identified, the student receives some form of intervention or differentiated instruction of varying intensity based on the severity of need. Third, student response is monitored to determine the effectiveness of the intervention as it is delivered. If student progress is not sufficient, it may be necessary to implement further changes in instruction. The concept underlying additional instructional changes is to systematically increase the intensity of instruction until learning progress or growth, as determined by the regular monitoring of achievement, reaches a rate of acceleration that is considered sufficient for there to be a systematic reduction in the degree of risk the student faces for long-term problems.

Early Mathematics and Curriculum-Based Measurement

A number of researchers have begun to investigate measures of early mathematics that are constructed to meet the design parameters of curriculum-based measurement (CBM; Deno, 1989).

CBM has been used in education since the 1970s (Shinn, 1989) and represents one of the strongest groups of measures capable of identifying at-risk students and monitoring student progress. CBM measures are characterized by four features: (a) the measures have the necessary psychometric characteristics for reliable and valid measurement; (b) the measures are quick to administer and are thus feasible for regular use in school settings; (c) the measures have multiple alternate forms for frequent administration; and (d) the measures are sensitive to small changes in student performance, which are linked, at least in part, to learning in the area being assessed that extends beyond the specific skill represented by the measure. The link between growth on the CBM measure and greater understanding of the underlying domain has led to the description of CBM as General Outcome Measurement, or GOM. For example, Oral Reading Fluency (ORF) is considered a GOM measure because growth on ORF represents growth on the larger domain or general outcome of reading performance (Shinn, 1998).

CBM measures of mathematics for elementary school students typically require the completion of computational or conceptual problems from the student’s grade level. Because students in preschool, kindergarten, and to a lesser extent first grade are not yet completing formal mathematics problems, researchers have focused on other variables to investigate as potential CBM measures. Critical variables for study are often related to a student’s growing understanding of number, or number sense. Although multiple definitions of number sense have been offered and the construct is not fully articulated (Berch, 1998; Dehaene, 1997), Gersten and Chard (1999) provided a definition that targeted some important central features: “a child’s fluidity and flexibility with numbers, the sense of what numbers mean, and an ability to perform mental mathematics and to look at the world and make comparisons” (p. 20). Berch (2005) suggested children displayed aspects of number sense when they demonstrated the ability to make numerical magnitude comparisons, as well as more general abilities such as using numbers and quantitative methods to “communicate, process, and interpret information” (p. 334). He articulated 30 component skills in young children that could be hallmarks of number sense. Many of the skills he noted form the basis for measures designed by a group of researchers investigating potential CBM measures of early mathematics. Floyd, Hojnoski, and Key (2006) conducted a literature review and identified seven studies (including two by the authors of this article) that examined potential early mathematics CBM measures and noted that across the studies multiple researchers designed measures to assess some of the skills identified by Berch, such as oral counting, number identification, and counting objects.

Evaluating CBM Measures

A set of research principles should frame advanced analysis of CBM measures, including examining the relationship between growth on a particular measure and growth in the larger domain. Fuchs (2004) provided a set of guiding principles for the evaluation of CBM measures, involving three stages of CBM research with measures used to screen for risk and monitor progress. In Stage 1, researchers examine and evaluate the features of static scores (i.e., a score at one point in time) to investigate the potential for measures to be used as screeners. For example, researchers investigate how well a score on an experimental measure is related to scores on a criterion measure administered at the same time (i.e., concurrent validity) and at a later point in time (i.e., predictive validity). Predictive validity data from Stage 1 may provide evidence that the measure may be useful as a screener (i.e., identifying who is at risk). Also during Stage 1, the reliability of the experimental measure is investigated. Floyd et al. (2006) noted that the studies of early mathematics measures completed thus far dealt primarily with Stage 1 research. Descriptive statistics collected during Stage 1 can also be used to hypothesize that the measures might be capable of monitoring growth. This typically occurs by looking at mean scores over time (e.g., from fall to spring) or across ages (e.g., from 3 to 4). However, further evidence is needed to determine suitability for progress monitoring. Questions about the measure’s ability to be used in progress monitoring are investigated in Stage 2 and Stage 3 research.

In Stage 2, the technical features of slope are investigated. The major question is, Does an increase in slope, or increasing scores on the CBM measure, correspond with an increase in the student’s performance in the overall academic domain? At present, none of the early mathematics measures have demonstrated the capacity to be used in progress monitoring (Floyd et al., 2006). In Stage 3, the instructional utility of the measure is evaluated. The major question is, Does the use of the measure enable educators to improve instructional decision making and consequently result in an increase in student achievement? For example, if a student is not making sufficient progress on the measure, do instructional changes occur that result in an acceleration of the student’s rate of learning?

Early Numeracy—CBM

We have developed and investigated the technical features of a set of curriculum-based early numeracy measures for use in kindergarten and first grade over the past 4 years. The measures were designed to reflect critical components of students’ understanding of mathematics and number sense as they begin school, prior to the study of formal mathematics.

Clarke and Shinn (2004) first tested four measures—oral counting (OC), number identification (NI), quantity discrimination (QD), and missing number (MN)—with first-grade students. Data were collected at three time points (i.e., fall, winter, and spring); thus the research conducted was Stage 1 research and was primarily concerned with determining whether the measures could be used in screening. Each measure was timed for 1 minute. The OC measure required students to rote count from 1 as high as they could before making an error; the NI measure required students to identify randomly presented numerals between 1 and 20; the QD measure required students to identify the bigger number from a pair of numbers between 1 and 20; and the MN measure required students to identify a missing number from a sequence of three consecutive numbers in either the first, middle, or last position. Predictive validity correlations with the Number Knowledge Test (NKT) and the Woodcock-Johnson Applied Problems subtest (WJ-AP; Woodcock & Johnson, 1989) were examined for the fall-to-spring period. Interrater, alternate form, and test-retest reliability were all greater than .80.

Chard et al. (2005) and Clarke, Baker, Chard, Braun, and Otterstedt (2006) extended the initial work of Clarke and Shinn (2004) with a kindergarten sample and replicated the initial study with another first-grade sample. The kindergarten measures were modified to include only numbers between 1 and 10 in the QD, MN, and NI tasks. The criterion measures in the spring were the Stanford Early School Achievement Test (SESAT) and the NKT. Predictive validity correlations between the experimental measures, administered in the winter, and spring criterion measures ranged from .62 to .67 for the kindergarten sample. Predictive validity correlations for the experimental measures, administered in the winter, to the spring criterion measures ranged from .35 to .55 for the first-grade sample. Because data were collected at three points in time (i.e., fall, winter, and spring), the capacity of the measures to model growth over time was also assessed by examining raw scores at each time point. Average student performance at each measurement point increased over the course of the year, thus demonstrating initial potential to measure growth. However, it is unclear whether growth on the experimental measures was reliable or whether slope accounted for variance that could be reliably predicted on a comprehensive outcome measure. The research conducted in the Clarke and Shinn, Chard et al., and Clarke et al. studies falls under the category of Stage 1 research in the Fuchs (2004) taxonomy.

We consider an important distinction among the four CBM early numeracy measures to be the extent to which they measure mathematical knowledge. OC and NI function as gateway skills that enable a child to do mathematics. A parallel is the role of letter knowledge in beginning reading and corresponding measures that assess letter-naming fluency as an indicator of risk for reading problems. Although letter-naming fluency has value in predicting risk, it does not have as much strategic value as an instructional target as does a measure of phonological awareness. The mathematical demands or knowledge required by OC and NI may not be significant, and thus growth on these measures may not represent increased understanding of the larger domain of mathematical knowledge. In contrast, QD and MN assess critical components of a child’s number sense. A child has to apply his or her emerging knowledge of mathematical relationships to complete the task successfully.

QD involves making magnitude comparisons. Magnitude comparisons can be as simple as a 2-year-old recognizing that one set of objects has more than another set (Feigenson, Dehaene, & Spelke, 2004) or as complex as understanding that 54 + 14 is greater than 47 + 16. Griffin, Case, and Siegler (1994) found that magnitude comparisons or judgments about magnitude are often taught (informally but explicitly) in upper- and middle-income homes and rarely taught in low-income homes. Their research found that when given a magnitude comparison problem (i.e., answering “which number is bigger”), preschool students from high-socioeconomic-status (SES) homes answered correctly 96% of the time, compared to 18% of the time for children from low-SES homes.

In kindergarten and first grade, the ability to make magnitude comparisons involving numerals provides a critical link to effective and efficient counting strategies to solve problems. MN serves as a measure of strategic counting. For example, to solve the problem 7 + 2, a student with skill in magnitude comparison would identify 7 as the bigger number and then apply the “min” strategy and count on from the larger addend (7) by the smaller addend (2) to arrive at the correct answer. Applying the min strategy reflects both being able to make a magnitude comparison and strategic counting (i.e., counting on from the greater addend) to solve the problem. Students who do not yet have this ability would be forced to use less efficient and effective counting strategies (e.g., counting all) to solve the problem. Measures like QD and MN directly assess single skills, and they may also function as broader indicators of overall mathematics knowledge. Evidence for their ability to serve as indicators of broader mathematics knowledge was demonstrated by their correlations with criterion measures of math performance (Chard et al., 2005; Clarke & Shinn, 2004). However, it has not been demonstrated whether improvement on these early numeracy measures reflects broader improvements in general mathematics knowledge or number sense.
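As an illustration only (not drawn from the original article), the min strategy can be sketched as a two-step procedure, a magnitude comparison followed by counting on from the larger addend:

```python
def min_strategy_sum(a: int, b: int) -> int:
    """Model the 'min' counting strategy for single-digit addition."""
    larger, smaller = max(a, b), min(a, b)  # Step 1: magnitude comparison
    total = larger
    for _ in range(smaller):                # Step 2: count on by the smaller addend
        total += 1
    return total

# For 7 + 2, the child identifies 7 as bigger and counts "8, 9."
print(min_strategy_sum(7, 2))  # 9
```

A child relying on the less efficient counting-all strategy would instead count every unit from 1 up to 9.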

Purpose of Study

For each of the CBM early numeracy measures, our primary interest was investigating whether growth on the measure over time provided information about learning that was not obtainable from information derived from a static measure of performance. A static measure of performance is a measure of performance at a single point in time that is typically correlated with some outcome measure. For example, a correlation between a single-point-in-time screening measure in the fall and an outcome measure administered in the spring provides an estimate of the degree to which performance on the screener is able to predict performance on the outcome measure. If progress is monitored on the measure during the year, then a measurement of slope can be computed. Whether slope adds to prediction accuracy of the criterion or outcome measure, above and beyond information derived from the static screening performance score in the fall, is a central focus for this study.

In this study, the criterion measure was performance on the SESAT administered at the end of the year. To test these hypotheses about the association among measures, we first investigated whether data from the early numeracy measures fit a linear growth pattern. Although the data may fit more complicated growth patterns, such as quadratic growth, nonlinear models do not readily lend themselves to instructional interpretations. Thus, we first tested whether the change on the early numeracy measures fit a linear growth model. For measures that fit a linear growth model, we then examined what value slope added in predicting a comprehensive measure of math performance at the end of the kindergarten year.

Testing the unique contribution of the slope of each measure, over and above a static, intercept value, represents a robust test of the merit of the slope in the prediction of the criterion. This is true for a number of reasons. First, change on the predictor measure must occur if slope is going to make a unique contribution to predicting performance on the criterion measure. Second, this procedure requires that slope be important beyond merely noting its occurrence or that it occurs reliably over short periods of time. Third, slope must provide information about learning that goes beyond the information attained by the examination of performance at multiple single-point-in-time assessments.

Three types of predictors were used to estimate end-of-year performance on the SESAT. The predictor measures included two static measures of performance as well as the CBM measure of slope over the course of the year. The two static measures were (a) performance in the fall on the SESAT and (b) performance in the fall on the relevant CBM early numeracy measure.

Two formal research questions were posed. The first was as follows: After controlling for initial level of performance on the CBM measure in the fall, would slope over time on the CBM measure account for additional variance on the predicted SESAT score at posttest? On this question, we expected that slope on the two CBM early numeracy measures hypothesized to represent the broader domain of mathematical knowledge—QD and MN—would account for more variance on the SESAT at posttest than would OC and NI. Growth on QD and MN should represent a stronger degree of real improvement in the development of mathematics knowledge and understanding. Because OC and NI assess more specific and rote mathematical skills, it is reasonable to hypothesize that improvement over time on these assessment tasks might occur for reasons entirely unrelated to the development of math knowledge. Although we predicted these differences among the measures, we also expected that slope on all four experimental measures would provide reliable and meaningful information on which to predict performance on the SESAT at posttest.

The second question was, Given all three predictors—SESAT pretest, CBM level, and CBM slope—what combination of variables produced the best-fitting prediction of performance on the SESAT in the spring (end of year)? We expected SESAT performance in the fall would provide the highest zero-order correlations with SESAT performance in the spring and would be included in the best-fitting models. Because of the unique contribution of slope as an index of growth above and beyond level of performance, we also expected that slope on the CBM measure would be included in the best-fitting model. Because both CBM level of performance and the fall SESAT represent static measures of performance at the beginning of the year, we expected both of them would provide similar information. We did not predict that information provided by CBM level would add additional information above the information provided by the SESAT at pretest. Thus, we did not predict that CBM pretest would be included in the best-fitting models, and we predicted that a model with fall SESAT and CBM slope would fit the underlying data better than a model that includes SESAT, CBM level, and CBM slope.

Method

Participants

Participants in this study were 254 kindergarten students from 14 schools in a medium-size school district of approximately 11,000 students located in the Pacific Northwest. The percentage of students receiving free and reduced-price lunch in the district was 52%. Fourteen percent of students were minorities, and approximately 6% were English-language learners.

Participants in the study were part of a research project to examine the effects of a kindergarten mathematics program. Students were nested within classrooms, and classrooms were nested within schools. Schools were randomly assigned to condition. Because intervention effects are not the focus of this article, and because previous analyses failed to demonstrate differences between intervention conditions for early numeracy or other measures, we discuss analyses and results associated with the intervention only briefly and within the context of the growth models described below.

From the total sample of students, a subsample of 8 children per classroom was identified for additional assessments and targeted observations during instruction. These students were selected on the basis of their performance on the NKT at pretest. All children in each participating classroom were assessed on the NKT, and within each classroom students were ranked from high to low based on their raw scores. Equal groups of high-, moderate-, and low-performing students were formed within each classroom. For example, in a classroom of 24 students, the high, moderate, and low groups each consisted of 8 students. In the moderate and low groups, 4 students were selected randomly, as sketched below. These students were administered the four experimental measures three additional times between pretest and posttest. At the beginning of the study, 64 students were in the moderate group and 60 students were in the low group. Final sample sizes for the students who participated in the additional assessments were 56 in the moderate group and 55 in the low group. On the basis of the other pretest measures, students in the moderate group were significantly higher performing than students in the low group on the SESAT and experimental measures. Also, analysis of performance differences of initial and final samples indicated student attrition did not impact the results.
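A minimal sketch of this selection logic, assuming student records live in a pandas DataFrame with hypothetical column names (classroom, id, nkt); this is an illustration, not the authors’ procedure or code:

```python
import pandas as pd

def select_subsample(students: pd.DataFrame, per_group: int = 4,
                     seed: int = 0) -> pd.DataFrame:
    """Within each classroom, rank students by NKT pretest score, split into
    equal high/moderate/low thirds, and randomly sample `per_group` students
    from the moderate and low thirds."""
    chosen = []
    for _, cls in students.groupby("classroom"):
        ranked = cls.sort_values("nkt", ascending=False).reset_index(drop=True)
        third = len(ranked) // 3
        moderate = ranked.iloc[third:2 * third]   # middle third
        low = ranked.iloc[2 * third:]             # bottom third
        for group in (moderate, low):
            chosen.append(group.sample(n=min(per_group, len(group)),
                                       random_state=seed))
    return pd.concat(chosen, ignore_index=True)
```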

Measures

We administered four experimental measures, one criterion measure of math performance, and a measure used to classify students for instructional purposes. The experimental measures and the criterion measures were administered at the beginning and end of the 2005-2006 academic year to all students in the study. A subset of these students was also administered the four experimental measures at three additional time points during the year.

All experimental measures were 1-minute fluency measures. If a student struggled or hesitated to correctly complete an item for 3 seconds, the student was told to “try the next one.” Scores used for the final analysis were the number of correctly completed items. Validity information reported for the experimental measures below is from Clarke et al. (2006) for NI, QD, and MN and from Clarke and Shinn (2004) for OC.

Oral counting measure. The experimental OC measure required participants to count from 1 as high as they could. The score on the measure was the number of correct counts completed before an error was made or, if no error was made, the number of correct counts in 1 minute. Concurrent validity correlations ranged from .49 to .70, and predictive validity correlations ranged from .46 to .72 with first-grade students.

Number identification measure. The experimental NI measure required participants to orally identify numerals between 0 and 10. Participants were given two pages of randomly selected numerals formatted in an 8 by 7 grid (i.e., 56 numerals per page). Concurrent and predictive validity correlations for NI ranged from .45 to .65.

Quantity discrimination measure. The experimental QD measure required participants to name the larger of two visually presented numerals (one number was always numerically larger). Participants were given a sheet of paper with a grid of individual boxes. Each box included two randomly sampled numerals from 1 to 10. QD concurrent and predictive validity correlations ranged from .52 to .71.

Missing number measure. The experimental MN measure required students to name the missing numeral from a string of numerals between 0 and 10. Kindergarten students were given a sheet with 21 boxes on it. In the boxes were strings of three numerals with the first, middle, or last numeral of the string missing. The student was instructed to orally state the numeral that was missing. MN concurrent and predictive validity correlations ranged from .51 to .64.

Stanford Early School Achievement Test–Fourth Edition (SESAT-2). At posttest, kindergarten students were administered the criterion measure SESAT-2 (Harcourt Brace Educational Measurement, 1996b). The SESAT-2, the kindergarten version of the Stanford Achievement Test–Ninth Edition (SAT-9; Harcourt Brace Educational Measurement, 1996a), has been reported to be well aligned with the objectives of curricula that stress development of number sense and a conceptual understanding of arithmetic concepts. The SESAT-2 is a standardized achievement test with adequate and well-reported estimates of validity (r = .64) and reliability (r = .88). The kindergarten measure comprises one subtest that includes a range of skills. Coverage areas include vocabulary (e.g., more, less, most), counting, dividing a whole into parts, sequencing, and single-digit addition and subtraction. SESAT standard scores were used in the analyses.

Number Knowledge Test. The NKT (Okamoto & Case, 1996) was used to classify children in terms of their math skills at the beginning of the study. This measure has been used to chart children’s developmental profiles of numerical competency (Okamoto & Case, 1996) and to study the effect of math instruction on kindergarteners from low-SES families (Griffin, 1998). The NKT contains four levels, and students are required to obtain a minimum number of correct responses at one level to move to the next level. On Level 1, students are required to complete tasks such as counting chips and geometric shapes. Level 2 requires students to do tasks such as identifying bigger or smaller numbers from a pair, naming numbers, and solving simple addition and subtraction problems. Level 3 requires students to solve problems similar to those of Level 2, but with larger numbers. Level 3 also requires students to complete new items such as stating how many numbers are between a pair of numbers. Level 4 is a more difficult version of Level 3 and also adds new tasks such as telling which difference between two pairs of numbers is bigger or smaller. The NKT was used in our previous work (Baker, Gersten, Katz, Chard, & Clarke, 2002) and was found to be highly correlated with a published measure of mathematics achievement (i.e., SAT-9) with a kindergarten sample.

Data Collection

All data were collected by examiners with a background in early childhood assessment. Explicit training with guided practice was provided in the administration and scoring of the experimental and criterion measures. Data collectors were observed administering and scoring each measure, and appropriate feedback was provided. Feedback included what to do in cases where the student failed to supply an answer (i.e., wait 3 seconds and then ask the student to move to the next stimulus) or when the student skipped items (i.e., mark them as incorrect). All data collectors were required to obtain interrater reliability coefficients of .95 prior to collecting data with students. Follow-up trainings were conducted prior to each data collection period to ensure continued reliable data collection. The experimental measures and the NKT required individual administration. The SESAT was administered to groups of between 15 and 25 children.

Data Analysis

Data analyses included (a) descriptive statistics and correlations; (b) assessment of the fit of the data to linear slope models; and (c) prediction models, in which we predicted spring SESAT scores with intercept and slope from the early numeracy measures and the fall SESAT. In the next two sections we discuss methods for the assessment of linear slope models and prediction models.

Assessment of linear growth. Growth curve analyses were conducted to test whether the sample data from each early numeracy measure fit a linear growth pattern. For each measure, we fit the data to a growth model that nests individual assessments within students and estimates the initial value and the slope (see Note 1; Li, Duncan, McAuley, Harmer, & Smolkowski, 2000; Singer & Willett, 2003). We tested growth models and tests of condition within SAS PROC MIXED (SAS Institute, 2005). We estimated the proportion of variation in individual assessments accounted for by linear growth using a pseudo-R2 statistic (Singer & Willett, 2003). PROC MIXED models were estimated with restricted maximum likelihood.
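The article fit these models in SAS PROC MIXED; as a rough, hypothetical analogue (not the authors’ code), a linear growth model with a random intercept and slope per student can be specified with Python’s statsmodels, here assuming long-format data with made-up column names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per (student, assessment); assumed columns: student, time (0-4), score.
df = pd.read_csv("early_numeracy_long.csv")

# Fixed linear trend in time plus a random intercept and slope for each
# student, estimated with restricted maximum likelihood (REML).
model = smf.mixedlm("score ~ time", df, groups=df["student"], re_formula="~time")
result = model.fit(reml=True)
print(result.summary())
```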

Next, we fit the same models with Mplus (Muthén & Muthén, 1998–2004), a statistical software package built on a structural equation modeling framework that uses full maximum likelihood estimation. Mplus provides fit statistics that test how well the data for each measure fit a linear growth model. Model fit was tested using several fit statistics (Bollen & Long, 1993; Burnham & Anderson, 2002). We focused primarily on the comparative fit index (CFI; Bentler, 1990; Gerbing & Anderson, 1993; Marsh, 1995) and the Tucker-Lewis Index (TLI; Bollen, 1989; Tucker & Lewis, 1973). Hu and Bentler (1999) recommended “a cutoff value close to .95” (p. 27) for the CFI and TLI. They also recommended a value close to .08 for the standardized root mean square residual (SRMR).

We also report the root mean square error of approximation (RMSEA). RMSEA values below .05 have traditionally been recommended to indicate acceptable fit, but recent research has suggested more relaxed criteria and has criticized “rigid adherence to fixed target values” (Steiger, 2000, p. 151). Thus, we adopted .10 as our RMSEA target value for acceptable fit. Models that failed to explain most of the assessment variation or that significantly departed from these fit criteria were not considered for the prediction models.

The growth models included all cases with data at any time point. In this particular sample, approximately half of the participants provided data only at the initial and final assessment. Still others missed individual assessments. Maximum likelihood models make use of all available data from all participants and reduce bias (Laird, 1988; Nich & Carroll, 1997). Because students with the interim assessments were selected at random from the population of students, we have assumed that the missing data were missing at random, or ignorable (Little, 1995; Little & Rubin, 1987; Singer & Willett, 2003).

Prediction models. For measures whose data fit a linear trajectory, we constructed a set of models that predicted performance on the spring SESAT, our criterion. These models compared three predictors of the performance on the SESAT administered at the end of the year: the CBM intercept in fall, the CBM slope during the year, and the fall SESAT. Because we used the intercept and slope as predictors, these models were fit with Mplus (Muthén & Muthén, 1998–2004).

We are interested in models that compare the addition of slope as a predictor to models with static, pretest predictors. Akaike’s information criterion (AIC) provides an index of relative fit of the data among a set of competing models specified a priori (Akaike, 1974; Burnham & Anderson, 2002). The minimum AIC among the set of models tested identifies the most generalizable or replicable model (Burnham & Anderson, 2002; Myung, 2000). From the raw AIC value for each model, we computed a ΔAIC value by subtracting the AIC for the best-fitting model from the AIC for each other model. Thus, the best-fitting model has a ΔAIC of 0.0, and low ΔAIC values indicate greater support relative to the best-fitting model. Values of 2.0 or below indicate very competitive models, and values greater than 10.0 indicate little support compared to the best-fitting model (Burnham & Anderson, 2002). We used the small-sample, bias-adjusted version of the AIC called the AICc (Burnham & Anderson, 2002, p. 66).
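As a small worked sketch of this bookkeeping (the AICc adjustment is the standard Burnham and Anderson formula; the model values below are invented for illustration):

```python
def aicc(aic: float, k: int, n: int) -> float:
    """Small-sample bias-adjusted AIC: AICc = AIC + 2k(k + 1) / (n - k - 1),
    where k is the number of estimated parameters and n the sample size."""
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def delta_aicc(aicc_values: list[float]) -> list[float]:
    """Subtract the minimum AICc (the best-fitting model) from each value,
    so the best model scores 0.0 and larger deltas mean less support."""
    best = min(aicc_values)
    return [v - best for v in aicc_values]

# Hypothetical AICc values for four competing prediction models.
print(delta_aicc([1412.3, 1404.1, 1420.8, 1409.5]))  # approx. [8.2, 0.0, 16.7, 5.4]
```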

The AICc was used to compare different predictor sets. In testing the predictive value for each measure, one model included only the SESAT pretest; a second model included the early numeracy measure’s intercept and slope; a third model included the intercept, slope, and the fall SESAT; and a fourth tested the fall SESAT and the early numeracy measure’s slope. Models that tested the early numeracy measure intercept alone, or the intercept and fall SESAT, were not of substantial interest. Although we used the AICc to select the most generalizable model(s), we also report the usual R2 value, the proportion of variance accounted for in the criterion by the predictors. The largest R2 values may not be associated with the best-fitting model, as the AICc values choose the most reliable model overall, whereas the R2 value reports the proportion of variance in the criterion accounted for by predictors unadjusted for the reliability of the overall model. That is, models with high R2 values may have overfit the data and thus may not represent the most generalizable model.

Results

Performance on the pretest and posttest administrations of the SESAT and experimental measures and the pretest administration of the NKT are presented in Table 1. On the experimental measures, pretest performance is higher on the measures that require less math knowledge (OC and NI). On these measures, distributions are more normal at pretest than the distributions of QD and MN (i.e., fewer students score zero). At posttest, the distributions on all four experimental measures are more normal.

In Table 2, we present correlations between the experimental and SESAT measures at pretest and posttest. Correlation coefficients range from .50 to .64, indicating a moderate association between the measures. Of primary interest is the ability of the experimental measures to predict performance on the SESAT at posttest; these are the correlations between the experimental measures at pretest and the SESAT measure at posttest. They indicate the degree to which the experimental measures are able, on the surface, to function as screeners for math difficulties. The correlations are moderate in magnitude. On three of the four measures, the correlations between the experimental measures and the posttest SESAT were slightly higher for the posttest administration of the experimental measures than for the pretest administration, but the differences are very slight.

Assessment of Linear Growth

We attempted to fit the data from each of the early numeracy measures to a linear growth model. The results indicated that OC, NI, and MN did not fit a linear growth curve. The QD measure, however, was more promising.

The intercept and slope for OC explained 73% of the variation in the underlying data, using the pseudo-R2 statistic (Singer & Willett, 2003). Although there is no predefined, ideal value for this pseudo-R2 statistic, for comparison purposes Baker et al. (n.d.) showed that a growth model on ORF accounted for 95% of the underlying variation in individual assessments. The fit statistics for OC also fell well below the desired .95 level, with a CFI of .77 and a TLI of .83. The SRMR was .15, as was the RMSEA. We did not attempt to fit prediction models for OC.

The data from the NI measure also fit the growth models relatively poorly, with a CFI value of .80 and a TLI value of .86. The SRMR and RMSEA values were both .16, well above the criterion levels. The pseudo-R2 indicated that the intercept and slope accounted for 75% of the variation in the individual assessments.

The growth model for the MN measure accounted for 79% of the variation in underlying assessments. Within Mplus, the data did not fit the growth model acceptably. The CFI value was .84 and the TLI value was .89. The RMSEA estimate was .15 and the SRMR value was .11.

QD performed better than the other early numeracy measures. Measures of model fit approached the Hu and Bentler (1999) recommended .95 value, with a CFI value of .92 and a TLI value of .94. The growth model for QD also accounted for 83% of the variation in the individual measures. Although the RMSEA, at .11, was above the recommended .10 value, the SRMR value was .08, the value recommended by Hu and Bentler. The intercept for QD was estimated at 9.4, with a variance of 69.0, and the slope across the five assessments was 3.3, with a variance of 2.7. Thus, students began with a score of approximately 9 and improved by just over 3 correct responses at each assessment thereafter. The intercept and slope were uncorrelated (r = –.02).
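Restating the reported estimates as an equation, with t = 0, 1, ..., 4 indexing the five assessments, the average fitted trajectory is:

```latex
\widehat{\mathrm{QD}}_{t} = 9.4 + 3.3\,t, \qquad t = 0, 1, \ldots, 4,
```

with between-student variances of 69.0 for the intercept and 2.7 for the slope.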

Table 1
Descriptive Data on Complete Student Sample at Pretest and Posttest

                                             Pretest                Posttest
Measure                                   N      M      SD      N      M      SD
Stanford Early School Achievement
  Test (SESAT)                           248   20.63    7.28   221   27.48    7.60
Number Knowledge Test (NKT)              230    9.59    4.65     —      —       —
Experimental measures
  Oral counting                          230   22.46   17.59   222   54.38   25.12
  Number identification                  230   28.13   19.84   222   49.03   18.66
  Quantity discrimination                230    8.99    9.41   222   22.29   11.48
  Missing number                         230    3.85    4.86   222   11.11    6.89

Table 2
Correlations Between the Experimental and Stanford Early School Achievement Test (SESAT) Measures at Pretest and Posttest

Experimental measures          SESAT Pretest   SESAT Posttest
Pretest
  Oral counting                     .59              .55
  Number identification            .53              .58
  Missing number                   .60              .57
  Quantity discrimination          .62              .60
Posttest
  Oral counting                     .50              .55
  Number identification            .52              .61
  Missing number                   .58              .64
  Quantity discrimination          .53              .62


Prediction Model Results

QD was the only measure that fit well enough to use in prediction models. The best-fitting prediction model included all three predictors: SESAT pretest and the level and slope of performance on QD. This model had a ΔAICc of 0.0. This model produced an R2 of .67, meaning the predictors explained 67% of the variance in the spring SESAT. The standardized regression weights were .43 for the fall SESAT, .45 for the QD intercept, and .14 for the QD slope. All three predictors were statistically significant (p < .01). The QD intercept and slope remained uncorrelated (r = –.03). Thus, whereas the slope remained the weakest predictor, it nonetheless contributed to the prediction of the criterion.
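In equation form, the standardized weights reported above correspond to:

```latex
\widehat{\mathrm{SESAT}}_{\mathrm{spring}} =
  .43\,\mathrm{SESAT}_{\mathrm{fall}}
  + .45\,\mathrm{QD}_{\mathrm{intercept}}
  + .14\,\mathrm{QD}_{\mathrm{slope}},
\qquad R^{2} = .67,
```

where all variables are standardized.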

The other three models of interest did not fit competitively when compared to the best-fitting model. The model with only the SESAT pretest produced a ΔAICc of 50.6, the model that included only the QD intercept and slope received a ΔAICc of 16.7, and the model with the fall SESAT and QD slope scored a ΔAICc of 21.0. As these models were not competitive with the best-fitting model, parameter estimates for these models will not be discussed.

The best-fitting model for the QD CBM measure is presented in Figure 1. This model shows the five observed CBM QD assessments and the CBM intercept and slope (circles). The relations between the constructs of most interest are depicted in Figure 1: SESAT in the spring of kindergarten predicted by (a) the CBM QD intercept, (b) the CBM QD slope, and (c) SESAT in the fall of kindergarten. This portion of the model has an interpretation similar to a standard regression analysis. To evaluate the competition between predictors, we obtained standardized estimates of the regression coefficients and the variance explained in the spring SESAT from Mplus as the usual R2 value. The complete model also assumes correlations between the fall SESAT, CBM QD intercept, and CBM QD slope, denoted by curved lines (paths have been omitted to simplify the figure).

[Figure 1. Model Depicting Relations Between Quantity Discrimination (QD) Intercept, QD Slope, and Fall SESAT Predicting Performance on Spring SESAT. Note: SESAT = Stanford Early School Achievement Test; CFI = comparative fit index; TLI = Tucker-Lewis Index; SRMR = standardized root mean square residual; T = assessment time.]

Discussion

The increased interest in and implementation of instructional delivery models that incorporate progress monitoring will lead to investigations of measurement approaches that examine the technical features of measures purporting to measure learning or change over time. In school settings, measures that are efficient to administer, such as CBM, should be investigated on a number of dimensions. In this study, we examined a series of single-skill fluency measures in early numeracy that are associated with early proficiency skills in mathematics, as demonstrated by their correlations with criterion measures of math performance (Chard et al., 2005; Clarke & Shinn, 2004). Our goal was to examine whether change in performance on these 1-minute fluency-based measures over the course of the kindergarten year provided more accurate predictions of performance on a comprehensive measure of mathematics performance, above the prediction accuracy derived from single-point-in-time measures used to screen students at the beginning of the year.

Rate of growth over time, as an index of student learning, should be a central feature of evaluating the effectiveness of an instructional program. It is lack of growth that forms the cornerstone of response to intervention (RTI) approaches to identifying and diagnosing a learning disability. In school settings, measures used to monitor progress may represent important growth targets on their own, or they may represent correlates of growth or performance on other, more important outcome measures that are not capable of being administered on a regular basis. In reading, oral reading fluency represents an important skill in the development of successful reading (National Reading Panel, 2000), and it is associated with the development of other critical skills and knowledge in areas of reading such as reading comprehension. For the early numeracy measures investigated in this study, it is their association with performance on comprehensive measures of math performance that is important. Student growth on these measures is potentially important if it can be demonstrated that growth is related to the development of more comprehensive math skills and knowledge.

In other words, growth on these early numeracy measures needs to represent important learning of math knowledge and skill. If students grow on these measures during the kindergarten year, but that growth is not related to important learning in the mathematics domain, then conclusions about learning based on observed change in performance on the early numeracy measures will be misleading. The potential for misleading conclusions becomes particularly critical if inferences are made that less than acceptable rates of change may indicate the need to change an instructional program or, in cases of RTI special education models, the presence of a learning disability.

Findings from this study and previous studies provide evidence that each of the four early numeracy measures can help screen students for problems in early mathematics. Correlations between the experimental measures in the fall and criterion measures in the spring are in the moderate range and similar to the correlations for other early screening measures in mathematics (Floyd et al., 2006). Schools can efficiently administer these measures to all students in kindergarten and reliably identify students who will benefit from additional instructional support to achieve critical outcomes in mathematics. Conclusions about how well the measures function when they are also used to monitor the progress of students over time to determine the effectiveness of extra instructional support are less clear. Of the four experimental measures, only growth on QD accounted for additional variance on SESAT performance at the end of the year above the variance accounted for by the screening measures alone. We believe the standard used to determine whether slope was a significant contributor in explaining variance on the SESAT outcome measure was high. The screening measures included performance on the specific early numeracy measure and also included performance on the pretest administration of the SESAT. Thus, slope had to contribute variance above and beyond the variance accounted for by both of these measures.

Nonetheless, it is interesting to consider why QD, and to a lesser extent MN, functioned better as measures of change than OC and NI. As we noted in the introduction, our hypothesis is that QD and MN require more sophisticated math knowledge than OC or NI. The tasks required by these measures may serve as a more accurate proxy of a student’s increasing understanding of the underlying construct of mathematical knowledge. Although OC and NI are important skills for kindergarten students, how children change in acquiring performance skills on these measures, at least in kindergarten, appears to represent something other than the development of mathematics knowledge. On the other hand, growth on QD may represent, to some degree, important growth in the development and understanding of critical early mathematics content. Future research should attempt to further investigate the differences between these types of experimental measures and their relationship to mathematical understanding.

An important limitation of this study is the small sample size. Additional study in this area should be conducted with larger sample sizes to increase the confidence with which we can interpret findings. Perhaps the most serious limitation of the study is our a priori decision to analyze only measures whose growth fit a linear model. Only the QD measure fit this pattern, and thus only the QD measure was analyzed to see whether slope contributed to our prediction of the spring criterion measure. Future research should consider whether patterns of growth may represent reliable nonlinear curves. For example, examining the mean scores for NI over time shows the data potentially fitting a pattern of curvilinear growth (i.e., a sharp increase in mean score across the first three data points and then small increases across the last two data points). It may be that growth in OC, NI, and MN might contribute to predicting the outcome but do so in ways that are not captured by a linear growth model.

As the field of special education considers measures to evaluate instructional programs and to make special education eligibility decisions within an RTI framework, we encourage robust examinations of the measures used in the decision-making process. Our understanding of growth and progress monitoring measures must extend beyond a simple examination of whether scores on the measure increase over time. Despite the limitations of this study, we consider the process by which we examined our measures to provide an evaluation framework for designers and researchers of progress monitoring measures.

Note

1. For the assessment of linear growth and for the prediction models, we did not nest students within schools. First, we focus on measures of fit and not statistical tests. Second, ignoring nesting has a much smaller effect for models with only student-level variables. Murray (1998, 2001) has shown that for tests of condition, when schools are randomized but ignored in the analysis, the design effect (DEFF) is 1 + (m – 1) × ICCy, where m is the number of students per school and ICC is the intraclass correlation. When predicting a student-level outcome with student-level predictors, however, Scott and Holt (1982) have shown that the DEFF becomes 1 + (m – 1) × ICCx × ICCy. Thus, the design effect becomes negligible for small ICCs. In this study, ICC values were between 0 and .05 for all measures. With approximately 18 students per school, DEFF would be a maximum of 1.04, which will not influence conclusions.
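Plugging the worst case reported above into the Scott and Holt (1982) formula confirms the stated maximum:

```latex
\mathrm{DEFF} = 1 + (m - 1)\,\mathrm{ICC}_x\,\mathrm{ICC}_y
             = 1 + (18 - 1)(.05)(.05) = 1.0425 \approx 1.04.
```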

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Baker, S., Gersten, R., Katz, R., Chard, D., & Clarke, B. (2002). Preventing mathematics difficulties in young children: Focus on effective screening of early number sense delays (Technical Report No. 0305). Eugene, OR: Pacific Institutes for Research.

Baker, S., Smolkowski, K., Katz, R., Fien, H., Seeley, J., Kame’enui, E., et al. (n.d.). Reading fluency as a predictor of reading proficiency in low performing high poverty schools: Going to scale with reading first. Unpublished manuscript.

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246.

Berch, D. B. (1998, April 9–10). Mathematical cognition: From numerical thinking to mathematics education. Paper presented at the National Institute of Child Health and Human Development, Bethesda, MD.

Berch, D. B. (2005). Making sense of number sense: Implications for children with mathematical disabilities. Journal of Learning Disabilities, 38(4), 333–339.

Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Bollen, K. A., & Long, J. S. (Eds.). (1993). Testing structural equation models. Newbury Park, CA: Sage.

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.

Chard, D., Clarke, B., Baker, S., Otterstedt, J., Braun, D., & Katz, R. (2005). Using measures of number sense to screen for difficulties in mathematics: Preliminary findings. Assessment for Effective Intervention, 30(2), 3–14.

Clarke, B., Baker, S., Chard, D., Braun, D., & Otterstedt, J. (2006). Developing and validating measures of number sense to identify students at risk for mathematics disabilities. Unpublished manuscript.

Clarke, B., & Shinn, M. R. (2004). A preliminary investigation into the identification and development of early mathematics curriculum-based measurement. School Psychology Review, 33, 234–248.

Dehaene, S. (1997). The number sense: How the mind creates mathematics. New York: Oxford University Press.

Deno, S. L. (1989). Curriculum-based measurement and special education services: A fundamental and direct relationship. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 1–17). New York: Guilford.

Feigenson, L., Dehaene, S., & Spelke, E. (2004). Core systems of number. Trends in Cognitive Sciences, 8(7), 307–314.

Floyd, R. G., Hojnoski, R., & Key, J. (2006). Preliminary evidence of the technical adequacy of the preschool numeracy indicators. School Psychology Review, 35(4), 627–644.

Fuchs, L. S. (2004). The past, present, and future of curriculum-based measurement research. School Psychology Review, 33, 188–192.

Gerbing, D. W., & Anderson, J. C. (1993). Monte Carlo evaluations of goodness-of-fit indices for structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 40–65). Newbury Park, CA: Sage.

Gersten, R., & Chard, D. (1999). Number sense: Rethinking arithmetic instruction for students with mathematical disabilities. Journal of Special Education, 33(1), 18–28.

Griffin, S. (1998, April). Fostering the development of whole number sense. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Griffin, S., Case, R., & Siegler, R. S. (1994). Rightstart: Providing the central conceptual prerequisites for first learning of arithmetic to students at risk for school failure. In K. McGilly (Ed.), Classroom lessons: Integrating cognitive theory and classroom practice (pp. 24–49). Cambridge, MA: MIT Press.

Harcourt Brace Educational Measurement. (1996a). Stanford Achievement Test (SAT-9) (9th ed.). Orlando, FL: Harcourt, Inc.

Harcourt Brace Educational Measurement. (1996b). Stanford Early School Achievement Test (SESAT-2) (4th ed.). Orlando, FL: Harcourt, Inc.

Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55.

Laird, P. D. (1988). Learning from good and bad data. Boston: Kluwer Academic.

Li, F., Duncan, T. E., McAuley, E., Harmer, P., & Smolkowski, K. (2000). A didactic example of latent curve analysis applicable to the study of aging. Journal of Aging and Health, 12(3), 388–425.

Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90(431), 1112–1122.

Little, R. J. A., & Rubin, D. (1987). Statistical analysis with missing data. New York: Wiley.

Lyon, G. R., Fletcher, J. M., Shaywitz, S. E., Shaywitz, B. A., Torgesen, J. K., Wood, F. B., et al. (2001). Rethinking learning disabilities. In C. E. Finn, A. J. Rotherham, & C. R. Hokanson (Eds.), Rethinking special education for a new century (pp. 259–288). Washington, DC: Thomas B. Fordham Foundation and the Progressive Policy Institute.

Marsh, H. W. (1995). The Δ2 and χ2I2 fit indices for structural equation models: A brief note of clarification. Structural Equation Modeling, 2(3), 246–254.

Murray, D. M. (1998). Design and analysis of group-randomized trials. New York: Oxford University Press.

Murray, D. M. (2001). Statistical models appropriate for designs often used in group-randomized trials. Statistics in Medicine, 20(9/10), 1373–1386.

Muthén, L., & Muthén, B. (1998–2004). Mplus: User’s guide (3rd ed.). Los Angeles, CA: Muthén & Muthén.

Myung, I. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44(1), 190–205.

National Assessment of Educational Progress. (2005). The nation’s report card: Mathematics 2005 (Report No. NCES-2006-453). Washington, DC: U.S. Department of Education, National Center for Education Statistics.

National Reading Panel. (2000). Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction (NIH Publication No. 00-4769). Washington, DC: National Institute of Child Health and Human Development.

National Research Council. (2001). Adding it up: Helping children learn mathematics. Washington, DC: Mathematics Learning Study Committee.

Nich, C., & Carroll, K. (1997). Now you see it, now you don’t: A comparison of traditional versus random-effects regression models in the analysis of longitudinal follow-up data from a clinical trial. Journal of Consulting and Clinical Psychology, 65(2), 253–262.

Okamoto, Y., & Case, R. (1996). Exploring the microstructure of children’s central conceptual structures in the domain of number. In R. Case, Y. Okamoto, S. Griffin, R. S. Siegler, & D. P. Keating (Eds.), The role of central conceptual structures in the development of children’s thought: Monographs of the Society for Research in Child Development (pp. 27–58). Chicago: Society for Research in Child Development.

SAS Institute. (2005). SAS OnlineDoc® 9.1.3: SAS/STAT 9 user’s guide. Retrieved September 20, 2006, from http://9doc.sas.com/sasdoc/

Scott, A. J., & Holt, D. (1982). The effect of two-stage sampling on ordinary least squares methods. Journal of the American Statistical Association, 77(380), 848–854.

Shinn, M. R. (1989). Curriculum-based measurement: Assessing special children. New York: Guilford.

Shinn, M. R. (1998). Advanced applications of curriculum-based measurement. New York: Guilford.

Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.

Steiger, J. H. (2000). Point estimation, hypothesis testing, and interval estimation using the RMSEA: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7(2), 149–162.

Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38(1), 1–10.

Woodcock, R. W., & Johnson, M. B. (1989). Woodcock-Johnson tests of achievement (Rev. ed.). Allen, TX: DLM Teaching Resources.

Ben Clarke, Ph.D., is a research associate at Pacific Institutes for Research and deputy director of the Center on Instruction Mathematics. He conducts research in the areas of early mathematics assessment and instruction.

Scott Baker, Ph.D., is the director of Pacific Institutes for Research in Eugene, Oregon. He conducts research on a range of education topics, including instructional strategies for students at risk for math difficulties and students who are English-language learners.

Keith Smolkowski, Ph.D., is a research associate at Oregon Research Institute and a research methodologist at Abacus Research, LLC. He has extensive experience in the design, management, and data analysis of large school-based research studies. His research involves topics in education, prevention science, and public health.

David J. Chard, Ph.D., is the dean of the School of Education and Human Development at Southern Methodist University. His research and teaching focus on the accessibility of instruction for students experiencing academic difficulties in literacy and mathematics.