Values and State Ratings: An Examination of the State-by-State Education Indicators in Quality Counts

Gregory Camilli William A. Firestone Rutgers, The State University of New Jersey

What are educational indicators? What are the strengths and weaknesses of using such indicators? What do we need to learn about educational indicators?

Statistics describing education have been collected for well over a century. For example, during the rise of statistical reasoning in France, Taillandier (1828) wrote that

The definitive result of these researches on the instruction of prisoners reveals that 67 out of 100 are able neither to read nor write. What stronger proof could there be that ignorance, like idleness, is the mother of all vices? (p. 612)

The historian, Theodore Porter (1986), noted that the general disposition of early writers “was to present their findings as direct and incontrovertible proof of the propositions they seemed to support” (p. 29). In contrast to their early use as isolated facts for arguing particular issues, educational indicators in America have been used in the latter half of the 20th century to describe and understand educational systems at the district, state, and national levels. We now have vast compilations of educational indicators available: for example, the Condition of Education (Snyder & Wirt, 1998); Growth in School: Achievement Gains From the Fourth to the Eighth Grade (Barton & Coley, 1998), concerning the National Assessment of Educational Progress; and Pursuing Excellence: A Study of U.S. Twelfth-Grade Mathematics and Science Achievement in International Context (Takahira, Gonzales, Frase, & Salganik, 1998), concerning the Third International Mathematics and Science Study. A lot of work has been done to improve the technical quality of such educational indicator data, but less attention has been given to making such indicators accessible and useful to various audiences. We have very little idea about how legislators, policymakers (and their staffs), school administrators, graduate students, and others use educational indicators.

One effort to make state indicator data more accessible to the public at large is through the annual Quality Counts series published by Editorial Projects in Education (“Quality Counts ‘98,” 1998). The core of Quality Counts ‘98 was a collection of 75 indicators describing the condition of education in the 50 states. These indicators were measures of state policies, demographic conditions, student achievement, and other aspects of the educational system in each state, organized into five broad categories: state standards, school organization, teaching and teacher preparation, financing education, and student achievement. The Quality Counts staff did several things to try to make this collection of information more attractive and accessible; one of the most important was combining sets of statistics into composites and using those to grade states in these areas.

In the summer of 1998, we were asked to provide technical assistance to the staff of Quality Counts between the releases of the 1998 and 1999 issues. An important issue that arose during our discussions with the staff was that value judgments were being made, sometimes unintentionally, partly because of limited knowledge about state indicators and partly because of the effort to make indicator data more accessible. Our expectation had been that indicators ought to be used primarily for descriptive, exploratory, or diagnostic purposes. We were surprised at the degree to which persuasive, or normative (following the definition of sociologists and political scientists), uses seemed to stem from those value judgments. These normative uses highlighted (for us) some unexpected problems with efforts to make indicator data accessible.

Gregory Camilli is a Professor at Rutgers, The State University of New Jersey, Graduate School of Education, 10 Seminary Place, New Brunswick, NJ 08901. His specializations are applied measurement and testing policy.

William A. Firestone is a Professor at Rutgers, The State University of New Jersey, Graduate School of Education, 10 Seminary Place, New Brunswick, NJ 08901. His specializations are educational policy implementation and sociology of education.


In the sections that follow, we first provide a brief introduction to educational indicators and their uses. Then we illustrate in depth some of the normative issues that arise both in choosing indicators to include in a compendium and in combining them into composites. Finally, we provide an example of an exploratory use of indicator data by examining correlations between indicators of student achievement and those in the four other areas included in Quality Counts ‘98. This example illustrates some of the difficulties in seeking explanations using state-level data. We end with a plea to consider the normative issues associated with standards development more carefully.

Indicators and Values

An educational indicator is a statistic that describes an aspect of an educational system. This is obviously a broad definition, which includes graduation and dropout rates, per-pupil expenditures, number of private schools, number of states with specific legal provisions, aggregate test scores, and size of the student population. What role do values play in selecting and developing indicators? According to Oakes (1986):

Not all statistics are indicators. To provide information about the health of a system, the indicator must have a reference point, some other measure or standard against which to judge the statistic. For example, the car’s temperature gauge has readings that represent a normal range for an engine. However, in a social system like education, setting the reference point is a more subjective undertaking. Usually, the reference point is some socially agreed-on standard, a past reading, or a comparison with another location. (p. 1)

In our opinion, this view is too narrow for two reasons. First, there are policy relevant statistics for which either no standard exists or at least two standards exist that differ substantively. Second, there are statistics for which even the direction of “goodness” is not consensual. For example, the number of students enrolled in voucher programs is viewed very differently by advocates of public schooling and by advocates of greater parental choice.

In addition to single statistics, indicators can take a composite form. For example, an indicator that describes the quality of a teaching force might include the number and level of earned degrees, years of experience, and number of hours of professional development (Oakes, 1986, p. 4). Typically, these statistics are weighted and summed to obtain the composite. Values arise in constructing composites in two ways: in determining which individual variables to include and in deciding how to weight them. Moreover, for all but the few who read methodological appendices, these judgments are often hidden when composites are used.
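As a minimal sketch of the weighted-sum construction just described, the fragment below builds a hypothetical teacher-quality composite; the indicator values, the 0-1 rescaling, and the weights are our own illustrative assumptions, not figures from Quality Counts or Oakes.

```python
# Hypothetical teacher-quality composite: weight each indicator, then sum.
# Each statistic is assumed to have been rescaled to a common 0-1 range beforehand.
teacher_indicators = {
    "pct_with_advanced_degree": 0.45,  # share of teachers with a master's or higher
    "mean_years_experience": 0.60,     # rescaled to 0-1
    "mean_pd_hours": 0.30,             # professional development hours, rescaled to 0-1
}
weights = {
    "pct_with_advanced_degree": 0.4,
    "mean_years_experience": 0.4,
    "mean_pd_hours": 0.2,
}

composite = sum(weights[name] * value for name, value in teacher_indicators.items())
print(f"teacher-quality composite: {composite:.2f}")  # 0.48 with these made-up numbers
```

Both the choice of the three indicators and the 0.4/0.4/0.2 weights are value judgments of exactly the kind described above.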

Uses for Indicators

Indicators can be used for various purposes. The simplest use is descriptive, and a compendium of indicators might be published for this purpose. We can assume that the readers will have different reasons to refer to a compilation of indicators, but the publisher’s aim is to provide broadly useful information to an audience with diverse needs in comparing or profiling educational systems. A second use is to monitor an educational system. According to Affholter (1994):

Outcome monitoring requires the routine measurement and reporting of important indicators of (or related to) outcome-oriented results. . . . in ways that stakeholders can use to understand and judge those results. The indicators measured should have some validity, some meaning that is closely tied to performance expectations. The ways in which they are reported should also have utility, that is, they must be easily interpreted and focus attention on key points. (pp. 96-97)

A third use is for diagnosing problems in an educational system. This use goes beyond monitoring, which can be thought of as a kind of signaling process. For diagnosis, additional information must be carried by the indicator (or indicators) that can facilitate intervention strategies; that is, the indicator must in part reflect the process by which the outcomes are created. For example, a low district average in fourth grade mathematics achievement suggests a problem, but not a solution. A diagnostic indicator, in this case, might be average class time devoted to math problem solving, which could help pinpoint where the problem is.

A fourth use of indicators is for accountability. In this case, either absolute or relative targets might be set for an educational unit and be combined with rewards and sanctions for a given performance. For example, in the case of absolute standards, a district may set an 85% pass rate for first-time takers of a state assessment; alternatively, with improvement standards, a gain may be required of at least 10% of the gap between last year’s pass rate and the 85% standard. This use is closely connected to outcome monitoring, but it is distinct from diagnostic use.
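To make the improvement-standard arithmetic concrete, suppose a district passed 60% of first-time takers last year (the 60% starting point is hypothetical; the 85% target and the 10%-of-the-gap rule come from the example above):

```latex
\[
\text{required gain} \;\ge\; 0.10 \times (85\% - 60\%) \;=\; 2.5 \text{ percentage points},
\qquad \text{so the target pass rate is } 62.5\%.
\]
```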

Some advocates for indicator systems would like to use indicators in a more explanatory or predictive way. In this case, indicators are often conceptualized within a system or model, either as an ultimately desired outcome, such as student achievement, or as a factor known or believed to predict that outcome. According to Oakes (1986),

An indicator system is more than just a collection of indicator statistics about a complex phenomenon, and it differs from a composite indicator that combines information to provide better understanding about one important aspect of that phenomenon. Ideally, a system of indicators measures distinct components of the system of interest but also provides information about how the individual components work together to provide the overall effect. (p. 7)

The presence of an underlying model facilitates the diagnostic and monitoring use of indicators. In both cases, the model helps the user know which indicators to track. Though it is possible to construct an indicator system for the sole purpose of monitoring, such systems are often intended to provide diagnostic information. Thus, indicator systems are likely to contain implicit or explicit assumptions about causal processes in an educational system.


Some of the work with large-scale data points to another use of indicator data: exploration to understand how an educational system works, which in turn facilitates development of a more effective indicator system. In fact, one method of creating an indicator system is first to propose a causal model of an educational system and then use this model as a guide for selecting relevant statistics. A model in this sense is constituted by the variables necessary for explaining a desired outcome as well as the nature of the relationship between these variables. This can be conceptualized as a path analysis model or structural equation diagram. An indicator’s role in such a model is established when there is consensus that it is an important outcome or when its contribution to explaining change in an outcome is demonstrated. Though there is a workable consensus on some aspects of modeling educational outcomes (i.e., variables and links between variables), a more comprehensive model with substantial empirical support is required for the predictive use of indicators.

These uses all have value implications. With descriptive uses, the publisher of the indicator is essentially leaving the value judgment to the user. Both monitoring and (especially) accountability assume some consensus on what information is important. Problem solving with such indicators involves values because a problem is often described as a discrepancy between the current situation and some preferred state. Similarly, explanatory and predictive applications focus on efforts to explain how to create some desired state, such as high achievement. Thus, indicators also have a normative use.

Here, the adjective normative refers to the existence of social norms (Merton, 1968), rather than the standard psychometric connotation. Norms are rules of conduct held by a group or society and typically grounded in a broader set of values. When such norms and values can be quantified, indicators can be used to assess how “good” entities are on the measured dimensions reflecting those norms. The implication of normative use of indicators (that is, to measure the presence of desired states) is quite different when there is consensus on values and norms and when there is not. In modern, complex societies, values and norms are often contested. Indicators can be used as part of such contests, especially when they are used to grade schools, districts, or states on the presence or absence of valued states. This is quite similar to lobbying groups’ use of congressional voting records to grade candidates on support for their positions. Then the voting record can be used so proponents of a position will know whom to support and how to motivate change in the desired direction in the entity measured.

Like voting records, educational indicators can also be used to measure the presence or absence of valued states in cases where the values behind the indicators are not so broadly shared. Strictly speaking, one could argue that almost all uses of indicators are normative because they measure, directly or through prediction, some valued state. The important difference is in the degree of consensus and salience of the value behind the indicator. We use the term more narrowly to focus on areas of disagreement because they create special problems for indicator developers who seek to remain neutral (and we will describe situations where indicator developers are decidedly not neutral) or who are part of large public agencies that are expected to represent broad public consensus.

Quality Counts ‘98 includes several normative indicators. One example is treating the presence of charter schools as a positive development even though opinions on this reform are mixed. Another example is treating small class size as good where parents and teachers prefer it and taxpayer groups oppose it as expensive. A third, more pervasive normative theme is the adoption of the systemic or standards-based reform view of how the educational system should be organized. This approach suggests that states ought to have standards for what students should know and that other policies related to assessment, textbook purchasing, the distribution of incentives, and teacher preparation should be aligned with those standards (e.g., Elmore & Fuhrman, 1994). While this framework reflects important learning from past reform efforts, it cannot be said to have been “proven,” and it has been criticized by academics (who think it may be too rationalistic) and conservative advocates of local control.

It is appropriate to include normative indicators in a compendium like Quality Counts, even if they cannot be shown to predict achievement. Often this is information people want. For instance, data on charter schools provide a measure of the extent to which a state education system is governed by political, as opposed to market, forces. Thus, it reflects preferences for the use of what Hirschman (1970) has called exit (i.e., advocating policies that entail leaving the public school system) as opposed to voice (i.e., advocating policies that entail working within the public school system) as a means for promoting institutional responsiveness.

Exploration of Quality Counts ‘98

Quality Counts ‘98 attempted to create a digestible compendium of information by grading the states based on five clusters of indicators representing (a) achievement, (b) high standards and aligned assessments, (c) teacher knowledge and skills, (d) school academic organization, and (e) financial resources.[1] In addition to presenting basic indicator data, composites were calculated in areas b-d to give each state grades, in the A-F letter format common in students’ report cards.[2] Each state’s grades for standards and assessment, quality of teaching, school climate, and resources (with adequacy, equity, and allocation, a measure of money getting to the classroom, measured separately) were repeated in a two- to three-page profile that followed the statistical tables. Expanded descriptions of the five clusters and their subcomponents are given in Table 1. To clarify some of the ways in which values inform the development of indicators and some of the issues involved in developing systems of indicators, we first examine a problem in selecting “appropriate” indicators. Next, we explore some issues relating to constructing the composites used for grading. Finally, we examine how state-level indicators of a variety of conditions correlate with statewide measures of student achievement.


Table 1
Description of Quality Counts Indicator Clusters and Subcomponent Weights

Quality Counts cluster / subcomponent                       Graded?   Composite?   Weight

All students achieving at high levels and engaged in
high-level academic work (no overall rating)
  1996 NAEP Grade 4 math                                    No        No           N/A
  1996 NAEP Grade 8 math                                    No        No           N/A
  1996 NAEP Grade 8 science                                 No        No           N/A

High standards for all children and assessment aligned
with those standards
  Overall                                                   Yes       Yes          100%
  Standards                                                 No        Yes          50%
  Assessments                                               No        Yes          30%
  Accountability                                            No        Yes          20%

Teachers who have the knowledge and skills to teach to
higher standards
  Overall                                                   Yes       Yes          100%
  Performance-based licensing                               No        Yes          40%
  Degree in subject matter                                  No        No           20%
  Professional development                                  No        Yes          20%
  Requirements of teacher education programs                No        Yes          20%

Schools organized and operated in a manner most
conducive to teaching and learning
  Overall                                                   Yes       Yes          100%
  Class size                                                No        Yes          35%
  Student engagement                                        No        Yes          20%
  Parent involvement                                        No        Yes          20%
  School autonomy                                           No        Yes          25%

Resources: Are they adequate, distributed equitably,
and focused on learning? (no overall rating)
  Adequacy of resources                                     Yes       Yes          N/A
  Allocation of resources                                   Yes       No           N/A
  Equity among districts                                    Yes       No           N/A

Taken together, these five clusters constitute an implicit indicator system with a model that includes inputs (financial resources, teacher knowledge and skills), processes (standards and assessments, school organization), and outputs (NAEP achievement levels). Each of these components could have included additional indicators. For example, student background could have been included as an input (though urban indicators are provided in a separate section), and dropout rates are not considered as outcomes (though dropout rates are provided as additional information). To provide a model of effective education, therefore, Quality Counts would need to be reorganized. However, it is not clear when it is important to provide more information and when doing so simply overburdens the reader, especially given the scant empirical evidence for a unitary model.

Choosing an Indicator

Although a number of specific indicators in Quality Counts ‘98 reflect the standards perspective, the most important may be the ratings of actual state standards. The best illustration of the normative nature of these ratings and the problems they create for indicator publishers is the number of available ratings and the lack of agreement among them on how to assess states. When Quality Counts ‘98 was assembled, two ratings of state standards were available. The one compiled by the American Federation of Teachers (AFT) rated standards on their clarity and specificity (Gandal, 1997). To measure this, AFT raters examined whether state standards were written grade-by-grade or in grade clusters. They also rewarded states that gave realistic examples of what a standards-based lesson was. However, the raters did not scrutinize the overall content quality of the standards; rather, they noted whether there were any gaping holes in the curricula. AFT authors and criteria developers (who were teachers themselves) were clearly most concerned with whether state documents could explicate what would be required of teachers.

The ratings prepared by the Council for Basic Education (CBE) focused on the rigor of state standards (Joftus & Berman, 1998). To do this, they assembled panels of teachers, business leaders, and subject-matter specialists who developed criteria for evaluating the content of the standards. In rating math standards, the panels decided to rely heavily on accepted or mainstream theories of what constitutes good mathematics instruction, such as those espoused by the National Council of Teachers of Mathematics (NCTM) and the creators of the National Assessment of Educational Progress (NAEP). In measuring English language arts standards, CBE drew their criteria from the document Standards for Excellence in Education (CBE, 1998).

The situation was complicated somewhat later when the Fordham Foundation (Raimi & Braden, 1998; Stotsky, 1997) released a third set of ratings. These ratings consciously tried to assess both the structure and content of standards documents. However, the criteria they selected were different from, and in some ways antithetical to, the ones employed by the other organizations, specifically CBE. For example, Raimi and Braden (1998) took away points if state standards contained “False Doctrine” (p. vii), such as discouraging paper-and-pencil arithmetic or excessive emphasis on real world problems. They felt such elements of standards were “injurious to the correct transmission of mathematics” (p. 10). Also, they looked for the inclusion of deductive reasoning in mathematics curricula. But these desiderata are opposite to those contained in the NCTM standards (1989), which, at that time, called for more real world problems and decreased emphasis on deductive reasoning and paper-and-pencil arithmetic; the Fordham criteria are therefore likely to conflict with the CBE ratings, which were geared to the NCTM standards. Under the circumstances, one would not expect a great deal of agreement in these three ratings of state standards.

Figures 1-4 show the agreement among these various ratings.[3] In Figures 1 and 2, there is reasonable agreement between the CBE and AFT ratings of states in mathematics and English, with two notable exceptions. New Jersey’s math standards are rated very high by CBE and quite low by AFT; Florida gets substantially higher ratings from AFT than CBE. Figures 3 and 4 compare the CBE and Fordham ratings. (Plotted points without state labels represent multiple states.) The apparent agreement in the English ratings must be interpreted cautiously because the Fordham rater analyzed only 28 states: seven highly populated states and 21 states that received high ranks from AFT. CBE and AFT ratings are available for over 40 states. For the states rated, there is reasonable agreement between CBE and Fordham on state English standards; however, as shown in Figure 3, the disagreement between CBE and Fordham ratings in mathematics is striking. In general, the Fordham ratings are much lower than those of the CBE.

FIGURE 1. Scatter plot for AFT and CBE mathematics curriculum ratings (horizontal axis: CBE Math, 50-100 scale).

FIGURE 2. Scatter plot for AFT and CBE English curriculum ratings (horizontal axis: CBE English, 50-100 scale).

FIGURE 3. Scatter plot for Fordham and CBE mathematics curriculum ratings (horizontal axis: CBE Math, 50-100 scale).

FIGURE 4. Scatter plot for Fordham and CBE English curriculum ratings (horizontal axis: CBE English, 50-100 scale).

These differences are vexing for indicator publishers. The important differences here are normative. They reflect divergent underlying constructs for rating standards that stem from conflicting political or ideological positions. Resolution is not likely to come through better technical measurement.

Composites and Grading

In the Quality Counts methodology, composite scores are obtained by (a) determining a weight, (b) multiplying the weight by the individual variable, and (c) summing the results across variables. The weights are normative judgments of the relative value of each individual variable in the composite. While reasonable people may disagree about the assumptions of normative models, judgments based on values are inescapable, and, in some cases, it can be irresponsible to avoid such decisions.

Nevertheless, it is useful to understand how the variables that make up a composite interact with one another. Exploration of this issue can help clarify the meaning of a composite and may, when combined with substantive knowledge of the area, suggest the need to break large composites into simpler constructs. Two composites proved especially complex. The first yielded an overall grade for the standards and assessments cluster. Factor analysis suggested that this composite had two underlying dimensions, as shown in Table 2. The first dimension was anchored by the AFT clarity-of-standards ratings for English and math and also included the CBE rigor of English standards. It might be considered a measure of quality of standards. The second dimension included quality-of-assessment instruments (giving more points to performance and criterion-referenced item formats than multiple choice), whether sanctions were applied when scores were low, whether the state adopted standards, and whether test scores were reported publicly. This second factor might be viewed as a measure of the elements in place to assess progress toward standards and to motivate higher performance. Thus, simply having high standards and having the assessment/incentive system to encourage educators to respond to standards seemed to be different phenomena. If the two were separated, states might get higher grades in one area than another. This would allow the reader to contemplate the meaning of a situation where a state received a high grade for its assessment-incentive system but a lower grade for its actual standards.


Table 2
Pattern Matrix for Standards and Assessments Indicators

                                    Pattern loading
Indicator                            1        2
Assessment quality                           .92
State sanctions performance                  .77
Tests aligned to standards                   .69
State adopted standards                      .65
Public reporting                             .64
State rewards performance                    .45
Promotion standards                          .44
State in NAEP                                .27
AFT English rating                  .96
AFT math rating                     .62
CBE English rating                  .56
CBE math rating                     .47

Note. Only coefficients above .25 are shown.


The second complex composite combined 13 variables to give an overall grade for school academic climate. In Table 3, factor analysis suggested three very different clusters of variables. The first consisted of 6 items taken from principals’ reports on conditions in their school in a survey conducted as part of the NAEP assessment. These included reports on how often parents came to parent-teacher conferences and the extent to which student misbehavior and absenteeism were perceived as problems. The second was defined primarily by reports on class size in fourth and eighth grade. The last cluster reflected the presence and strength of charter school legislation.

Table 3
Pattern Matrix for School Organization Indicators

Indicator                        Pattern loadings (three factors)
PT conferences                   .90
Parent involvement               .90
Student behavior                 .72
Student absence                  .71
Parents visit school             .67
Student tardiness                .62
Class size Grade 8               .34
Class size Grade 4               .26, -.29
Open enrollments                 .83, -.31
Site-based management            .75, -.29, -.47
State education waivers          .30, -.27
State allows charter schools     .92
Strong charter school laws       .76

Note. The first 6 variables are principal-reported items from the NAEP background questionnaire. Only coefficients above .25 are shown.
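For readers who want to reproduce this kind of exploration, the sketch below extracts a few dimensions from a state-by-indicator matrix. It is only a rough stand-in: the authors report a rotated factor-analysis pattern matrix, whereas this uses unrotated principal components, and the data are randomly generated placeholders rather than the Quality Counts indicators.

```python
import numpy as np

# Placeholder data: 50 states by 13 school-organization indicators (not real values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 13))

# Standardize each indicator and form the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)

# Eigendecomposition; np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Retain three dimensions and compute component loadings.
k = 3
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])

# Hide small coefficients, as in the published tables (|loading| <= .25 suppressed).
shown = np.where(np.abs(loadings) > 0.25, np.round(loadings, 2), np.nan)
print(shown)
```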

With many diverse elements in one composite, it is difficult to know how to make a decision based on the composite. Consider buying a copying machine. A composite might include speed, cost, and repair record for deciding which copier to buy. But how? If speed is weighted more than cost, one copier is chosen. If reliability is rated highest, another is chosen. In the case of states, which gets an A and which gets a C depends partly on the data and partly on the weighting of the individual elements. If the elements are presented separately, the reader can make a decision. If the elements are not presented separately, the reader may be overwhelmed with data. If a grade is given to a state, it’s easy for the reader to decide which state is best, but it’s hard to know what best means.
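A minimal sketch of the copier example, with made-up scores, shows how the normative choice of weights, rather than the data, can determine which option comes out "best":

```python
# Made-up, higher-is-better scores for two hypothetical copiers.
copiers = {
    "Copier A": {"speed": 0.9, "cost": 0.4, "reliability": 0.6},
    "Copier B": {"speed": 0.5, "cost": 0.8, "reliability": 0.9},
}

def composite(scores, weights):
    """Weighted sum of standardized scores."""
    return sum(weights[name] * value for name, value in scores.items())

speed_first = {"speed": 0.6, "cost": 0.2, "reliability": 0.2}
reliability_first = {"speed": 0.2, "cost": 0.2, "reliability": 0.6}

for label, weights in [("speed first", speed_first), ("reliability first", reliability_first)]:
    best = max(copiers, key=lambda name: composite(copiers[name], weights))
    print(f"{label}: {best}")
# "speed first" picks Copier A; "reliability first" picks Copier B.
```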

Disaggregation can help with diagnosis. A low grade on a composite can suggest that a problem exists in a broad area, but it does not help specify where the problem is. Examination of the components of the composite then helps to locate the problem, whether it be with standards or school organization. An even more fine-grained diagnosis can be obtained by examining individual indicators. However, because individual variables are weighted, diagnosis must take into account the size of the weight. If the weights are essentially judgmental, the diagnostic process is oriented toward achieving a better match between the current standing of a state and the ideal standing embodied by the composite.

These factor analyses and further perusal pointed to another kind of conceptual confusion in the items included in Quality Counts. For example, the school academic organization composite includes a statistic reflecting state charter school policies. Moreover, the composite for teacher knowledge and skills includes some statistics that reflect state policy on teacher certification requirements and other statistics that reflect school conditions, like the percent of secondary teachers who hold a degree in the subject they teach. Thus, in both of these composites (school academic organization and teacher knowledge and skills), policy and school condition variables were combined into single composites for grading purposes. Yet these kinds of variables have very different meanings for policymakers. Policies are the kinds of things that state government can act on directly, although the political complexities of doing so may vary from state to state and issue to issue.

School conditions have a very different significance. Sometimes they are factors that interact with policies to influence what the consequences of those policies will be. At other times, they are the targets that policies are intended to impact. One of the most useful aspects of Quality Counts ‘98 was that it provided a succinct reference work for determining the frequency and distribution of much discussed policies in the country. However, combining policies and school conditions into single indices was joining phenomena that are conceptually and empirically (as the factor analysis of the school organization variables indicates) quite distinct. There is a certain sense in evaluating states on policy variables, because these are relatively under state control, although making and implementing policies are complex processes that require the cooperation of many people. It may be sensible to grade a state on larger demographic patterns (for example, when one is deciding whether to move to a particular state) but not when one is trying to assess progress toward educational reform of some sort.

Correlations With Achievement

To move toward the development of an indicator system, one must develop a model of how the educational system works and then test the connections among variables in the model. In an admittedly crude step in that direction, we computed the correlations between the NAEP percent proficient variables (the achievement measures used in Quality Counts ’98) and the indicators, as they were used in the original publication. We also added two important demographic variables for each state (median family income and percent of children living in poverty) as well as the percent of teachers with a degree in the subject they teach. Several key results are evident in Table 4.
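The computation itself is straightforward: a Pearson correlation between each composite and each NAEP outcome across the 50 states. The sketch below uses randomly generated placeholder arrays; the variable names are ours, and the Quality Counts '98 values would be substituted in practice.

```python
import numpy as np

n_states = 50
rng = np.random.default_rng(1)

# Placeholder state-level data (not the published figures).
naep_grade4_math = rng.uniform(10, 35, n_states)   # percent proficient
composites = {
    "standards": rng.uniform(50, 100, n_states),
    "teacher_knowledge": rng.uniform(50, 100, n_states),
    "school_climate": rng.uniform(50, 100, n_states),
}

for name, values in composites.items():
    r = np.corrcoef(values, naep_grade4_math)[0, 1]  # Pearson r across states
    print(f"{name:>18s}  r = {r:+.2f}")
```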

One surprise is that the standards indicator has a mild negative correlation with the achievement outcomes. We suspect there are good reasons for this negative correlation.

Table 4
Correlations Among Achievement Variables and Quality Counts ’98 Composite Indicators

Composite indicator      Math Grade 4    Math Grade 8    Science Grade 8
Standards                    -.12            -.39             -.39
Teacher knowledge             .06            -.18             -.22
School climate                .54             .54              .65
Funding adequacy              .41             .20              .19
Funding allocation            .39             .27              .37
Inequity                      .22             .37              .41
Family income                 .34             .32              .20
Poverty                      -.64            -.70             -.65

States that most aggressively set and enforce standards in order to raise curriculum and performance standards may be those most in need of doing so. Far from being counterintuitive, this finding signals an important limitation of the indicators that are currently available. Namely, we are not yet at a state of development in which the indicators can be viewed as explanatory, much less predictive. Even where correlations between achievement and policy variables can be established, the causal direction is not clear. It is hard to know when changes in policy lead to improved achievement and when shifts in achievement signal the need for policy change.

There are other reasons why it may continue to be difficult to use state indicators to predict achievement. Most notably, state indicators mask a great deal of variation in both student characteristics and educational practice. It is often hard to establish correlations between school factors and achievement, and it is even more challenging at the district level. Moreover, with policy variables, the chain of events running from the adoption of a policy through district and school response to the actions of teachers and then children’s learning is extremely long. Even when they “work,” policies are subject to the vagaries of local implementation (McLaughlin, 1987). To use indicators as policy levers in the framework of a causal model, the level of aggregation must match the kind of inference desired. It does no good to uncover a causal relationship without some notion of how to use this knowledge. A finding of causation must be supplemented with information about the mechanisms involved.

Conclusion

One consequence of using indicator data may be the improvement of the indicator system itself. The discrepancies among standards indicators have prompted an effort by Achieve (a nonprofit organization born out of the 1996 National Educational Summit of governors and business leaders[4]) to bring together what has been learned from these three separate rating systems and come up with a sounder methodology, or at least one on which there is greater consensus.

Indicator publication can also create a “horse race” or competition for high ratings. This can be useful when indicators represent or are correlated with strongly accepted values, for example, student achievement. However, a horse race can be problematic when efforts to do well on a rating lead a state to de-emphasize some other unmeasured value. This is the recurring concern about teaching to the test. Moreover, it is difficult to know what to make of a horse race when, as in the case of standards ratings, the handicappers cannot agree on who is ahead. Education is one of many arenas where battles over national values and political philosophies are fought, and some normative indicators provide useful tools for assessing which camps are dominating.

Indicators can serve useful purposes despite the lack of a consensual model for educational effectiveness. For example, the identification of unusual or problematic trends, such as the 10-year decline in SAT scores, stimulated exploration of reasons for the decline, public discussion of the issues, and greater public awareness (Koretz, 1986, 1987). However, models of educational efficacy have not yet been discovered that can relate inputs and processes to educational outputs. Indeed, it is uncertain, given our discussion of social norms, whether such models can exist. It is thus crucial to avoid simple causal inferences based on the correlational analysis of variables in an indicator system. Not only can such analyses confuse cause and effect, but, even when a cause is correctly identified, it may exist at the wrong unit of analysis for designing educational interventions.

Educational and other social indicators require continual updating and improvement as consequences of their use become known. The Quality Counts series illustrates both the strengths and weaknesses of the current state of the art. Its apparent popularity suggests that what is needed right now is a continuing development effort. Better indicator systems have developed through years of trial-and-error learning and are likely to progress on a number of fronts. We would like to see more attention given to two areas. The first concerns ethical issues raised by the normative use of indicators. What role should indicator developers and publishers play in the culture wars surrounding schools? This issue is fairly straightforward for advocacy groups like the Fordham Foundation. It is more challenging for organizations like the publishers of Quality Counts, whose objective is to get information of interest out to a wide variety of users. The problems of providing unbiased news coverage and unbiased indicators may turn out to be quite comparable. Moreover, there may be trade-offs between presenting a balanced picture and generating readership. In some instances, the measures of more controversial phenomena, like charter school legislation, may be of most interest.

The second issue is clarifying user needs or interests. The staff of Quality Counts should be commended for conducting research on who its readers are and what they want. More generally, it is important to know more about who uses indicators, what those indicators are used for, and how to make them more useful. It may also be helpful to ask more value-laden questions, such as how indicators do or can contribute to the improvement of education or the more adequate functioning of a democratic society.

Notes

The research that forms the basis for this article was supported by the Pew Charitable Trusts. The opinions expressed here are those of the authors and do not necessarily reflect the views of the Trusts or of Rutgers University. Thanks are due to Paul Gale, David Mayrowetz, Lora Monfils, and the staff of Quality Counts for their help with this article.

1. Each year’s volume may use a somewhat distinct strategy. Grading was part of the approach in the 1998 and 1999 volumes.

2. Rules for combining indicators into grades are described more completely on the pages with the indicator data.

3. We scaled each of the original numerical ratings so they had a common scale that ran from 50 to 100 because CBE generally gave higher marks to states than the AFT, which rated higher than the Fordham Foundation. We then plotted the AFT and Fordham ratings against those of the CBE in English and in mathematics. According to the Quality Counts grading system, 10 points along this scale corresponds to a difference of about one letter grade (e.g., A versus B). We added diagonal axes that represent a single-letter-grade difference above and below perfect agreement (i.e., the y = x line). Plotted points outside this diagonal region, in our view, represent notable disagreement between ratings.
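A sketch of this rescaling-and-plotting procedure is given below. The linear mapping onto a 50-100 scale is one plausible reading of the note (the exact transformation is not stated), and the rating values are randomly generated stand-ins; only the 50-100 range, the y = x reference line, and the plus-or-minus 10-point band follow the text.

```python
import numpy as np
import matplotlib.pyplot as plt

def rescale(x, lo=50.0, hi=100.0):
    """Linearly map a set of ratings onto a common 50-100 scale (assumed transformation)."""
    x = np.asarray(x, dtype=float)
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())

# Placeholder ratings for about 40 states (not the actual CBE/AFT scores).
rng = np.random.default_rng(2)
cbe_raw = rng.uniform(0, 4, 40)
aft_raw = cbe_raw + rng.normal(0, 0.8, 40)

cbe, aft = rescale(cbe_raw), rescale(aft_raw)

grid = np.linspace(50, 100, 2)
plt.scatter(cbe, aft, s=15)
plt.plot(grid, grid, "k-", label="perfect agreement (y = x)")
plt.plot(grid, grid + 10, "k--", label="one letter grade (10 points)")
plt.plot(grid, grid - 10, "k--")
plt.xlabel("CBE (rescaled, 50-100)")
plt.ylabel("AFT (rescaled, 50-100)")
plt.legend()
plt.show()
```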

4. Information on Achieve comes from its website, <www.achieve.org>.

References

Affholter, D. P. (1994). Outcome monitoring. In J. S. Wholey, H. P. Hatry, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (pp. 96-118). San Francisco: Jossey-Bass.

Barton, P. E., & Coley, R. E. (1998). Growth in school: Achievement gains from the fourth to the eighth grade. Princeton, NJ: Educational Testing Service.

Council for Basic Education. (1998). Standards for excellence in education. Washington, DC: Author.

Elmore, R. F., & Fuhrman, S. H. (1994). The governance of curriculum. Alexandria, VA: Association for Supervision and Curriculum Development.

Gandal, M. (1997). Making standards matter: An annual fifty-state report on efforts to raise academic standards. Washington, DC: American Federation of Teachers.

Hirschman, A. O. (1970). Exit, voice, and loyalty: Responses to decline in firms, organizations, and states. Cambridge, MA: Harvard University Press.

Joftus, S., & Berman, I. (1998, January). Great expectations? Defining and assessing rigor in state standards for mathematics and English language arts. Washington, DC: Council for Basic Education.

Koretz, D. M. (1986). Trends in educational achievement. Washington, DC: Congressional Budget Office.

Koretz, D. M. (1987). Educational achievement: Explanations and implications of recent trends. Washington, DC: Congressional Budget Office.

McLaughlin, M. W. (1987). Learning from experience: Lessons from policy implementation. Educational Evaluation and Policy Analysis, 9(2), 171-178.

Merton, R. K. (1968). Social theory and social structure. New York: Free Press.

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.

Oakes, J. (1986). Educational indicators: A guide for policymakers (CPRE OPE-01). New Brunswick, NJ: Center for Policy Research in Education.

Porter, T. M. (1986). The rise of statistical thinking, 1820-1900. Princeton, NJ: Princeton University Press.

Quality counts ‘98: The urban challenge. (1998, January 8). Education Week, 17.

Raimi, R. A., & Braden, L. S. (1998). State mathematics standards: An appraisal of math standards in 46 states, the District of Columbia, and Japan. Fordham Report, 2(3). Washington, DC: Fordham Foundation.

Snyder, T., & Wirt, J. (1998). The condition of education, 1998 (NCES Report No. 98013). Washington, DC: National Center for Education Statistics.

Stotsky, S. (1997). State English standards. Fordham Report, 1(1). Washington, DC: Fordham Foundation.

Taillandier, A. (1828). Review of “Compte général de l’administration de la justice criminelle en France” [General account of the criminal justice administration in France]. Revue encyclopédique, 40, 600-612.

Takahira, S., Gonzales, P., Frase, M., & Salganik, L. H. (1998). Pursuing excellence: A study of U.S. twelfth-grade mathematics and science achievement in international context (NCES Report No. 98049). Washington, DC: National Center for Education Statistics.
