
Educational Evaluation and Policy Analysis Summer 1998, Vol. 20, No. 2, pp. 95-113

Performance-Based Assessment and Instructional Change: The Effects of Testing in Maine and Maryland

William A. Firestone, David Mayrowetz, and Janet Fairman
Rutgers University

To examine how performance-based assessment changed mathematics teaching under conditions of moderate and low stakes, we studied middle school teachers in five districts in Maine and Maryland. Our observations suggest that the effects of state testing on teaching may be overrated by both advocates and opponents of such policies. When combined with moderately high stakes and other conditions, such assessments generate considerable activity focused on the test itself. This activity can promote certain changes, like aligning subjects taught with the test. It appears to be less successful, however, in changing basic instructional strategies.

Recently, there has been considerable debate about how useful state assessment is as an instrument for reforming educational practice. One group decries high-stakes assessment as dumbing down the curriculum, de-skilling teachers, pushing students out of school, and generally sowing fear and anxiety among both students and educators (e.g., Corbett & Wilson, 1991; Smith, 1991). A second group, arguing for measurement-driven reform, holds that, in effect, "if you test it, they will teach (and learn) it" and that assessment can guide the educational system to be more productive and effective (Popham, 1987). This position has taken on added strength with the recent rise of performance-based assessment, which purports to offer a technology for assessing higher-order skills and a deeper understanding of content, improving on the earlier generation of minimum basic skills tests built from multiple-choice items (Baron & Wolf, 1996; Rothman, 1995).

While these two positions disagree about the consequences of state testing, they agree that assessment is a powerful lever for shaping instruction. A third view questions whether policy in general can influence teaching (e.g., Cohen, 1995; Cohen & Ball, 1990). From this perspective, how teachers teach depends on deeply ingrained beliefs combined with limited knowledge about both pedagogy and content. Taken together, these factors contribute to the typical American approach to instruction, with its shallow coverage of too many topics in a ritualized manner that fails to engage students or offer them sufficient challenge. While students may learn to recall specific facts, they are rarely forced to think deeply about much of anything. In this view, the American policy system, even with the introduction of new forms of testing, is too fragmented and unstable to overcome the forces of inertia that maintain the status quo.

This article explores the contribution that the new assessment strategy can make to improving instruction through a careful examination of educators' responses to middle school mathematics assessments in Maine and Maryland. This research suggests that when combined with moderately high stakes and other conditions, such assessments generate considerable activity focused on the test itself. This activity can promote certain changes, like aligning subjects taught with the test. It is less successful, however, in changing basic instructional strategies.

Instruction and Assessment Policy

The classical pattern of teaching in America has been well documented (e.g., Goodlad, 1984; Powell, Farrar, & Cohen, 1985). It focuses on teacher-led recitation, where students, working alone, are expected to remember and reproduce a wide range of specific facts (and, in mathematics, discrete operations). The number of topics and the shallowness of their coverage distinguish American education from the schooling practiced in most countries to which we would like to be compared (Schmidt, McKnight, & Raizen, 1996). Moreover, American students are ordinarily grouped by age and ability, and those in lower tracks are generally believed incapable of learning the same amount or at the same rate as those in higher tracks. This basic approach has proved remarkably resilient in the face of numerous attempts to reform it in this century (Cuban, 1993; Elmore, 1996).

Many researchers and reformers are now asking whether the introduction of new assessments can break that pattern. At one level, there is considerable evidence that testing can change patterns of teaching, if only by promoting "teaching to the test." This is an ambiguous phrase with many meanings, including teaching general test-taking skills like how to fill in answer sheets, teaching with materials like those on the test, and teaching content known to be covered on the test, perhaps at the expense of content not covered; Smith (1991) noted these and more in her work. Corbett and Wilson (1991) suggest that such tests encourage educators to intensify the use of old means to address new problems. However, advocates of newer assessment technologies, like Rothman (1995), suggest that statewide use of performance-based assessment and related approaches has the potential to transform large numbers of classrooms. The argument for using assessment as a tool for reform is made rather eloquently by the Maine Department of Education (1994, p. 2):

Many educators argue that an ideal test would be such that "teaching to it" would be desirable. Teaching to a multiple-choice test uses instructional time to teach skills that are only used during testing. Because open-response questions like those used in the MEA are more direct measures of what we want students to be able to do in non-testing (and even non-school) situations, however, teaching students how to deal with such questions should be an integral part of instruction.

The direction of transformation expected by supporters of performance-based assessment is much like that expected by supporters of standards-based reform, which has euphemistically been phrased as "tough stuff for all kids" (Smith & O'Day, 1991). One leader in the effort to promote more challenging standards is the National Council of Teachers of Mathematics (NCTM), whose standards have become a model emulated by other subject areas. Central to these standards is the concept of "mathematical power," which

denotes an individual's abilities to explore, conjecture, and reason logically, as well as the ability to use a variety of mathematical methods effectively to solve nonroutine problems. This notion is based on the recognition of mathematics as more than a collection of concepts and skills to be mastered; it includes methods of investigating and reasoning, means of communication, and notions of context. (NCTM, 1989, p. 5)

This statement recognizes that students need to be able to calculate accurately, but it also expects them to be able to hypothesize, abstract mathematical properties, explain their reasoning, and validate their assertions about mathematical issues. Lampert (1990) illustrates teaching to promote such capacities. She documents a class of fifth graders who analyzed patterns of squared numbers and then numbers raised to other powers in order to develop laws governing exponentiation. The students not only learned mathematical content but also engaged in mathematical reasoning and developed proofs.

Insight into teaching methods that can promote a more thoughtful approach to mathematics comes from the Third International Mathematics and Science Study (TIMSS) videotape study, which compared the practices of 81 eighth-grade teachers in the United States with 50 in Japan (Stigler & Hiebert, 1997). These researchers found that the typical American lesson had two phases:

In the acquisition phase, the teacher demonstrates or leads a discussion on how to solve a sample problem. The aim is to clarify steps in the procedure so that students will be able to execute the same procedure on their own. In the application phase, students practice using the procedure by solving problems similar to the sample problem. (p. 11)

In Japan, where students scored much higher than in the United States:

The lesson focuses on one or sometimes two key problems. After reviewing the major point of the previous lesson and introducing the topic for today's lesson, the teacher presents the first problem. This problem is usually one that students do not know how to solve immediately but for which they have learned some crucial concepts or procedures. . . . Students are asked to work on the problem . . . and then to share their solutions. The teacher reviews and highlights one or two aspects of the student solution methods or presents another solution method. (p. 12)

These holistic differences were reflected in a number of discrete contrasts. In comparison with Japanese classrooms, American classes showed much less deductive reasoning (in no American classes vs. 62% of the Japanese), much more stating of concepts without developing them (78% vs. 17%), and much more requiring students to practice procedures (96% vs. 40%) instead of inventing them (less than 1% vs. 44%). Japanese teachers are also considerably more active than their American counterparts, lecturing to explain concepts and procedures or summarizing the results of student explorations in 71% of the observed classes. American teachers lectured in only 15% of their classes. Stigler and Hiebert (1997) note that Japanese teaching does not directly reflect the NCTM standards, nor is it likely to be strictly adaptable in the American context. Still, it is a useful exemplar of an approach to teaching that encourages students not only to use mathematical algorithms and procedures to reach correct solutions but also to learn how to develop and apply them. The question we asked was whether the introduction of performance-based assessment would encourage teaching that resembled this or only more teaching to the test in the narrow sense of the term.

Several factors still work against the introduction of such instructional approaches. One is teachers' knowledge. Observers have suggested that most teachers simply lack deep enough knowledge of the content they teach and the ways to teach it (Ball & Rundquist, 1993; Cohen, 1995). Even teaching such "basic" concepts as addition in a way that goes beyond introducing and practicing procedures may require a deeper understanding of the underlying mathematical concepts, the psychology of learning the subject, and the appropriate pedagogy than most teachers commonly have (Fennema, Carpenter, & Peterson, 1989).

A second factor is the curriculum. The mathematics curriculum is typically viewed as a fixed set of topics or courses that students must move through in a particular sequence because knowledge of earlier subjects is necessary for learning later ones. Students' ability is reflected in the speed with which they can move through this sequence and their facility with "higher" subjects like algebra, geometry, and calculus. This view of the subject has been built into most textbooks, district curricula, and—until recently—assessment tools (Grossman & Stodolsky, 1994; Schmidt, McKnight, & Raizen, 1996). Any effort to deviate from the expected curriculum to explore new topics or use different instructional approaches gets in the way of "covering" the set curriculum. Moreover, existing materials simply do not support this new approach to instruction. One problem faced by Vermont teachers working with an assessment system that encouraged this new approach to instruction was that they simply lacked good problems (Firestone & Pennell, 1997).

Because the current approach to teaching is so deeply ingrained in both the psychology of educators and the organization of schooling, it has been argued that serious and sustained policy interventions are required to modify it. Assessments can play a key role in these interventions. Tests can serve useful persuasive and educational functions (McDonnell, 1994). Test items provide definitions and criteria of successful learning and performing, and they provide models of useful instructional strategies. Test scores give evidence about how well or badly an individual, a school, or even a country is doing. In this regard, the shift to performance-based assessment that requires students to show their knowledge by constructing a response—i.e., writing an essay or showing how they solve a mathematical problem—can be quite important by modeling a form of knowledge quite different from the recall normally required by multiple-choice tests (Rothman, 1995).

Such modeling can be very helpful, but for educators to take these ideas seriously, some suggest that they must be buttressed by a mix of pressure and support in order to address both the will and capacity of educators to change (Fullan, 1991; McLaughlin, 1987). Pressure can normally come from stakes or sanctions, the administration of which depends on test scores (Corbett & Wilson, 1991). Stakes can be targeted at either students or educators and can take a variety of forms. Passing a test can be a requirement for graduation from school. The proportion of students achieving at a certain level can trigger consequences for educators ranging from merit pay to state takeover. There can also be indirect stakes, as when test scores are published in newspapers, leading to comparisons between schools and districts. These can affect educators' sense of self-esteem, but they can also have more direct consequences if they prompt public discontent or—as happens in England—if parents have the opportunity to choose their schools on the basis of past performance.

Past research has emphasized the negative consequences of high stakes. In the 1980s, severe consequences created pressures that encouraged teachers to emphasize drill-based instruction, narrowing of content, and the regurgitation of facts even more than they did normally (e.g., Corbett & Wilson, 1991; Smith, 1991). In addition, substantial time was lost in test preparation activities—that is, learning the test formats rather than additional content. However, for the most part, the tests that were studied were multiple-choice, basic-skills-oriented tests, not the newer performance-based assessments. Even then, some argued that while the problem might be excessive pressure, the solution could include redesigning the tests (Madaus, 1988). There is as yet very little research on the effects of high stakes linked to performance-based assessments (but see Koretz, Mitchell, Barron, & Keith, 1996; Smith, 1996). When set too high, stakes could result in the same problems with performance-based assessments as with other tests. On the other hand, educators are likely to ignore assessments that model forms of teaching and conceptions of learning with which they disagree or that they do not understand unless some pressure is applied to take them seriously.

Yet it seems improbable that applying more pressure to educators will get them to use instructional approaches that they do not understand. Steps must also be taken to increase their capacity to teach in new ways (McLaughlin, 1987). An important dimension of capacity in this regard is financial; districts need the wherewithal to buy new materials aligned with new assessments and time to help teachers learn new educational approaches (Smith, 1996). Another dimension refers to knowledge. If tests adequately represent new standards and new conceptions of content to be learned in a subject area, teachers will need to understand short-term issues such as what it takes to score well on those tests. They may also need the deeper pedagogical content knowledge to help students learn the basic subjects at a more profound level than they have in the past (Shulman, 1987). While there has been a fair amount of research on the effects of different kinds of stakes, less attention has been given to the capacity-development problem, in part because states have attended less to this issue. There have been a few studies of approaches to building teachers' capacity linked to assessment programs (Firestone & Pennell, 1997; Murnane & Levy, 1996), but we have not drawn on relevant international experience (e.g., Radnor & Shaw, 1995), and we are just beginning to develop good frameworks for exploring this issue (Corcoran & Goertz, 1995).

To explore how state assessments affect teaching, we asked the following questions:

• What responses to state mathematics tests can be described in each state?
• What changes in standard modes of teaching mathematics are noted in each state?
• What factors explain the patterns of stability and change noted in each state?

Methods

Data reported here come from a qualitative study of administrative and teacher responses to testing policies in states that have recently adopted performance-based assessments. Maryland was selected because the state linked formal sanctions to test performance, while Maine did not have such formal stakes. We chose the states because both had eighth-grade tests and our focus was on how tests changed middle school mathematics instruction.

State Contexts and Testing

Maine is a fairly decentralized state, with 188 local entities ("towns," school administrative districts [SADs], and supervisory unions).1 The state makes relatively little educational policy. The Maine Educational Assessment (MEA) began in 1984-1985 as a multiple-choice test after the Education Reform Act of 1984 mandated a statewide testing system for the first time in Maine (Table 1). Over time, the department of education increased the proportion of open-response items until, by 1994-1995, the test was entirely open response in format. Support for educational policy initiatives is limited by Maine's small and declining department of education. In 1991, the department's assessment staff was cut from five to one, limiting capacity for outreach.

The context of testing policy is different in Maryland. First, middle school students must deal with two state tests. The Maryland Functional Tests—the multiple-choice, minimum-competency high school graduation tests studied by Corbett and Wilson (1991)—are now passed by most students in the seventh grade.

TABLE 1
General Features of Two Assessments
(MEA = Maine Educational Assessment; MSPAP = Maryland School Performance Assessment Program)

Policy initiating test and date of implementation
  MEA: Education Reform Act of 1984 mandated statewide assessment; MEA first implemented in 1984.
  MSPAP: Test is part of the Maryland School Performance Program and accountability system; MSPAP implemented in 1991.

Date test included only open-response items
  MEA: October 1994
  MSPAP: May 1991

Subjects tested and total testing time per student
  MEA: Seven subjects; five to seven hours of testing per student.
  MSPAP: Six subjects; up to nine hours of testing.

Number of math items and testing time on math portion
  MEA: Eight common and two matrix-sampled items per student; 80 minutes for math, with up to 20 minutes' extra time allowed.
  MSPAP: Two or three matrix-sampled items; 90 minutes for math.

Group work
  MEA: None
  MSPAP: Some

Time of year administered
  MEA: October
  MSPAP: May for eighth grade

Testing stakes
  MEA: Test scores are published in the newspaper.
  MSPAP: Test scores are published. Reconstitution for schools with consistently declining test scores is possible.

Access to test
  MEA: Test security less extreme, but test scored by contractor, except for the writing assessment, which is teacher scored.
  MSPAP: Heavy test security, although "public release items" like those on the test are available. Some teachers score the assessment. Many administer it.

State-level professional development
  MEA: None geared to test.
  MSPAP: Maryland Assessment Consortium, Maryland Governor's Academy

The eighth-grade performance-based test under examination here is the Maryland School Performance Assessment Program (MSPAP), which had an entirely open-response format from its first implementation in 1991. It became mandatory as of 1993. The performance assessment developed as part of a school accountability system that evolved over several years. Second, the school system is much more centralized in Maryland than in Maine. While Maryland has a larger population, schools have been consolidated into 24 (usually) county-wide districts. Third, the Maryland State Department of Education (MSDE) has a long history of addressing curricular and other issues, usually with substantial district input, and of mandating procedures to be followed locally (Corbett & Wilson, 1991).

Both the MEA and MSPAP are significant departures from multiple-choice testing, requiring students to engage in reasoning, problem-solving, and mathematical communication—skills specifically emphasized in the NCTM standards (NCTM, 1989).2 For instance, both tests de-emphasize pure calculation by allowing students to use calculators on at least parts of the tests. Students must reason about which solution method is appropriate to use. Most items are multipart, where students perform a series of tasks and answer questions related to a central problem or situation.


Mathematical communication is strongly encouraged because items require students to write justifications for their solutions, graph data, or construct pictorial representations of solutions. These characteristics of test problems reflect the NCTM standard, which urges that "students assume more responsibility for validating their own thinking" (NCTM, 1989, p. 79). Students construct responses that are scored by trained judges using predetermined rubrics. Scaled scores for items on the MSPAP and performance levels associated with scaled scores on the MEA are determined by specified performance criteria, which can include the understanding of mathematical concepts, the application of deductive or inductive reasoning, the ability to interpret data or make conjectures, the ability to determine the reasonableness of solutions, the ability to validate arguments, and the appropriate use of mathematical language.

Some general differences between the two assessments are highlighted in Table 1. Maine tests arts and humanities together, as well as health, while Maryland tests language usage. Both assessments cover reading, writing, math, science, and social studies. The total testing time per student is longer on the MSPAP; MSPAP items generally appear to have more parts and steps and take longer to complete than the MEA items. Another difference is the use of group or paired work on some MSPAP tasks. Typically, students work together on tasks involving experimentation or manipulation of physical materials and data collection, as in the mathematics/science problem we reviewed. The difference in time of test administration is also important to note. Because the MEA is administered in October of the eighth-grade year, it should be regarded as a test of what students have learned through the seventh grade, while the MSPAP is administered at the end of the eighth-grade year in May. Differences in the nature of mathematical content covered by the two assessments are described in more detail in a later section of this article.

Another important difference is in the stakes or repercussions associated with published test scores in the two states (Table 1). While Maine has no formal stakes for schools or students, test scores are published in local newspapers, which encourages comparison of school performance within and across districts. Maryland also releases test scores to the media. More important, it "reconstitutes" schools with consistently declining test scores.

Reconstitution is a state intervention that can include removing teachers or administrators from the school. Eligibility for reconstitution is determined by a formula that incorporates test score indices from both the Functional Tests and MSPAP, along with student attendance and graduation rate indices. Schools threatened with reconstitution have the opportunity to remedy their problems. Maryland had slated 52 schools for reconstitution by the spring of 1997; all but two of these schools are in the Baltimore City district.
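To make the idea of such a formula concrete, the sketch below combines four school-level indices into a single composite and checks for a declining trend. This is a minimal illustration only; the article does not report Maryland's actual formula, so the weights, index scales, and trend rule here are entirely invented.

# Hypothetical sketch of a composite school index of the kind
# described above. Maryland's actual formula is not given in this
# article, so the weights and all numbers below are invented.

def composite_index(functional, mspap, attendance, graduation,
                    weights=(0.3, 0.4, 0.15, 0.15)):
    """Weighted combination of four school-level indices (each 0-100)."""
    parts = (functional, mspap, attendance, graduation)
    return sum(w * p for w, p in zip(weights, parts))

# Three years of invented index values for one school.
history = [
    composite_index(70, 45, 88, 80),
    composite_index(66, 41, 86, 78),
    composite_index(63, 38, 85, 75),
]
declining = all(later < earlier for earlier, later in zip(history, history[1:]))
print(history, "consistently declining" if declining else "stable")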

Patterns of access to the assessment—which should facilitate both teaching to and learning from the test—were different in the two states. Maryland emphasized test security more than Maine, although it did make items like those on the test available to the public. Some Maryland teachers got to know the test by scoring responses during the summer; this did not appear to provide the learning opportunity that a similar activity did in Vermont, however (Murnane & Levy, 1996). Even more teachers became familiar with the MSPAP by administering it, which required extensive preparation because of the complex student tasks. Maine teachers seemed to have easier access to their assessment, as items were released after use.

Finally, Maryland offered somewhat more professional development opportunities for teachers that were linked to the state tests. The state supported two activities. The Maryland Assessment Consortium developed items like those on the MSPAP for teachers to use in the classroom. The Maryland Governor's Academy had an agenda that was not as tightly linked to the state assessment, but it did provide teachers with useful information linked to the assessment.

Data Collection

Because the main purpose of this study was to examine how state policies were locally interpreted, we chose an embedded case-study design that allowed us to look at teachers within districts within states. The original plan was to select a poor and a middle-wealth district in each state for study, expecting this strategy to give us one high-scoring and one low-scoring district. This approach made sense in Maryland, but a major source of variation in Maine was between the few relatively urban districts and the many rural ones. There we chose one urban district as well as a poorer and a wealthier system from among the smaller districts.

Table 2 shows some demographic characteristics of the five districts we visited during the 1994-1995 school year.

TABLE 2
District Characteristics

                                         Maryland                         Maine
District characteristic       Chesapeake     Frontier       River City   Farm Town    Factory Town
Number of students            10,000-15,000  10,000-15,000  2,000-4,000  2,000-4,000  <2,000
Number of middle schools      5              3              2            3            1
Percent on free lunch         20             45             35           40           20
Quintile for eighth-grade
  math scores                 3              4              3            4            3

The two Maryland districts are as much as three times larger than the largest in Maine, but they are smaller than most in the state, where student populations average 33,000 per district. One implication of this difference is that the Maryland districts had much larger central offices, with several curriculum specialists and a director of testing. In addition to the superintendent, the Maine districts sometimes had less than one full-time central office person working on curriculum issues. All five districts had student populations that were more than 90% White. This distribution reflects the student population of Maine rather well but not Maryland, where, statewide, the student population is 58% White. The two states were quite similar in one measure of student poverty—the proportion on free lunch, about 30% in each state. The districts chosen were spread around that figure, with the two wealthier districts having only about 20% eligible for free lunch and the two poorer districts having roughly twice as many, with River City falling in between.

Our data collection strategy was to work down the organizational hierarchy, getting the context first from administrators. Table 3 shows the number of interviews conducted in each district. The first site visit was made to talk to central office administrators, board members, principals, and teachers. Although we focused on changes in the teaching of mathematics, we wanted to get some feel for changes in a variety of subjects. Therefore, during the first site visits, we intended to interview not only department heads in mathematics but also those in English and social studies. Because many schools had grade-level teams or other arrangements (not departments), we often had to ask the principal to select the most knowledgeable teacher in each of these subject areas.

TABLE 3
Respondents Sampled by District

                               Maryland                  Maine
Sampling category        Chesapeake  Frontier  River City  Farm Town  Factory Town  Total
Schools visited               2          2          2           3           1         10
School board member           1          1          1           1           1          5
District administrator        7          5          2           1           2         17
Principal                     2          2          3           3           1         11
Department head or
  other teacher               8          4          4           6           2         24
Math teacher                  6          5          6           5           3         25
Total interviewed            24         17         16          16           9         82


Interviews followed semi-structured interview guides that specified topics to be covered but, in most cases, did not require specific phrasing of questions (Patton, 1990). Interviews with administrators lasted from one to one and a half hours.

The next two site visits were made to visit math teachers. Where schools were large enough, we tried to meet with three eighth-grade mathematics teachers in each school. Especially in Maine—where the state test was administered in the fall of the eighth grade and much preparation would have to happen in seventh grade—we had to include math teachers from lower grades because there were not enough eighth-grade teachers in some schools. Each site visit included observations of two classes and an interview. Observers kept a running record of events in each class and summarized their observations afterward in terms of a set of prespecified categories. Teachers were also interviewed each time, partly to get more information on the classes observed but also to learn more about teachers' views of the state testing policy, mathematics, and other topics. The first round of interviews was also semi-structured, but the second round asked more specific, common, open-ended questions. Each interview took about one class period, or 45 to 50 minutes.

The strength of this intensive strategy is that it provides useful data on teachers' understandings of how to teach, how those understandings are operationalized, and the linkage between testing policy and their practice. The problem is that the small samples required for this intensive work make generalizing more difficult than usual. In Maryland, we were able to supplement our fieldwork by referring to Koretz, Mitchell, Barron, and Keith's (1996) survey study of how over 200 teachers responded to that state's test.

Data Analysis

The research team met periodically during fieldwork to compare findings, identify preliminary hypotheses, and refine analysis plans. At the end of the field period, all data were entered into a computerized qualitative data analysis program and coded. The coding scheme was partially grounded. It began with a set of broad topics reflecting the original study design and questions, but specific subcategories (and new major categories) were developed out of a reading of the data. This coding indexed the complete data set. Further reduction was accomplished by retrieving and reading data in substantive codes, often crossing categories with demographic variables (Maine vs. Maryland, teacher vs. administrator). This second stage was conducted with specific questions in mind. Often charts were created using quotes or summaries of data for individual respondents, organized by some combination of state, district, and/or position. These charts and other strategies were used to search for patterns in the data and to triangulate across data sources, including teacher and administrator interviews and teacher interviews and classroom observations. They also allowed us to identify themes and trends in the data and count the frequency with which such themes occurred. As these themes were developed, they provided the basis for developing and testing larger arguments using a process of axial coding as described by Strauss and Corbin (1990).

To examine the 91 classroom observations specifically, we identified three dimensions that best described possible changes in the teaching and learning of mathematics. One of these dimensions emerged from our data, while the other two are grounded in the current literature on the TIMSS videotape study (Stigler & Hiebert, 1997).

To standardize the classroom coding system, all three authors independently analyzed two sets of 12 observation transcripts. We met after reviewing each batch to reconcile minor differences in interpretation and to ensure that everyone used common definitions consistently. While coding these 24 observations, we also noted that some lessons were divided into segments in which teachers used significantly different styles of instruction or required student activities that were so different that these segments had to be coded differently. In these rare instances, and in the few that occurred among the observations coded later, we decided to analyze the constituent parts rather than the whole period. The remaining classroom observations were coded independently by two researchers, one of whom was present during the actual lesson and recorded the field notes. Interrater reliability for these remaining 67 classroom observations was 93.5%.
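As a concrete illustration of the agreement statistic reported above, the short sketch below computes simple percent agreement between two coders. The category labels and codings are invented for illustration; they are not the study's observation data.

# Minimal sketch of simple percent agreement between two coders.
# The codings below are invented; they are not the study's data.

def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same items.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

# Hypothetical codings of ten lessons on the problem-size dimension.
coder_a = ["small", "small", "large", "small", "large",
           "small", "small", "large", "small", "small"]
coder_b = ["small", "small", "large", "small", "small",
           "small", "small", "large", "small", "small"]

print(f"Agreement: {percent_agreement(coder_a, coder_b):.1f}%")  # 90.0%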

Findings

In the following pages, we first describe teaching to the test and suggest that we saw more of this in Maryland than in Maine. A closer examination of teaching practice, however, suggests comparable shifts in both states. To explore factors that might contribute to the changes observed, we first look at the forces for stability that seemed to be common to both states, the policy factors that contributed to change in Maryland, and other factors that seemed to promote change in Maine.

Responses to the Test

What was immediately striking in the interview data was that three times as many teachers in Maryland could describe changes in their teaching made to accommodate the state test as in Maine (15 to 5). Teachers noted a number of specific changes that they made in response to the test. In one school, they reported a collective decision made by mathematics teachers to emphasize specific curriculum areas on the test like number relationships, measurement, and mathematical connections. Even more specific changes included introducing certain vocabulary, putting up a poster on how to format a graph, or teaching how to make stem-and-leaf charts because these issues came up on the test. One teacher remarked, "the first year, the kids were drawing real leaves and putting numbers around them." Thus, the test was clearly influencing the content taught in Maryland.

Whether Maryland teachers were changing their instructional methods was less clear. The closest thing to a methods change came from the several teachers who mentioned "MSPAP activities." MSPAP activities illustrate how "teaching to the test" has the potential to significantly change instructional practice but also how it can fall short. To varying extents, these activities were test-preparation activities. In the weeks just before the MSPAP was given officially, one teacher did a strict simulation of the test with problems he had made up and coached students on what was an appropriate answer in the testing context. At one point, he told students:

Let me warn you! What is a paragraph? Is it two pages? Is it 50 words? Twenty-five words? Longer isn't better. They like it short and sweet in complete sentences. Also you can draw a picture and label it and then use the labels in sentences.

At other times, students did projects that were deemed MSPAP-like, but there was no reference to the test at all. These included surveys of student smoking behavior and hands-on activities where students went outside to measure the height of a water tower using different methods.

MSPAP activities were extended projects that used a variety of mathematical and nonmathematical concepts as well as manipulatives and multiple forms of representation. They provided one of the few breaks from the dominant American pattern of teaching using large sets of small, tightly structured problems focusing on single concepts or a limited number of operations (Stigler & Hiebert, 1997).

We emphasize the potential of MSPAP activities because the reality did not always match the promise. Opportunities for mathematical reasoning were often limited. In one lesson, for instance, students were given a data set on the number of sales for meals at different prices in an evening in a restaurant and asked to organize them and make inferences about how to price meals in the future. Coaching from the teacher suggested that the "correct answer" was to recommend pricing more meals in the range where the most meals had sold in the past, presumably because what had sold heavily in the past would sell a lot in the future. Alternative hypotheses—like the possibility that certain prices were not selected because there were few choices in that range on the menu, that some prices were heavily selected because one item was particularly popular, or that the most popular price range might be too low to generate profits—were not discussed. Thus, what could have been a real-world problem with lots of room for conjecture became highly structured and unrealistic. Yet numerous forces encouraged the teacher to take such an approach. The book giving the problem was itself highly structured, requiring students to create stem-and-leaf charts before doing a histogram rather than figuring out on their own how best to display the data. Moreover, the teacher knew from past experience that the test's construction of real-world applications was narrow.
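For readers unfamiliar with the chart type mentioned above, the sketch below shows how a stem-and-leaf display organizes a small data set. The meal prices are invented for illustration and are not the data from the observed lesson.

# Minimal stem-and-leaf sketch for a small set of meal prices.
# The prices are invented; they are not the lesson's actual data.
from collections import defaultdict

prices = [12, 14, 14, 17, 21, 23, 23, 23, 28, 31, 35]  # dollars

stems = defaultdict(list)
for p in sorted(prices):
    stems[p // 10].append(p % 10)  # stem = tens digit, leaf = ones digit

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
# Output:
# 1 | 2 4 4 7
# 2 | 1 3 3 3 8
# 3 | 1 5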

If test-generated teaching activity was flawed in Maryland, it was much less frequent in Maine. Only a third as many Maine teachers described responses to that state's test. There was some discussion of changing the order in which subjects were introduced. One teacher said she and her colleagues decided to stress geometry a little more after looking at test results. However, there was not the sense of urgency in adjusting to the test found in Maryland. One principal explained that

The curriculum isn't driven per se as a result of [the MEA]. There is no talk about putting more physical science in sixth and seventh grade because it's on the MEA. They will get it in ninth grade. The one thing we did change is that the required half year of Maine studies used to be done with ninth graders. Since we knew it would be on the MEA, we put it in seventh grade when we moved the ninth grade to the high school.

Other teachers talked about having students show their work more by writing out the steps they took in solving problems and labeling answers clearly. If anything, fewer changes in instructional practice seemed to result from the introduction of the MEA than from the MSPAP.

Instructional Practice

Further clarification of changes in teaching practice comes from examination of the classroom observations. We looked at three dimensions of instruction. The first—size of the problems used—was suggested by the attention to MSPAP activities in the Maryland interviews. We reasoned that large problems would provide more opportunities for conjecture and mathematical reasoning than smaller ones. Thus, they could facilitate the development of mathematical power (NCTM, 1989), even though changing the size of problems alone was not enough to change students' thinking. Large problems required extended effort by students; they often entailed several steps or activities organized around a common theme, problem situation, or math concepts. Lampert's (1990) several-day exploration of the patterns in exponentiated numbers is a large problem. In one large problem we observed, the teacher asked students to approximate the area of a circle six different ways—and gave them extra credit for inventing their own—to lead up to the concept of pi. While this problem gave students considerable room to reason about a mathematical issue, not all big problems did so. The problem where students used data to price meals was also a big problem. Small problems would include sets of many long division problems, fractions to reduce, or polynomials to factor, designed to help students master a procedure; these almost never provided opportunities for analytical reasoning.
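To suggest what the circle activity above might look like in computational form, here is a minimal sketch of two ways to approximate a circle's area so that area/r² approaches pi. The two methods are our own illustrations, not the six the observed teacher assigned.

# Two of many ways to approximate a circle's area, so that
# area / r**2 approaches pi. These methods are our own examples,
# not the ones assigned in the observed lesson.
import random

def area_by_sampling(trials=200_000, r=1.0):
    """Monte Carlo: share of random points in the bounding square
    that land inside the circle, times the square's area."""
    hits = 0
    for _ in range(trials):
        x, y = random.uniform(-r, r), random.uniform(-r, r)
        if x * x + y * y <= r * r:
            hits += 1
    return (2 * r) ** 2 * hits / trials

def area_by_grid(cells=400, r=1.0):
    """Count small grid squares whose centers fall inside the circle."""
    step = 2 * r / cells
    inside = sum(1 for i in range(cells) for j in range(cells)
                 if (-r + (i + 0.5) * step) ** 2
                 + (-r + (j + 0.5) * step) ** 2 <= r * r)
    return inside * step * step

print(area_by_sampling())  # ~3.14
print(area_by_grid())      # ~3.1416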

Based on our review of past writing on mathematics teaching in the U.S. and the interview data, we expected to find substantially more use of small problems than large ones. However, because Maryland teachers talked more about teaching to the test, we expected to find more large problems in that state. Table 4 bears out the first expectation—we saw more than twice as many small problems as large ones—but not the second. Large problems were equally prevalent in both states.

The next two dimensions were derived from the TIMSS video study. Stigler and Hiebert (1997) note that students can be asked to do three kinds of work. First, they can practice solving routine problems when they repeatedly use one or a few procedures. Second, they can apply procedures to situations that are new, either because they require some connection to the real world or because they require consideration of new mathematical concepts or situations. Finally, students can invent new procedures and analyze new situations. In these cases, they may have to generate proofs, theorems, or rules.

TABLE 4
Frequency of Selected Teaching Practices by State

Characteristics of            Maine             Maryland        Total
mathematics lessons       Number     %       Number     %       number

Problem size
  Large                     16      29         14      32         30
  Small                     39      71         30      68         69
  Total                     55     100         44     100         99

Student activity
  Practice                  43      78         37      84         80
  Nonpractice               12      22          7      16         19
  Total                     55     100         44     100         99

Teacher activity
  Tell procedure            49      89         42      95         91
  Develop concept            6      11          2       5          8
  Total                     55     100         44     100         99


The researchers report that American students spend, on average, 96% of their time practicing and less than 1% inventing new procedures. In contrast, Japanese students spend 44% of their time inventing procedures and 41% practicing.

Performance-based assessments probably provide few opportunities for inventing new procedures, but they should offer extensive occasions for applying known procedures in new settings. For that reason, we expected to see somewhat less practice in these states than has been noted throughout the United States. Moreover, because of the stronger stakes and more extensive professional development in Maryland, we expected to see the least practice in that state. Like Stigler and Hiebert, we found very few instances of inventing procedures, so we collapsed the last two categories in order to distinguish between work that did or did not involve practice. We found somewhat less practice than Stigler and Hiebert (80% vs. 96%) but no real difference between the two states. While encouraging, the first finding must be viewed carefully: we did not videotape classrooms as Stigler and Hiebert did, and we could not get access to their code books while doing our coding, so there may be some differences in how we interpreted the categories.

Nonpractice activities are supposed to give students more opportunities than drill and practice to make conjectures and reason to solutions. Some of these activities did. These included a lesson where students were asked to find the area of a circle several different ways and another where—after playing two probability games with dice and discussing what fairness meant in this context—students were given the opportunity to invent their own game. This was not always the case, however. One application problem attempted to link mathematics to social studies by asking students to write reports on creatures found at the seashore, including their life habits and annual harvests in terms of both number and volume. Students were expected to use a spreadsheet program to display these data. So much time was spent on the nonmathematical parts of the assignment that the observer wondered if he was in a social studies class, and some students did not present data because the state bureau of fisheries did not provide the necessary information.
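The dice activity above turns on whether a game's rule gives each player an equal chance. As a minimal sketch of how such fairness can be checked empirically, the game rule below is invented; it is not one of the two games played in the observed lesson.

# Simulation of an invented two-dice game to check fairness.
# Rule (ours, not the lesson's): player A wins when the sum of two
# dice is 6 or less; player B wins otherwise.
import random

def a_win_rate(trials=100_000):
    wins = sum(1 for _ in range(trials)
               if random.randint(1, 6) + random.randint(1, 6) <= 6)
    return wins / trials

print(f"Player A wins about {a_win_rate():.1%} of games")
# ~41.7% (exactly 15/36), so this rule is unfair to player A.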

Another limitation occurred when the problem worked on was a simulation of a state test and the teacher proceduralized the activity to make sure students would get correct answers quickly. In reviewing answers to a geometry and measurement question that required students to explain how they would measure the height of a flagpole from its shadow if they knew their own height and had a measurement of their own shadow, the teacher said, "I'm a math teacher, a man of few words. I'd draw pictures." He then drew pictures of the flagpole and its shadow and of the student and his shadow, labeling the flagpole "n," its shadow "a," the student "c," and the student's shadow "b." He said:

I'd write down to set up the proportion "n" is to "a" as "c" is to "b." That wouldn't get me all the points. It would get four of five. I'd have to talk about how you'd measure everything. If you write it all out with capital letters and periods, you should get five points.

In this example, learning how to handle the testing situation overwhelmed opportunities for reasoning about a mathematical situation.
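The similar-triangles reasoning compressed into the teacher's diagram can be written out explicitly. Using his labels (n for the flagpole's height, a for its shadow, c for the student's height, b for the student's shadow), the proportion solves as below; the illustrative numbers are ours, not the lesson's.

\[
\frac{n}{a} = \frac{c}{b}
\quad\Longrightarrow\quad
n = a \cdot \frac{c}{b},
\qquad \text{e.g., } n = 24 \cdot \frac{5}{4} = 30 \text{ feet}
\]

(assuming, for illustration, a 24-foot flagpole shadow, a 5-foot student, and a 4-foot student shadow)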

Third, Stigler and Hiebert (1997) note that teachers can simply state concepts and procedures or they can develop them. In the first case, they may simply tell how to work a type of problem and give some examples. In the second, the teacher might show, for instance, how the formula for the area of a triangle can be derived by combining two triangles to form a rectangle. Japanese teachers tend to develop a concept through a lecture, but American teachers seem more likely to have students develop an idea through engagement in some hands-on or multistep activity. Having students calculate the area of a circle different ways to understand the value of pi involves students in the work of developing a concept. In another instance, a teacher used a number line and pieces of plastic that represented parts of a whole to help students develop an understanding of fractions.
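The derivation alluded to above can be stated in one line: two copies of a triangle with base b and height h can be arranged to fill a b-by-h rectangle (directly for a right triangle, and after cutting and rearranging in general), so

\[
2 \cdot A_{\triangle} = b h
\quad\Longrightarrow\quad
A_{\triangle} = \tfrac{1}{2}\, b h .
\]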

Stigler and Hiebert (1997) found that 22% of their American teachers developed concepts, as opposed to 83% of the Japanese ones. Because tests are likely to influence student work more than teacher presentation—if only because students will be asked to complete test-like problems—we doubted that state assessments would influence teaching approaches as much as student activities. In fact, teachers developed concepts even less in these two states than in the TIMSS video study; 8% of the observations were of a concept being developed, with no substantial difference between states.

One pattern of behavior that undermined activities that could have involved developing a concept was a tendency to start with the answer rather than giving students a chance to develop or discover an idea. Thus, when discussing adjacent and opposite angles, one teacher started out by telling the class that the former equaled 180 degrees. Then, when he had students actually measure the angles, the students appeared more to confirm adult authority than to explore a mathematical phenomenon. This is quite different from the discovery that Lampert (1990) describes with her class on exponents.

This exploration of patterns of instruction suggests two observations about the teaching in Maine and Maryland. First, it was not particularly different from what is found throughout the country. If one combines these three dimensions, the typical American pattern of mathematics teaching should have students practicing procedures on small problems, with teachers telling about procedures rather than developing them. In fact, 60% of our classroom observations were of this type. We suspect that students applied what they knew to large problems somewhat more often than is typical in most American mathematics classrooms, but it is hard to know. Second, in contrast to the teacher reports on their responses to the test, there were no notable differences between states. Basic mathematics teaching was much the same in both Maine and Maryland.

Forces for Stability

Two factors contributed to teachers' heavy use of traditional teaching approaches: teachers' understandings of mathematics and the formal curriculum.

Teacher understandings. Teachers' understandings of mathematics and why it should be taught in school were often defined through the narrow focus of their districts' curricula and tracking systems. When asked a general question about the nature of mathematics, almost all teachers we interviewed defined the discipline in terms of "school math," the general math or algebra courses they taught, and the different tracks of students they encountered. None of the teachers said anything about what "real" mathematicians or other professionals might actually do with mathematics. The lack of correspondence between middle school teachers' views of mathematics and those held by mathematicians or math education experts probably reflects the different levels of training in the discipline—only 2 of the 25 teachers had undergraduate majors in mathematics—as well as the lack of opportunities for these groups to interact.

We asked teachers to choose between two definitions of mathematics: one as "a language to investigate patterns and relationships and to build models" and the second as "a system of procedures and rules to solve problems." Not all teachers saw a clear difference between these two options. Of those who did, more teachers in Maine chose the first option (seven to five), while more teachers in Maryland (six to three) defined mathematics as being concerned with rules and procedures. However, teachers from both states who said that mathematics involved recognizing patterns and relationships also said that this view reflected what they wanted to think mathematics is but indicated that they actually felt constrained to teach in ways that reflected the second definition.

Reinforcing this understanding of mathematics as a system of rules and procedures was the teachers' view about instructional strategy. Teachers were asked to choose between one teaching strategy where "the topic is broken down into smaller pieces, and I show them how to move through each step, and they have lots of opportunities to practice" and another where "I give them a larger problem, with mathematical content for them to figure out, and they can get help from me or other students." Teachers in both states overwhelmingly chose the first option (16 to 2). One Maine teacher said:

I do the pieces. They can put the pieces together if you give them the pieces at the middle school level. I've given them the whole thing, and they don't seem to know where to start. Once I give them a starting point and break it down, they can handle it better.

Three teachers in each state who said that students must first master the skills before getting more open-ended problems sometimes did big end-of-unit activities to offer students the chance to "extend" or "apply" their knowledge of mathematical procedures and formulae. A teacher in Maine doubted that many of his students could reassemble the "parts" into the "whole" on their own and acknowledged the conflict between his approach of teaching the computational steps and procedures and that of the state test (MEA): "I spend more time doing the smaller parts, breaking math into bite-sized chunks so it becomes more understood. The focus of the MEA is in understanding the larger."

We also asked teachers why mathematics should be taught in school. A majority in both states (10 of 14 in Maine and 8 of 11 in Maryland) talked about the need to teach students "practical applications" for mathematics, rather than teaching mathematics as a way for students to learn how to think logically and analytically. What is striking is their limited definition of practical applications, which generally focused on "shopkeeper math." Teachers' "practical applications" included balancing a checkbook, figuring out the amount of discount on a sale item, or correctly measuring ingredients for a cooking recipe. These tasks emphasize rather low levels of arithmetic skill rather than the more challenging analytical and reasoning skills required for using mathematics in engineering, finance, marketing, and statistics. While they are real-life tasks, such problems still primarily develop students' knowledge of computational procedures rather than their ability to reason more inductively and to see the connections between mathematics and other disciplines, particularly as they are encountered in the professional workplace.

Another important aspect of teachers' thinking was their view that students differed in their aptitude for and need for more advanced mathematics, an issue noted by other researchers (e.g., Grossman & Stodolsky, 1994). Teachers doubted that all students needed to learn more sophisticated math skills if they were not going to pursue a college education. Many teachers in both states said that the type of math skills and concepts they teach differs according to the track or ability levels of their students and that they do not believe students can handle more open-ended problem-solving and inductive reasoning until they have first mastered the basic computational skills they will need to employ. One Maryland teacher said:

Most of the kids I teach, they need to learn the procedures and rules in order to, you know, end up solving a larger picture. They basically can't find the patterns themselves and things like that. It has to be the other way around, I think.

Other teachers said that their academic expectations for students influence the mathematical topics and skills that they emphasize in different ability tracks. One offered: "actually that's the distinguishing factor between the accelerated group and any of the other groups. They get more theory."

Koretz et al. (1996) also found that Maryland teachers had low expectations for low-achieving students. There, teachers reported raising their expectations more for the highest group of students than for other groups since the implementation of MSPAP. Eighth-grade math teachers agreed that emphasis on high standards helped the top students but harmed lower ones. Although most teachers supported performance-based assessment, only 21% of the fifth- and eighth-grade teachers agreed that all students can learn to the same high levels, and 88% felt that the lowest MSPAP proficiency level is too high for some students.

Curriculum. Another factor reinforcing traditional teaching was the curriculum. The curricular pressure was stated most clearly by a teacher who said that his job is to "get them ready for algebra. That's my personal opinion. You know, not to get them ready to pass the MEAs. That's not what I'm here for." To some extent, a middle school mathematics teacher's worth is measured by his or her peers by how well prepared students are for the high school curriculum, especially algebra. During this period, the high school algebra curriculum did not appear to be changing in any significant fashion in a direction that reflected either the NCTM standards or the content of the state tests. Teachers in both states said that their algebra courses remained very traditional, with the emphasis on memorizing algorithms.

Eight mathematics teachers (three in Maryland and five in Maine) specifically mentioned that they felt pressure to get students through or ready for high school algebra. In addition, especially in Maryland (five vs. one), teachers felt a conflict between requirements to prepare students for the state test and for high school algebra, saying things like

There's not enough time to go through all the algebra and do the MSPAP tests.

This second class you observed... will lead into essentials of algebra next year, and there's very little probability in our algebra curriculum right now, so I have to make sure that they get it well enough to remember it to the end of next year when they see it on the MSPAP because it's always probability.

Generally, most teachers thought their district curricula were not well aligned with the state tests. They held that "I don't think there's a textbook out there that helps you prepare for the MEA or improve MEA scores" or "[t]here's been times when they asked questions that were not in our curriculum, so we had to develop units on that." A few, however, said, "It's the MSPAP with 40 suboutcomes that drives our curriculum." Especially in Maryland, individual teachers and groups within schools developed special units that were designed all or in part to get students ready for the state test. These units generally involved larger problems and sometimes had students apply their knowledge to real-world data.

One reason for less-than-perfect alignment between formal curricula and the state test is that districts revised curricula on a rotating schedule. After the introduction of the new state tests, some were just revising their mathematics curricula for the first time. Another reason, noted by both teachers and administrators in the two poorest districts—Frontier and Farm Town—was a lack of funding to buy new materials and books.

Policy and Change in Maryland

If the sources of inertia are similar in the two states, the factors promoting change seem somewhat different. Generally, the policy context favored change more in Maryland because in that state the test was more challenging, the stakes were higher, and there were more structured learning opportunities.

The tests. The MSPAP moved farther away from requiring students to practice using standard procedures on small problems than did the MEA. For instance, MEA items focus on single mathematical topics like number sense, measurement, or probability. MSPAP items exemplify large problems; they are longer and combine topics. One item assessed knowledge of arithmetic, estimation, and probability and statistics simultaneously. Others combine mathematics with other school topics like science, which the MEA never does.

The two tests also differ in the expectations they make of students beyond computation, which is assessed on both tests. The MEA's expectations are more constrained. MEA questions ask students to calculate the difference in travel costs for different size families and different rates of interest paid. Geometry questions require finding the area of a figure. Often tasks require straightforward arithmetic computation for which there is one obvious, best method, so students have little opportunity to compare different methods or invent strategies that are new to them. Responses are further constrained when the problem requires reporting the solution in a specific format. Scoring guides often indicate the correct graph or other representation and award the highest score for use of "correct solution strategy," although a few items allow students to choose an "appropriate solution strategy." Thus, these items do not move far from conventional practice-type items.
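To make the contrast concrete, consider a worked example of our own construction (an illustration of the constrained item type described above, not an actual MEA question): find the difference in one year's simple interest on $1,000 at 7% versus 5%. There is one obvious solution path,

\[ 1000 \times 0.07 - 1000 \times 0.05 = 70 - 50 = 20, \]

so the student reports $20 in the required format, with no occasion to compare methods or to justify a choice of strategy.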

In contrast to the MEA, MSPAP items and scoring guides appear to be more open ended, allowing more room for application and invention. Students are not usually told what pictorial representation to construct, nor are they provided with charts or other figures to complete. For each item, the context of a problem is described in a written introduction, and students must perform a series of tasks related to the initial information provided. They may construct a figure of their own choice, record data from observations or experiments they conduct, or explain and justify their solutions. There are repeated prompts to give written justifications for solutions.

One problem provides students with statistics about the percentage of people born in each month, the number of students in a school, the number of students taking the MSPAP statewide, and the percentage of people preferring certain colors. In a series of tasks, students must analyze the probability of sharing their birth month and color preferences with students in their school or state. In successive steps, students must explain their reasoning, explain a prediction about probability, describe how they would test their prediction and display data, and justify their chosen method for data communication. Overall, MSPAP problems allow more student choice over problem-solving approaches and consistently require students to communicate their reasoning verbally as well as mathematically.
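As a rough sketch of the reasoning such an item invites (our simplification: we assume all 12 birth months are equally likely, whereas the actual item supplies empirical percentages), the probability that at least one of n classmates shares a given student's birth month is

\[ P(\text{at least one shared month}) = 1 - \left(\tfrac{11}{12}\right)^{n}, \qquad \text{e.g., } P \approx 0.89 \text{ for } n = 25. \]

Setting up such a calculation is only the first step; the item then asks students to explain the prediction, design a test of it against data, and defend a choice of display, none of which a single-answer computational item requires.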

Stakes. We have already indicated that reconstitution makes for higher stakes in Maryland. Still, it was only a moderate threat. Chesapeake's test scores were in the middle of the state distribution. Frontier's were lower but not in the bottom quintile. Thus, many schools would have to be reconstituted before the state would get to any of the schools in this study. In fact, one principal said that reconstitution is "in the back of your mind. It's a reflection on the school. We're above the state level, and I don't think anyone in this county has to worry about it." When we probed further about this threat with another administrator, we were told, "For most schools, reconstitution is not perceived as a threat. But even if the fire isn't on your block, if it's three blocks away, you check your fire insurance."

Although reconstitution was improbable in these districts, it got educators' attention. They felt some pressure to accommodate to MSPAP. Representative comments from Maryland administrators are

[t]he bottom line is that we will meet those standards. We will prioritize our activities to make sure we do.

so together the [curriculum framework] and MSPAP are the Bible of what we have to teach. [italics added]

Teachers said, "There are certain percentages that on the MSPAP that you should have, like that 95% of your students should be getting da, da, da . . . and we are like 70-some.... We're still not up to where the state thinks we should be." Four Maryland teachers and five administrators made statements indicating that they felt compelled to respond to the MSPAP, and four teachers (some of the same ones) said they resented the state pressure. In addition, two teachers said the MSPAP had some influence, and no one denied that it had any influence on what they did.

By contrast, one Maine administrator we visited did not open his report on the district's MEA results until we came. Administrators and teachers said, "I don't place much emphasis on the MEA," "I haven't pursued any new professional development because of MEA," and "MEA is not what drives teachers. My perception is teachers don't place so much stock in the MEA that will move them to do something differently." Four teachers and one administrator said the MEA had some influence on their work, three teachers (again there was some overlap) saw the MEA as providing only symbolic influence, and three teachers and one administrator denied that it had any influence. Just as no one denied the influence of the state test in Maryland, no one felt compelled to respond to state demands in Maine.

Structured learning opportunities. The state-level learning opportunities for teachers differed very little between states. There appeared to be somewhat more in Maryland, but both our data and Koretz and colleagues' (1996) survey note that relatively few teachers in that state have access to those opportunities. The real difference between the states was in patterns of district in-service training. Only two Maine teachers described professional development activities related to the MEA, and these appeared to be relatively casual events. By contrast, Maryland teachers said, as one put it, that "[t]he math in-service days are usually about the MSPAP and how to improve the test scores." These programs seemed to differ from conventional in-service days in the extent to which they were used for some kind of collective or problem-solving work among teachers rather than for teachers being talked at by experts, as so commonly happens during in-service days (although that happened, too). One teacher explained:

Middle school math teachers. We probably have about, maybe 18 to 20, and what we're going to do at one of our future meetings next year is each teacher's going to bring three tasks that they use, and with 20 teachers, you know, in one day, you're going to be given 60 tasks that you can use as is [or] build on, you know.

In fact, six teachers mentioned that their in-service days or other scheduled meeting times during the year were useful for developing or exchanging MSPAP activities. Moreover, teachers were quite proud of the MSPAP activities they developed and seemed to feel that their work was enriched by this kind of activity. Other teachers mentioned that they used workshop days for strategizing about areas to focus on.

As laudable as the hands-on involvement offered by these local planning and development efforts was, it had two limitations. First, access to experiences that would help teachers move away from a view of mathematics as a set of fixed procedures to be practiced to one where students could explore big ideas through unstructured problems remained constrained. This was partly because teachers could rarely leave their districts to take advantage of professional development opportunities in other locations (although their curriculum coordinators may have had more). Second, although Maryland districts worked hard to orchestrate learning opportunities for teachers, a number of teachers refused to take advantage of them, in spite of the moderately high stakes in place. As one said

The more you stand still in education, the farther ahead you are. I've been through "think metrics," basic math, new math, grouping homogeneously. I've survived 25 years of all these things. We keep coming back to the same things. The method that works the best and that I'm most comfortable with is teaching the basic fundamentals, teaching times tables, teaching different measures.

Individual Learning in Maine

While policies seemed to be more focused on making changes linked to assessments in Maryland, teaching practices appeared to be quite similar in the two states. One reason for these similarities may be that the presence of any form of performance-based assessment contributes to modest change. The lack of concern about or interest in the MEA, combined with a certain resistance to "teaching to the test," suggests that that explanation alone is rather implausible. Equally important, we suspect that the opportunities to learn about new approaches to teaching mathematics were comparable but organized differently in the two states.

Districts in Maryland focused their professional development resources on helping teachers align their instruction with the MSPAP. In one district, Chesapeake, the superintendent said the board's goals for him were "derived from the Maryland Schools Performance Program and other local priorities. You know, attendance, drop out, curriculum, buildings. . . . It's primarily instructionally related." Responding to state testing goals was seen as a legal necessity. When asked how useful the state testing program was, the assistant superintendent for curriculum said, "'Useful' is not appropriate. I look at it as a mandate. That's the yardstick we are using to measure success." He went on to explain that "assessment is driving what we are doing," and that seemed to describe the district's professional development program in mathematics.

Maine districts were much less focused in their approach. In Factory Town, there was little if any district instructional leadership because the superintendent was concentrating on facilities improvement. The middle school principal said, "I will not teach to the test to enhance scores."

Not surprisingly, teachers used different teaching approaches, although they did follow a common curriculum. Two of the three math teachers used traditional recitation approaches featuring short problems from standard textbooks to drill on specific procedures. One explained that mathematics was like driving and "you have to follow the rules of the road in order to get and keep your driver's license." These were older teachers who did not seek out opportunities to improve their mathematics teaching. The third, who was younger, also defined mathematics as a set of procedures, but he avoided textbooks in favor of cooperative learning strategies with a wider variety of problem types. Children had more opportunity to explain problems to each other, to explore problems, and to find connections to other subjects and practical applications than did children in classes taught by the other two teachers. Some of the activities he used in class came from summer workshops he had attended.

Only the middle school language arts teachers in Factory Town shared an approach to teaching. Heavily influenced by whole-language instruction, these teachers integrated reading and writing instruction around such themes as Maine studies or folklore and avoided teaching grammar directly. These changes did not result from administrative initiative; in fact, teachers only got the release time to modify their approach after requesting resources from the school board themselves. While the principal said that this process writing approach raised MEA test scores in the language arts area, that was not the reason the change was made. The impetus came when the department head went back to a university in the area to get her master's degree.

Farm Town's school board indicated more concern with MEA scores than did Factory Town's. The board worried about the high school's scores, which not only were embarrassingly low but also were the subject of a newspaper series comparing Farm Town's high school with a neighboring one that scored considerably higher even though children in both schools had similar social backgrounds. Nevertheless, the district administration had no plan for improving MEA scores, and most principals did not make test scores a high priority. The primary administrative leadership for instructional improvement came from the Chapter 1 coordinator, who focused on language arts in the lower grades.

People talked about two pockets of excellence in Farm Town. The first was lower-grades language arts instruction, which was influenced by the Chapter 1 coordinator. The other was the seventh-grade team in the largest middle school, which, according to the principal, did "team teaching with integrated units on themes." One teacher in this group was extremely skilled at treating mathematics as a means to investigating patterns. She was exceptional at introducing tasks that allowed students to apply procedural knowledge to new situations. A few other teachers in the district tried to use puzzles and large, complex problems but could rarely use them to get to serious mathematical content.

As in Factory Town, however, there were few indications that the changes we observed were responses to the state testing program or (except in the lower grades) to any district initiative. Instead, they came in both districts when individual teachers sought out learning opportunities outside the district, both from local institutions of higher education and from statewide professional development workshops funded by the National Science Foundation's state systemic initiative or through Eisenhower funds. These federally funded programs were mentioned at about the same rate in both states.

Conclusion

One must be careful when generalizing from these data, which focus on one subject area, one grade level, and a small number of teachers in a few districts. Still, the closer look at the classroom that such in-depth analysis facilitates lends support to the view that the effects of state testing on teaching may be overrated by both advocates and opponents of such policies. Our research reinforces the position of Cuban (1993) and Stigler and Hiebert (1997) that rather strong forces in the educational system maintain an approach to teaching that emphasizes practicing on many small problems and shallow coverage of many topics.

Beyond this general conclusion, this work suggests two hypotheses that deserve exploration with larger samples of teachers. The first is that under some circumstances, performance-based assessment can change specific behaviors and procedures in the classroom more easily than general paradigms for teaching a subject. Teachers did report that they changed the order in which content areas were introduced to respond to the test, and they did so more often in Maryland, where the stakes were higher. Yet the instructional methods for teaching mathematics did not change greatly in either state. Even when teachers pointed to specific adjustments, like MSPAP activities, those changes accommodated deep-seated approaches to mathematics teaching in which teachers described procedures and had children practice them.

The second hypothesis is that state assessment policies do more to organize existing learning opportunities for teachers than to increase them. State policies are not the only forces promoting change in instructional practice, especially in mathematics. Colleges and universities, professional associations, and federally funded programs for teacher improvement are among the many sources of information about how to change practice. These are all important because our work, as well as past studies (Cohen, 1995), suggests that changing instructional paradigms requires considerable learning on the part of teachers. State assessments may contribute to learning opportunities by providing new models of student work. However, these models are most likely to change practice if teachers have opportunities to reflect on them in light of deeper understandings of what mathematics is and how to teach it. The question, then, is whether state testing policies, and the stakes attached to them, either increase the opportunities for such reflection or motivate people to participate in them.

Our observations suggest that high-stakes assessment may encourage some people to think about how to change practice, but its motivational effects are limited, and it does not increase opportunities to do so. In spite of all the central structuring, Maryland teachers who opposed this new approach to instruction could still ignore the professional development that was offered. In fact, the central structuring seemed more to "capture" available opportunities and provide a focus for how they are used than to increase what was available. Moreover, the tighter central organization of professional development opportunities in Maryland appeared to be a mixed blessing. On one hand, almost everything in Maryland focused on MSPAP preparation, so there was less waste of time on irrelevant issues like CPR; on the other, Maryland teachers had less opportunity to seek out personally relevant learning opportunities than their Maine colleagues. All of the orchestration of learning opportunities that we saw in Maryland may have been a symbolic response that demonstrated compliance with state expectations but provided little more access to knowledge than Maine's more individualistic approach.

Within the limits of the research design, this study provides some opportunity to reflect on two prescriptions for using assessments to improve instructional practice. Some advocates of improved assessments recommend having teachers mark student papers and otherwise work more closely with assessments as a way of improving their instructional knowledge (e.g., Murnane & Levy, 1996). We suspect that this strategy is more effective when states do portfolio assessment so that scorers can examine a wide variety of student work and teacher assignments. It may also be useful where the state assessments are very good problems for promoting conjecture and reasoning among students. The comparison of Maine's and Maryland's assessments suggests the need for some more careful analysis of the mathematical demands and instructional properties of the assessments that different states now use.

Many opponents of state assessments argue that high stakes encourage teachers to view raising test scores as an end in itself. While this is possible in Maryland, it is hard to argue that teachers worried greatly about possible sanctions. Without denying that high stakes can have deleterious effects on teaching, especially in urban schools where the pressure to produce quickly is highest, we suspect that the issue may be somewhat overrated.

More important, both fiddling with stakes and using assessments as teaching tools seem to miss the major point, which is that teachers lack the deep understanding of mathematics needed to teach in ways that help students learn to reason mathematically while calculating accurately. Teachers also need to better understand how students make sense of and learn from mathematical problems. Teachers' fundamental knowledge cannot be improved through changing sanctions, nor is it likely to improve through further consideration of state assessment problems and student work, except under special circumstances. Assessment policy will not get around the need to ensure that teachers have a solid foundation in the subjects they teach and a clear understanding of how to help children learn those subjects.

Notes

This article was presented at the annual conference of the American Educational Research Association in Chicago in March 1997. The work reported here was supported by a grant from the Spencer Foundation. The opinions expressed are those of the authors and not of the Foundation or of Rutgers University.

1. The first kind of district serves a town with a unified school administration and budget; often the town council has final budget approval. SADs have a single administration, budget, and school board, but member municipalities share costs based on a formula. Supervisory unions are entities where one central office provides administrative services to two or more towns and their separate boards, each of which keeps a separate budget.

2. Our discussion of state assessments is based on reviews of the MEAs from 1993-1994 through 1996-1997. Most items were released each year. For Maryland, however, we examined only two public release items—a math/science item from 1992 and a math item from 1994. We also reviewed some items administered to younger students. This limited availability reflects Maryland's generally tight policy of test security.

References

Ball, D. L., & Rundquist, S. S. (1993). Collaboration as a context for joining teacher learning with learning about teaching. In D. K. Cohen, M. W. McLaughlin, & J. Talbert (Eds.), Teaching for understanding: Challenges for policy and practice (pp. 13-42). San Francisco: Jossey-Bass.

Baron, J. B., & Wolf, D. P. (1996). Performance-based student assessment: Challenges and possibilities. Chicago: University of Chicago.

Cohen, D. K. (1995). What is the system in systemic reform? Educational Researcher, 24(9), 11-17.

Cohen, D. K., & Ball, D. L. (1990). Relations between policy and practice: A commentary. Educational Evaluation and Policy Analysis, 12(3), 249-256.

Corbett, H. D., & Wilson, B. L. (1991). Testing, reform, and rebellion. Norwood, NJ: Ablex.

Corcoran, T., & Goertz, M. (1995). Instructional capacity and high-performance standards. Educational Researcher, 24(9), 27-31.

Cuban, L. (1993). How teachers taught: Constancy and change in American classrooms, 1890-1980 (2nd ed.). New York: Teachers College Press.

Elmore, R. F. (1996). Getting to scale with successful educational practices. In S. H. Fuhrman & J. A. O'Day (Eds.), Rewards and reform: Creating educational incentives that work (pp. 294-329). San Francisco: Jossey-Bass.

Fennema, E., Carpenter, T. P., & Peterson, P. L. (1989). Learning mathematics with understanding: Cognitively guided instruction. Advances in Research on Teaching, 1, 195-221.

Firestone, W. A., & Pennell, J. R. (1997). Designing state-sponsored teacher networks: A comparison of two cases. American Educational Research Journal, 34(2), 237-266.

Fullan, M. (1991). The new meaning of educational change. New York: Teachers College Press.

Goodlad, J. I. (1984). A place called school. New York: McGraw-Hill.

Grossman, P. L., & Stodolsky, S. S. (1994). Considerations of content and the circumstances of secondary school teaching. In L. Darling-Hammond (Ed.), Review of research in education (Vol. 20, pp. 179-222). Washington, DC: American Educational Research Association.

Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996). Final report: Perceived effects of the Maryland School Performance Assessment Program (Project 3.2). Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.

Lampert, M. (1990). When the problem is not the question and the solution is not the answer: Mathematical knowing and teaching. American Educational Research Journal, 27(1), 29-63.

Madaus, G. F. (1988). The influence of testing on the curriculum. In L. Tanner (Ed.), Critical issues in curriculum: 87th yearbook of the National Society for the Study of Education (pp. 83-121). Chicago: University of Chicago.

Maine Department of Education. (1994, October 20). Preparing students for the new MEA. Augusta, ME: Author.

McDonnell, L. M. (1994). Assessment policy as persuasion and regulation. American Journal of Education, 102, 394-420.

McLaughlin, M. W. (1987). Learning from experience: Lessons from policy implementation. Educational Evaluation and Policy Analysis, 9(2), 171-178.

Murnane, R. J., & Levy, F. (1996). Teaching to new standards. In S. H. Fuhrman & J. A. O'Day (Eds.), Rewards and reform: Creating educational incentives that work (pp. 257-292). San Francisco: Jossey-Bass.

National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.

Patton, M. Q. (1990). Qualitative evaluation and research methods. Newbury Park, CA: Sage.

Popham, J. (1987). The merits of measurement driven instruction. Phi Delta Kappan, 68, 679-682.

Powell, A. G., Farrar, E., & Cohen, D. K. (1985). The shopping mall high school. Boston: Houghton Mifflin.

Radnor, H., & Shaw, K. (1995). Developing a collaborative approach to moderation. In H. Torrance (Ed.), Evaluating authentic assessment (pp. 124-143). Buckingham, England: Open University Press.

Rothman, R. (1995). Measuring up: Standards, assessment, and school reform. San Francisco: Jossey-Bass.

Schmidt, W. H., McKnight, C. C., & Raizen, S. A. (1996). A splintered vision: An investigation of U.S. science and mathematics education. East Lansing, MI: U.S. National Research Center for the Third International Mathematics and Science Study.

Shulman, L. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57(1), 1-22.

Smith, M. L. (1991). Put to the test: The effects of external testing on students. Educational Researcher, 20(5), 8-12.

Smith, M. L. (1996). Reforming schools by reforming assessment: Consequences of the Arizona Student Assessment Program. Tempe, AZ: Southwest Educational Policy Studies, Arizona State University.

Smith, M. S., & O'Day, J. (1991). Systemic school reform. In S. Fuhrman & B. Malen (Eds.), The politics of curriculum and testing (pp. 233-268). Philadelphia: Falmer Press.

Stigler, J. W., & Hiebert, J. (1997). Understanding and improving classroom mathematics instruction. Phi Delta Kappan, 79(1), 14-21.

Strauss, A., & Corbin, J. (1990). Basics of qualitative research: Grounded theory procedures and techniques. Newbury Park, CA: Sage.

Authors

WILLIAM A. FIRESTONE is director of the Center for Educational Policy Analysis and professor of educational policy at the Rutgers Graduate School of Education, 10 Seminary Place, New Brunswick, NJ 08903. He is interested in the effects of policies and administration on teaching practice.

DAVID MAYROWETZ is a doctoral candidate in the Department of Educational Theory, Policy, and Administration at Rutgers University and a research associate at the Center for Educational Policy Analysis at 10 Seminary Place, Room 229, New Brunswick, NJ 08903; e-mail [email protected]. His areas of interest are policy formation, inclusion of students with special needs, and assessment reform.

JANET C. FAIRMAN is a doctoral candidate in the interdisciplinary Ph.D. program in educational policy at Rutgers University and a research associate at the Center for Educational Policy Analysis; e-mail [email protected] or 315 South First Avenue, Highland Park, NJ 08904. She specializes in policy evaluation, alternative assessment, and math education.

Manuscript received June 24, 1997
Revision received January 25, 1998
Accepted February 7, 1998
