Assessing Knowledge Integration in Science: Construct, Measures, and Evidence


Ou Lydia Liu, University of California, Berkeley
Hee-Sun Lee, Tufts University
Carolyn Hofstetter, University of California, San Diego
Marcia C. Linn, University of California, Berkeley

In response to the demand for sound science assessments, this article presents the development of a latent construct called knowledge integration as an effective measure of science inquiry. Knowledge integration assessments ask students to link, distinguish, evaluate, and organize their ideas about complex scientific topics. The article focuses on assessment topics commonly taught in 6th- through 12th-grade classes. Items from both published standardized tests and previous knowledge integration research were examined in 6 subject-area tests. Results from Rasch partial credit analyses revealed that the tests exhibited satisfactory psychometric properties with respect to internal consistency, item fit, weighted likelihood estimates, discrimination, and differential item functioning. Compared with items coded using dichotomous scoring rubrics, those coded with the knowledge integration rubrics yielded significantly higher discrimination indexes. The knowledge integration assessment tasks, analyzed using knowledge integration scoring rubrics, demonstrate strong promise as effective measures of complex science reasoning in varied science domains.

Educational Assessment, 13:33–55, 2008
Copyright © Taylor & Francis Group, LLC
ISSN: 1062-7197 print / 1532-6977 online
DOI: 10.1080/10627190801968224

Correspondence should be addressed to Ou Lydia Liu, Educational Testing Service, 666 Rosedale Road MS R-16, Princeton, NJ 08541. E-mail: [email protected]


Although science inquiry is highly valued in science education, it has not been measured with the rigor it deserves. Traditionally, science assessments in national and international tests rely heavily on multiple-choice items that mostly require the recollection of scientific information. These assessments motivate teachers to focus on drill and memorization strategies, neglecting student critical thinking (Yeh, 2006). This problem has generated a call for more authentic assessments that emphasize complex thinking (National Research Council, 1996, 2000; Pellegrino, Chudowsky, & Glaser, 2001).

In this article we use a knowledge integration framework to design and score items to measure complex science reasoning, and we show how this approach can provide a more valid indicator of advanced scientific accomplishments. The construct of knowledge integration is defined as the process of adding, distinguishing, evaluating, and sorting out accounts of phenomena, situations, and abstractions in science (Linn, Davis, & Bell, 2004). It focuses on how learners grapple with multiple and conflicting ideas about scientific phenomena, how they develop new ideas, and how they sort out connections between new and existing ideas to reach more normative and coherent understanding (see Linn, 1995, 1996; Linn et al., 2004; Linn & Hsi, 2000). This view emphasizes the arguments students construct among their normative ideas and the ways students use their ideas to interpret new situations. Thus, the knowledge integration framework focuses on lifelong science learning.

NATIONAL AND INTERNATIONAL TESTS: NAEP, TIMSS, AND PISA

Many national and international tests, such as the National Assessment of Educational Progress (NAEP), the Trends in International Mathematics and Science Study (TIMSS), and the Program for International Student Assessment (PISA), have made efforts to develop items that measure complex scientific thinking. The effectiveness of these efforts has been thwarted for various reasons. First, some of the tests need to align with state and national curriculum standards that differ substantially. Second, the tests need to address a broad range of content within a limited testing period. Third, the tests need to use primarily multiple-choice items because of the cost of administration and scoring. These factors have precluded national and international tests from fully utilizing items that may provide in-depth measures of complex thinking.

For example, the most demanding NAEP items at the "Integrated specialized scientific information" level are intended to measure whether students can infer relationships using specific scientific knowledge, link scientific ideas to social issues, and combine ideas to draw a conclusion. Research has shown that only a few NAEP items actually measure this level of understanding, and few students actually achieve this level (National Center for Education Statistics, 2005). Furthermore, NAEP items are coded to measure inquiry separately from science content rather than constructed to measure the use of inquiry strategies in complex science domains. As a result, advanced NAEP items generally ask students to interpret data from tables and make inferences from experimental results. These items often measure logical reasoning instead of the intended complex reasoning about advanced science content. Items measuring basic and proficient performance tend to assess student ability to recall isolated facts and apply simplified principles instead of asking students to make inferences or combine several pieces of information. Because the average student performance in the United States does not reach the proficient performance level, students are primarily responding to items measuring basic understanding (National Center for Education Statistics, 2005). This approach may actually underestimate student progress because it emphasizes recall over using science information to interpret scientific events.

The TIMSS framework emphasizes student understanding of simple and complex information, problem solving, using the tools of science, and investigating the natural world. Approximately one fourth of the TIMSS science items require explanations; however, more than 85% of these items are scored as "complete" or "incomplete" based on whether the respondent provides the necessary information. As a result, although the TIMSS seeks to measure complex reasoning in science, the items primarily assess the recall of discrete information that can be measured dichotomously and/or fail to capture the nuances between inaccurate and normative arguments (Wang, 2006).

The PISA assessments define science literacy as "the capacity to use scientific knowledge, to identify questions and to draw evidence-based conclusions in order to understand and help make decisions about the natural world and the changes made to it through human capacity" (Organisation for Economic Co-operation and Development, 2003, p. 133). PISA focuses on items that require students to apply scientific knowledge in authentic contexts. Scoring of PISA science items, however, follows the TIMSS practice of using dichotomous scores or assigning partial credit based on recall of relevant information rather than on inferences connecting the information (i.e., answers that mention two points receive full credit; answers that mention one of the points receive partial credit; answers that mention neither of the two points receive no credit). This scoring scheme emphasizes the quantity of points mentioned rather than the quality of the justification.

In summary, both national and international assessments call for items that are sensitive to complex reasoning about relevant science content. Unfortunately, few items effectively measure complex science reasoning, and only a few students achieve this level of competence on the available items. This means that knowledge integration goals are neglected in scoring students' constructed responses. As a result, instructors are not motivated to help students learn how to link and connect ideas in science. This article describes one attempt to remedy this situation and help instructional designers as well as teachers emphasize complex science experiences that promote coherent and generative understanding of science.

PROMISING ASSESSMENT FRAMEWORKS: BIOKIDS, PADI, AND BEAR

Recent assessment research offers some creative directions for measuring complex reasoning. We highlight BioKids, the Principled Assessment Design for Inquiry (PADI) system, and work at the Berkeley Evaluation and Assessment Research (BEAR) Center.

The BioKids program (Songer & Gotwals, 2006) provides a good example of how assessments can be designed to promote inquiry-based curriculum and instruction. BioKids has developed curricular units that span 8 consecutive weeks to address high-poverty and urban students' underachievement in complex science, focusing on fourth through eighth graders. Typically, BioKids curriculum tasks and assessments ask students to make inferences about claims and explanations. The BioKids items are classified into three categories, contingent upon the level of scaffolding the students receive. The easiest items involve minimum content and are highly scaffolded. These items often provide the student with the claims and require the student to choose an explanation from multiple choices. The medium difficulty items ask students to choose a claim from two or more possible choices and explain their selection through an open-ended response. The most difficult items require students to develop both claims and explanations without any scaffolding. The multiple-choice items are coded dichotomously; open-ended explanations are coded at three levels: incorrect, partial, or adequate. The scoring of the explanatory responses is based on the amount of evidence provided in the explanation or the coordination between the data and content in the responses.

An additional noteworthy assessment system for science inquiry is the PADI. Guided by Geneva Haertel and Robert Mislevy, the PADI assessments are based on the Evidence Centered Design framework (Mislevy, 2003), which includes three essential components: student models, evidence models, and task models. This design is consistent with Pellegrino et al.'s (2001) assessment triangle of cognition, observation, and interpretation. More specifically, the student model defines the latent traits or constructs that the assessments are designed to measure. The evidence model determines the performance level parallel to the level of student achievement as defined in the student model. Finally, the task model specifies procedures for the development of assessment tasks that are aligned with the student and the evidence models. Thus, the Evidence Centered Design framework provides coherent guidelines for construct development, assessment design, and interpretation of results.


Similarly, the BEAR Center has also focused on the importance of consistency, coherence, and continuity in science assessments (Wilson & Sloane, 2000). The BEAR assessment system advanced the concept of Four Building Blocks, consisting of construct map, item design, outcome space, and measurement models. Construct maps describe the developmental stages of scientific understanding and align these stages with curriculum learning goals. The item design emphasizes that there are multiple forms of assessments that can effectively elicit students' knowledge and misconceptions. The outcome space describes and ranks the potential student responses in line with the construct map. Last, the measurement model is selected to provide adequate evidence of student performance as well as to draw meaningful inferences from data. Assessments based on the Four Building Blocks can detect strengths and weaknesses in student performance as well as recognize typical and atypical learning patterns commensurate with the students' increasing competence in science understanding.

Despite differences in science topics and targeted populations, these researchers fundamentally agree that science assessments should move away from merely measuring the discrete recall of science facts and begin to assess more dynamic situations to capture more coherent and nuanced explanations of science concepts.

THE TECHNOLOGY-ENHANCED LEARNING IN SCIENCE APPROACH: KNOWLEDGE INTEGRATION

The Technology Enhanced Learning in Science (TELS) Center is a National Science Foundation–funded multi-institutional research center that developed powerful inquiry-based science curriculum modules in an Internet-based environment (Linn et al., 2006). The TELS Center used a partnership design process (Linn & Holmes, 2006) involving curriculum designers, technology experts, classroom teachers, discipline experts, and researchers to develop 10 one-week-long curriculum modules addressing common middle school and high school science topics.

This knowledge integration view of science learning and instruction resonates with recent instructional research in science education (see Bransford, Brown, & Cocking, 1999; diSessa, 2000; Edelson, 2001; Linn et al., 2004; Quintana et al., 2004; Vosniadou & Ioannides, 1998). A core idea of knowledge integration is that to assess complex thinking involving the science inquiry process, items should pose a dilemma and require the respondents to generate an argument. The TELS assessments differ from those designed by other research programs in that the same knowledge integration construct is applied to a variety of science disciplines (e.g., genetic interpretation of simple inheritance, projectile motion, and energy exchange in chemical reactions) and consistently used for the development of curriculum activities, assessment items, and scoring rubrics.


OBJECTIVES OF THIS STUDY

To develop a coherent assessment framework, we use the assessment triangle framework articulated in Pellegrino et al. (2001). We demonstrate an example of how such a framework can be adapted and applied so it can effectively measure complex scientific inquiry and understanding. We describe the three core elements of the framework: (a) a knowledge integration construct, (b) the items that are aligned with the construct, and (c) the scoring rubric that reflects adequate levels of scientific understanding. Validity evidence for the knowledge integration assessment framework is provided based on results from item response modeling analyses. We hope that the promising findings will encourage teachers to practice inquiry instruction and assessments.

DESIGNING KNOWLEDGE INTEGRATION ITEMS AND SCORING RUBRICS

To measure knowledge integration, tests were created in six science courses: physical, life, and earth sciences for middle school students, and physics, biology, and chemistry for high school students. Some items were selected from existing standardized tests such as NAEP and TIMSS, whereas others were developed or used in related knowledge integration research. The tests included both multiple-choice and constructed response items. As all of the constructed response items require student explanations, we call these items explanation items hereafter.

Cognition Foundation: Knowledge Integration Construct

Knowledge integration research has shown that students develop ideas from experience, instruction, peer interaction, media depictions, and other scientific activities (Linn, 1995, 1996). In many cases, students do not consciously make efforts to compare, contrast, or connect the ideas to their existing views. The difficulty of making scientifically meaningful connections is well documented in the knowledge transfer literature (Bransford et al., 1999). Students are able to apply scientific knowledge only to the original context in which the knowledge was first acquired. This transfer problem can be effectively addressed if science instruction helps students to articulate their full array of ideas, add new ideas that are normative and compelling, develop criteria to distinguish between these ideas, and form a more coherent and consistent view of scientific phenomena (Linn & Eylon, 2006). Although many emerging instructional programs are addressing exactly these goals, current assessment instruments used for program evaluation have thwarted the success and sustainability of these programs (DeBarger, Quellmalz, Fried, & Fujii, 2006; Linn, Husic, Slotta, & Tinker, 2006).


The knowledge integration construct is a latent variable that progressively distinguishes five levels (see Table 1). The "no answer" and the "off task" responses are scored as 0 and 1, respectively. At the "no link" level (scored as 2), students use nonnormative ideas and nonnormative links. At the "partial link" level (scored as 3), relevant and normative ideas are generated but not elaborated enough to demonstrate how two ideas are connected, meaning that no specific mechanisms or relationships are articulated. At the "full link" level (scored as 4), students make at least one full link between two relevant and normative ideas. Two or more full links represent the "complex link" level (scored as 5). The highest level in the knowledge integration construct is "systemic," describing the point at which students are able to apply scientific ideas in a variety of contexts and understand the criteria for when and how to use an idea. Students at this level, for example, could explain the role of energy in various contexts, such as chemical reactions, electricity, and object trajectories.

TABLE 1
Knowledge Integration Construct and Item Examples

Systemic link (score: not applied). Students have a systemic understanding of science concepts. Response characteristics: compare similarities and differences between contexts, and apply concepts relevant to each context.

Complex link (score: 5). Students understand how more than two science concepts interact in a given context. Response characteristics: elaborate two or more scientifically valid links among ideas relevant to a given context.

Full link (score: 4). Students understand how two scientific concepts interact in a given context. Response characteristics: elaborate a scientifically valid link between two ideas relevant to a given context.

Partial link (score: 3). Students recognize potential connections between concepts but cannot elaborate the nature of the connections specific to a given context. Response characteristics: indicate a link between relevant ideas but do not fully elaborate the link in a given context.

No link (score: 2). Students have non-normative ideas and/or make scientifically invalid links in a given context. Response characteristics: make links between relevant and irrelevant ideas, or have non-normative ideas.

Off task (score: 1). Students make statements about non-scientific situations.

No answer (score: 0). Blank.

Note. The knowledge integration framework was translated into five levels of sophistication and was used to develop scoring rubrics for all the explanation items. Illustrative responses for the two items in Figure 1 are provided.

Observation: Design of Knowledge Integration Items

Two item types are used in TELS assessments: multiple choice and explanation. To illustrate differences between the two item formats, we compare three electricity items (Figure 1). The multiple-choice item asks students to select the choice that represents their view. Although multiple-choice items can contain a range of likely answers constructed from different lines of reasoning, it is not possible to completely eliminate false positives (students choosing a correct answer without knowing the content or performing the required task) and false negatives (students who know the content or task but misinterpret the question and choose an incorrect answer inadvertently).

FIGURE 1 Examples of multiple-choice and explanation items.

The other items in Figure 1 ask students to explain the reasoning for their multiple-choice answers. For example, the energy transformation item asks students to make a claim about the amount of light energy compared to total energy in a circuit and explain the reasoning behind their claim. Items with claims and explanations have two important epistemological benefits. First, they emphasize a goal of science: explaining a natural phenomenon (Hempel, 1966). Second, they engage learners in scientific argumentation, an important aspect of inquiry (Black & Wiliam, 1998; Bransford et al., 1999; Sandoval, 2003).

The TELS item selection was guided by the principle that the items should be able to elicit scientific ideas and tap varied contexts. The multiple-choice part serves as an anchor for the student to articulate reasons for choosing a particular answer in the explanation part. The choice of contexts can include abstract as well as everyday uses of the science. For example, the blown fuse context connects to students' everyday experience. All knowledge integration items should provide opportunities for students to link their ideas about scientific phenomena that vary in difficulty. Advantages of both multiple-choice and explanation items are utilized. Students who can successfully make connections among scientifically valid ideas usually succeed on multiple-choice questions. Those who fail multiple-choice items generally have difficulty developing arguments. At the same time, explanation items overcome the constraints of multiple-choice items in measuring sophisticated argumentation and capturing the knowledge integration goals of science courses.

Because the goal of TELS research relates to assessing powerful inquiry science curriculum modules embedded with scientific simulations, we also included items that ask students to use models in explanation items. As the knowledge integration framework emphasizes linking everyday experiences with scientific concepts, we used a wide range of scientific contexts in real life. The composition of the six resulting TELS assessments is shown in Table 2. The items were pilot tested with a small group of students for overall clarity and accessibility. A few local teachers reviewed the items for content coverage and alignment with their instruction. Information from all of these sources was used to revise the TELS assessments prior to administering them to the entire TELS student sample.

Interpretation: Scoring and Validity

This section explains how student performance data on the TELS assessments can provide information on student knowledge integration, and how the validity concerns are addressed.

Scoring rubrics. A four-step coding scheme (see Figure 2) was developed to create rubrics for explanation items based on the knowledge integration construct. First, discrete ideas are identified in the student's explanation and examined for their scientific relevance to the context represented in the item and scientific normality. If all or some of the ideas are identified as relevant and normative, then the existence of the link(s) between two ideas is determined. Each link is examined for scientific validity and sophistication. A full link is assigned when the connection between two relevant ideas is fully elaborated in a scientific way. An explanation with two or more full links is classified as complex link.
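To make the four-step logic concrete, the sketch below (in Python) assigns a knowledge integration level from rater judgments. It is a deliberately simplified illustration under stated assumptions: the Link structure, the relevance flags, and the example call are hypothetical stand-ins for decisions that trained raters make, not part of the TELS rubric materials.

```python
# Minimal sketch of the four-step knowledge integration (KI) scoring logic.
# All inputs are hypothetical rater judgments; in TELS these are human codes.
from dataclasses import dataclass
from typing import List

@dataclass
class Link:
    """A connection a student draws between two ideas in an explanation."""
    fully_elaborated: bool        # mechanism/relationship spelled out scientifically
    scientifically_valid: bool

def ki_score(ideas_relevant_and_normative: List[bool], links: List[Link],
             blank: bool = False, off_task: bool = False) -> int:
    """Return a knowledge integration level (0-5) for one explanation item."""
    if blank:
        return 0                  # no answer
    if off_task:
        return 1                  # off task
    if not any(ideas_relevant_and_normative):
        return 2                  # no link: non-normative ideas/links only
    full_links = [l for l in links if l.scientifically_valid and l.fully_elaborated]
    if len(full_links) >= 2:
        return 5                  # complex link
    if len(full_links) == 1:
        return 4                  # full link
    return 3                      # partial link: relevant ideas, link not elaborated

# Example: one valid but unelaborated link between two normative ideas -> level 3
print(ki_score([True, True], [Link(fully_elaborated=False, scientifically_valid=True)]))
```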

FIGURE 2 A four-step knowledge integration scoring scheme.

The knowledge integration rubric differs from the scoring scheme of NAEP or TIMSS in two ways. First, NAEP and TIMSS reward the completeness of information by giving full credit to responses that include all of the necessary information and only partial credit to those that omit some critical information. The knowledge integration rubric rewards the sophisticated use of student ideas by focusing on elaborated links rather than individual ideas. Second, NAEP and TIMSS dichotomize students' open-ended responses as either correct or incorrect, whereas TELS captures a full range of student responses: irrelevant, no, partial, full, and complex links. Typically, the correct category of the NAEP and TIMSS coding rubrics includes the partial, full, and complex link categories in the knowledge integration rubric. Figure 3 provides a comparison of the TIMSS and TELS scoring rubrics for Item 2 in Figure 1. The TIMSS approach distinguishes only between correct and incorrect answers, whereas the knowledge integration approach further differentiates the correct answers dependent on the sophistication of the student's reasoning. The different response patterns identified from the use of multiple scoring levels can provide more precise information for teachers to diagnose the learning progression of student reasoning with the science materials. In addition, for TIMSS and NAEP items, the number of score levels depends on the amount of information needed to solve the problem successfully and thus varies from item to item. In contrast, the knowledge integration construct represents the same cognitive levels across different science contexts.

FIGURE 3 A comparison between a Trends in International Mathematics and Science Study (TIMSS) rubric and the knowledge integration scoring rubric. Note. The rubric is for Item 2 in Figure 1. The Correct category in the TIMSS rubric is expanded to three knowledge integration levels depending upon the sophistication of students' ideas and links.

TABLE 2
TELS Benchmark Test Composition by Content Area

                Multiple Choice          Explanation
Test         ST (N)     RE (N)        ST (N)     RE (N)      Item Total (N)    Student Total (N)    α
MSPS           11          7             0          9              27                728           .75
MSLS            8          5             4          5              22                673           .76
MSES           10          5             2          5              22                741           .77
PHY            15          3             2          4              24                360           .80
CHEM            9          2             3          2              16                627           .75
BIO            10          2             5          4              21                588           .81
Total          63         24            16         29             132              3,717           .75

Note. The multiple-choice items were scored dichotomously, regardless of the source; the standardized explanation items were scored using the published rubric; and the research explanation items were scored using the TELS knowledge integration scoring rubric. TELS = Technology Enhanced Learning in Science; ST = standardized test items; RE = TELS research items; α = Cronbach's alpha; MSPS = middle school physical science; MSLS = middle school life science; MSES = middle school earth science; PHY = physics; CHEM = chemistry; BIO = biology.

This knowledge integration characterization of student developmental stages is similar to the construct-based modeling approach described by Wilson and Sloane (2000), who characterized developmental stages for each curricular objective or learning goal. Because the knowledge integration scoring rubric is applied to all the items, scores from different items are comparable. For example, a score of 3 suggests the same performance level across items, that is, the student is able to establish "partial links." Different knowledge integration scores on two items mean that the student can make a better connection among ideas in one item context than in the other. Therefore, the different distributions of student scores on items reflect the item difficulty level (e.g., easy items should have more high-scoring students).

Applying the rubric to specific items requires refining the rubric. In many cases, determining the levels of responses for individual items requires careful analysis of student responses. Students typically generate ideas about scientific phenomena that are not anticipated by the design team. Consequently, in comparison to traditional short-answer items, a much greater array of ideas about scientific phenomena emerges when knowledge integration assessments are administered. The ideas that students generate on their initial use of knowledge integration assessments also calibrate to the design of instruction. These ideas are likely to occur in science classes and require attention in the curriculum.

Validity. Informed by Messick (1989) and Wilson (2005), we implemented several validity procedures during the knowledge integration assessment development process. To ensure construct validity, items are designed to align with the knowledge integration construct to measure one or multiple levels of competences described by the construct. The development of the scoring rubrics reflects other aspects of construct validity. For instance, the scoring rubrics are designed to appropriately categorize and rank a full range of student responses. The knowledge integration scoring levels in each rubric are qualitatively distinct enough to capture different kinds of scientific reasoning, and at the same time, the scoring levels are comprehensive enough to capture all possible student responses. To ensure content validity, assessments are developed to measure the typical knowledge in the content domain and reviewed by content experts to guarantee adequate representation of scientific phenomena. In addition, the items are carefully examined by educational researchers to ensure that adequate opportunities are provided for students to demonstrate their knowledge integration inquiry thinking.


INVESTIGATING PSYCHOMETRIC PROPERTIES BASED ON ITEM RESPONSE MODELING

The TELS assessments were implemented at the end of the academic year. Teachers administered a subject-area test aligned with their instruction. All six TELS assessments were administered as paper-and-pencil tests. Researchers were present to assist with the administration of the assessments. Each test took approximately 20 to 30 min to complete. The multiple-choice items were scored dichotomously, and the explanation items were scored using the knowledge integration rubric. Two raters rated each explanation item, and the interrater reliability was 0.9 or higher. Table 2 includes the item composition for each test as well as the sample size and Cronbach's alpha (Cronbach, 1951). In total, more than 3,000 students participated in the TELS assessments. Each of the six tests has satisfactory reliabilities, ranging from .75 to .81.
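For readers who want to reproduce internal consistency estimates of this kind, a minimal sketch of Cronbach's alpha is shown below. The toy score matrix is hypothetical and is not TELS data.

```python
import numpy as np

def cronbach_alpha(scores) -> float:
    """Cronbach's alpha for a students x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy example with 4 students and 3 items (hypothetical data)
print(round(cronbach_alpha([[2, 3, 2], [4, 4, 5], [3, 3, 3], [5, 4, 4]]), 2))
```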

The Rasch Partial Credit Model (PCM)

Item response theory (IRT) techniques were used to provide validity and reliability evidence for the TELS assessments. Because the items have multiple response categories, with the difficulty levels between any two adjacent categories varying across the items, the Rasch PCM (Masters, 1982; Wright & Masters, 1982) was used for analyses. The probability of student n scoring x on item i can be expressed as

P(X_{ni} = x) = \frac{\exp\left[\sum_{j=0}^{x} (\theta_n - \delta_i - \tau_{ij})\right]}{\sum_{r=0}^{m_i} \exp\left[\sum_{j=0}^{r} (\theta_n - \delta_i - \tau_{ij})\right]}    (1)

where θ_n stands for student latent ability, δ_i refers to the item difficulty, and τ_ij (j = 0, 1, …, m_i) denotes the additional step parameter associated with each category score j of item i.
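The category probabilities in Equation 1 can be computed directly. The sketch below is an illustration of the partial credit model with hypothetical parameter values for a 0-5 item; it is not the ConQuest estimation routine.

```python
import numpy as np

def pcm_probabilities(theta: float, delta: float, tau: np.ndarray) -> np.ndarray:
    """Category probabilities under the Rasch partial credit model (Equation 1).

    theta: student ability; delta: item difficulty; tau: step parameters
    tau_0..tau_m for an item with maximum score m. Returns P(X = 0), ..., P(X = m).
    """
    steps = theta - delta - tau              # theta_n - delta_i - tau_ij for each step j
    numerators = np.exp(np.cumsum(steps))    # exp of the partial sums over j = 0..x
    # The shared j = 0 term cancels in the ratio, so the usual convention is preserved.
    return numerators / numerators.sum()

# Hypothetical 0-5 item: a student slightly above the item difficulty
probs = pcm_probabilities(theta=0.5, delta=0.0,
                          tau=np.array([0.0, -1.0, -0.5, 0.0, 0.5, 1.0]))
print(probs.round(3), probs.sum())           # probabilities sum to 1
```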

One of the advantages of IRT is that it can provide both item-level and test-level information, which is important for guiding future designs of TELS assessments. In addition, IRT provides a standard error of measurement for each distinct student response vector, making it a more accurate reflection of the true situation than simply assigning a single standard error of measurement to all students. Furthermore, the invariance of item parameters helps to establish a stable measurement scale for future usage, as TELS plans to develop longitudinal trajectories of student performance on the basis of multiyear testing. The psychometric properties of the TELS assessments were examined with respect to item fit, average weighted likelihood estimate (WLE), differential item functioning (DIF) for gender, Wright map, internal consistency, and discrimination index. The software program ConQuest was used to perform the analyses (Wu, Adams, & Wilson, 1998).

Quality Indicators

Item fit. Fit statistics can detect discrepancies between the partial credit model we adopted for our analysis and the observed student responses (Wright & Masters, 1982). For each item parameter, ConQuest provides a weighted fit mean square statistic for the agreement between the model and the data (Wu, Adams, & Wilson, 1998). Weighted fit mean square has an expected value of 1. Adams and Khoo (1996) indicated that a reasonable bound for this statistic is between .75 and 1.33. An item with a larger than 1 fit index is considered to have more variance in the data than the model expects and is often associated with a low discrimination index. An item with a less than 1 fit index indicates less variance than the model expects and is frequently associated with a high discrimination index.
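As an illustration, the weighted (infit) mean square can be computed as the ratio of squared residuals to model variances, summed over students. The sketch below uses a textbook form of the statistic with hypothetical item parameters and responses; ConQuest's exact implementation may differ in detail.

```python
import numpy as np

def infit_mean_square(responses, thetas, delta, tau):
    """Weighted (infit) mean square for one partial-credit item.

    responses: observed category scores (one per student); thetas: ability
    estimates; delta, tau: item parameters as in Equation 1. Values near 1
    indicate good model-data agreement (Adams and Khoo's bounds: .75 to 1.33).
    """
    tau = np.asarray(tau, dtype=float)
    categories = np.arange(len(tau))
    num, den = 0.0, 0.0
    for x, theta in zip(responses, thetas):
        cum = np.cumsum(theta - delta - tau)
        p = np.exp(cum) / np.exp(cum).sum()            # PCM category probabilities
        expected = (categories * p).sum()              # model-expected score
        variance = ((categories - expected) ** 2 * p).sum()
        num += (x - expected) ** 2                     # squared residual
        den += variance                                # model variance (the weight)
    return num / den

# Hypothetical responses of five students to one 0-5 item
print(round(infit_mean_square([2, 3, 3, 4, 5], [-0.5, 0.0, 0.3, 0.8, 1.5],
                              delta=0.2, tau=[0, -1, -0.5, 0, 0.5, 1]), 2))
```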

The fit statistics are summarized in Table 3. The majority of the items have an acceptable range of fit statistics between .75 and 1.33, except for the High School Physics TELS test. There are three misfit items in this test. One item has a larger than 1.33 fit statistic, indicating more observed than modeled variance. This item also fails to distinguish between high- and low-ability students as shown by its low discrimination index (.09). The average abilities indicated by the WLE values for students who scored 0 and 1 on this item are basically identical (–.77 vs. –.78). To ensure the quality of TELS assessments, such items with large misfit statistics are excluded from the item pool for future use.

TABLE 3
Fit Statistic and Average Weighted Likelihood Estimate

Test    Items (N)    Fit Statistic < .75 (%)    Fit Statistic .75–1.33 (%)    Fit Statistic > 1.33 (%)    Ordered WLE (%)    Reversed WLE (%)
MSPS       27                0.0                        100.0                         0.0                      96.3                3.7
MSLS       22                0.0                         95.5                         4.5                      91.0                9.0
MSES       22                0.0                        100.0                         0.0                     100.0                0.0
PHY        24               12.5                         83.3                         4.2                     100.0                0.0
CHEM       16                0.0                        100.0                         0.0                     100.0                0.0
BIO        21                0.0                        100.0                         0.0                      85.7               14.3

Note. Ordered WLE for an item suggests that the students with higher scores on the item also have higher scores on the entire test; otherwise, the item has reversed average WLE. WLE = weighted likelihood estimate; MSPS = middle school physical science; MSLS = middle school life science; MSES = middle school earth science; PHY = physics; CHEM = chemistry; BIO = biology.

The other two misfit items in the High School Physics TELS test have fit statistics less than .75. Both items have very high discrimination indices (.87 and .71), indicating that these items can discriminate well between the high-performing and low-performing students. This may have contributed to small fit statistics. In general, items with small fit statistics are less problematic than those with large fit statistics. They should, however, be examined with caution to see whether the target group of students falls into the high discriminating range. In the case of the TELS assessments, one important goal is to identify those students who have difficulties in synthesizing scientific ideas and to address this issue through revamped instruction. Therefore, it is important to develop items that can differentiate between different levels of understanding.

Average WLE. WLE is one kind of estimate used to indicate student ability (Warm, 1989). For each scoring category on each item, ConQuest calculates the average WLE for all of the students scored in the same category. The average WLE can be used to attest to the theoretical design of an item. The rationale is that students who score higher on an item should have a higher mean estimate than those who score lower on the item. If this criterion is met, then the item has ordered WLEs; otherwise, the item has reversed WLEs. In this study, a higher WLE indicates higher levels of knowledge integration ability.
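The ordered/reversed check amounts to a monotonicity test on category means. The sketch below assumes WLE ability estimates have already been obtained (e.g., from ConQuest output); the data shown are hypothetical.

```python
from collections import defaultdict

def wle_categories_ordered(item_scores, wle_estimates) -> bool:
    """Check whether average WLE ability increases with the item score category."""
    by_category = defaultdict(list)
    for score, wle in zip(item_scores, wle_estimates):
        by_category[score].append(wle)
    means = [sum(v) / len(v) for _, v in sorted(by_category.items())]
    return all(a <= b for a, b in zip(means, means[1:]))  # ordered if non-decreasing

# Hypothetical item: students scoring higher also have higher average WLEs
print(wle_categories_ordered([2, 2, 3, 3, 4, 5], [-0.8, -0.5, -0.1, 0.2, 0.7, 1.4]))
```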

Table 3 provides the number of items with ordered and reversed WLE categories for each test. Most of the items show ordered categories. The High School Biology TELS test has the largest percentage of reversed category items. There are two common reasons for the presence of reversed categories. One reason is that the score levels on the rubric cannot distinguish between students in corresponding performance levels. In this case, the scoring rubric needs to be modified. In addition, reversed category items appear when too few students choose a certain category, resulting in inaccurate estimation of the average WLE. In the case of the High School Biology test, the items showing reversed WLEs are difficult items, with too few students reaching the top two categories (scores 4 and 5). Despite this observation, we did not combine the top two score levels in the rubrics both because we expect more students to achieve the higher score level in the future, following our planned curriculum intervention, and because we want to maintain the same scoring scheme for all items.

DIF for gender. DIF is used in the TELS assessments to detect potential gender bias. An item displays gender DIF if the probability of a correct response to this item is significantly different for male and female participants at the same ability level. In the Rasch-type model framework, DIF is interpreted as the interaction between the item difficulty parameter and the gender identity indicator, after controlling for the mean ability. The DIF model is expressed as follows:

P(X_{ni} = x) = \frac{\exp\left[\sum_{j=0}^{x} (\theta_n - \delta_{ig} - \tau_{ij})\right]}{\sum_{r=0}^{m_i} \exp\left[\sum_{j=0}^{r} (\theta_n - \delta_{ig} - \tau_{ij})\right]}    (2)


where δ_ig (g = 0, 1) is the item difficulty parameter for students in each gender group (Holland & Wainer, 1993). If Δ_i (the difference between δ_i0 and δ_i1) stands for the ConQuest DIF estimate, the items are classified as showing negligible DIF when |2Δ_i| < 0.426, intermediate DIF when 0.426 ≤ |2Δ_i| < 0.638, and large DIF when |2Δ_i| ≥ 0.638.
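The classification rule just quoted translates directly into a small helper; the sketch below simply applies the stated cutoffs to a DIF estimate.

```python
def classify_dif(dif_estimate: float) -> str:
    """Classify DIF magnitude from a ConQuest-style gender DIF estimate.

    The doubled absolute estimate is compared against the 0.426 and 0.638
    cutoffs given in the text.
    """
    magnitude = abs(2 * dif_estimate)
    if magnitude < 0.426:
        return "negligible"
    if magnitude < 0.638:
        return "intermediate"
    return "large"

# Example: an estimated group difference of 0.15 logits counts as negligible DIF
print(classify_dif(0.15))
```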

Table 4 shows the items identified with a significant DIF estimate across the six TELS tests. Some of the DIF items favor male students, whereas others favor female students, after controlling for the general knowledge integration ability. According to the DIF magnitude classification rule just described, all of the items with a significant DIF estimate in the TELS assessments have negligible DIF, suggesting no systematic gender bias within the six TELS tests.

TABLE 4
Test Performance Comparison by Gender

                        Favoring Female                    Favoring Male
Test    Items (N)    Identified    Negligible          Identified    Negligible
MSPS       27             2             2                   4             4
MSLS       22             2             2                   2             2
MSES       22             0             0                   2             2
PHY        24             0             0                   3             3
CHEM       16             0             0                   0             0
BIO        21             2             2                   1             1

Note. Identified and Negligible give the number of identified DIF items (N) and the number of those classified as negligible (N). DIF = differential item functioning; MSPS = middle school physical science; MSLS = middle school life science; MSES = middle school earth science; PHY = physics; CHEM = chemistry; BIO = biology.

Wright maps. The Wright maps include the student ability distribution and the item difficulty distribution (see Figure 4). These two distributions are calibrated onto the same metric to allow direct comparisons. Each x represents a number of students, with the number depending on the sample size. On the map, the relative position of the student ability estimate and item difficulty estimate determines the student response on that item. For example, for a dichotomous item (see Footnote 1), if a student estimate and an item estimate are at the same position, it means that this student has a .5 probability of answering the item correctly. If the student position is above the item position, the probability for a correct answer is larger than .5, and if below, less than .5. In other words, when the student estimate and item estimate are close to each other, maximum information is obtained about the student's proficiency. If the item is too easy or too difficult, little information is gleaned. The item distribution should cover the span of the student estimates, so that the items can provide accurate measures for students of all performance levels.

Footnote 1: For polytomous items, item thresholds are available to determine the probability of obtaining a particular score.
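For a dichotomous item, the reading rule above follows from the Rasch model, under which the probability of a correct answer depends only on the difference between the student ability and item difficulty estimates. A brief sketch with illustrative numbers:

```python
import math

def p_correct(theta: float, delta: float) -> float:
    """Probability of a correct answer on a dichotomous Rasch item."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# At the same Wright map position the probability is .5; above the item it
# exceeds .5, below the item it falls short of .5.
print(p_correct(0.0, 0.0), p_correct(1.0, 0.0), p_correct(-1.0, 0.0))
```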

FIGURE 4 Wright maps for three Technology Enhanced Learning in Science (TELS) benchmark tests. Note. Each x represents a number of students depending on the sample size. For example, an x stands for 4.3 students for the High School Biology Test, 5.1 for the High School Chemistry Test, and 5.3 for the Middle School Physical Science Test.

Figure 4 illustrates three Wright maps for three TELS tests: High School Biology, Chemistry, and Middle School Physical Science. The Wright map for the High School Biology test demonstrates an effective coverage of the item distribution for the entire range of ability distribution. Most of these items, as well as the student ability estimates, have difficulty estimates from –0.50 to 1.5. In addition, some relatively difficult and easy items exist at the two tails of the item distribution, providing appropriate measures for students with high and low knowledge integration abilities. Because the coverage is adequate, no major change is needed for the High School Biology test in terms of item difficulties.

The Wright maps for the other two tests show less than ideal situations. For instance, on the High School Chemistry test, the student knowledge integration ability estimates range from about –2.0 to 2.5, which is much wider than the item difficulty range of about –0.80 to 2.3. This suggests that more easy items should be included to provide an accurate measure for students at the lower end of the ability continuum. In other words, the chemistry items are too difficult for low-achieving students and can easily result in a "floor effect." On the contrary, the items in the Middle School Physical Science test are relatively too easy for the students. The Wright map shows that the student ability estimates range from about –1.8 to 2.5, whereas the item difficulty estimates range from about –2.0 to 1.0. The distribution of the student ability estimates concentrates on the upper part of the item difficulty distribution, suggesting that not enough items are available to provide an accurate measure for students at the high performance level. A "ceiling effect" can occur for high-ability students when there are too few opportunities for those students to demonstrate what they know.

To address these issues informed by the Wright maps, the TELS assessment team, in collaboration with content experts, developed additional items to achieve a balanced design of item difficulties for each test.

Discrimination index. Based on the classical test theory framework (Allen & Yen, 2002), the discrimination index indicates the extent to which an item distinguishes high-scoring from low-scoring students. Items with discrimination indexes lower than 0.3 are considered to have poor discriminating power by the conventional rule.
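The article does not spell out which classical discrimination index was computed. One common choice is the corrected item-total correlation, sketched below with a hypothetical score matrix; a high-low group comparison would be an equally standard alternative.

```python
import numpy as np

def item_discrimination(scores, item: int) -> float:
    """Corrected item-total correlation, a common CTT discrimination index.

    Correlates an item's scores with the total score on the remaining items,
    so high values mean the item separates high and low scorers.
    """
    scores = np.asarray(scores, dtype=float)
    item_scores = scores[:, item]
    rest_total = scores.sum(axis=1) - item_scores   # total score excluding this item
    return float(np.corrcoef(item_scores, rest_total)[0, 1])

# Toy matrix of 6 students x 4 items (hypothetical data)
data = [[1, 0, 2, 1], [0, 0, 1, 0], [1, 1, 3, 2], [1, 1, 4, 3], [0, 1, 2, 1], [1, 1, 5, 3]]
print(round(item_discrimination(data, item=2), 2))
```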

Table 5 provides the discrimination index for the multiple-choice items and explanation items included in the six TELS tests. Because one of the purposes of this study is to evaluate how TELS knowledge integration research items perform compared to the standardized items, we provide the discrimination index information by item type and by item source.

The upper portion of Table 5 presents all of the items with less than 0.3 discrimination index. All of these are multiple-choice items except for one explanation item from the standardized tests. None of the TELS research explanation items have a similar low discrimination index. Among the multiple-choice items, most of the low discriminating items come from the published standardized tests. Only two TELS multiple-choice items show low discrimination. These two items are among the easiest items on the tests, and in turn, they fail to distinguish between high-performing and low-performing students.


The lower portion of Table 5 shows the percentage of high discriminating items across the six TELS tests. All of the research explanation items coded with the knowledge integration rubric have high discrimination indexes. Similarly, most of the explanation items from the standardized tests have high discrimination indexes. Therefore, it is clear that the explanation items with the knowledge integration scoring provide a more effective measure of student performance than the multiple-choice items in differentiating between high-performing and low-performing students, regardless of the item source.

TABLE 5
Discrimination Index for Multiple-Choice Items and Explanation Items in the Six TELS Tests

                 Standardized MC        Research MC          Standardized EP        Research EP
Test           Items (N)     %        Items (N)     %       Items (N)     %        Items (N)     %

% items with discrimination index below .3
MSPS              11         55           7          0          1          0           8          0
MSLS               8         50           5          0          4          0           5          0
MSES              10         40           5          0          2          0           5          0
PHY               15         27           3         33          2          0           4          0
CHEM               9         22           2         50          3          0           2          0
BIO               10         20           2          0          5         20           4          0

% items with discrimination index of .5 or higher
MSPS              11          0           7         29          1        100           8        100
MSLS               8          0           5         20          4        100           5        100
MSES              10         10           5         80          2        100           5        100
PHY               15         27           3         33          2        100           4        100
CHEM               9          0           2          0          3        100           2        100
BIO               10         10           2          0          5         80           4        100

Note. The percentage is calculated by the number of standardized multiple-choice items with discrimination index below 0.3 divided by the total number of standardized multiple-choice items. MC = multiple choice; EP = explanation items; MSPS = middle school physical science; MSLS = middle school life science; MSES = middle school earth science; PHY = physics; CHEM = chemistry; BIO = biology.

DISCUSSION

Assessments play an increasingly important role in science education: shaping instruction, providing diagnostic information for teachers, illustrating complex thinking, and promoting multiple problem-solving strategies. The various functions impose high demands on the creation and validation of assessments. Many researchers have called for a more coherent assessment framework. Still, the challenges to such a framework are manifold: lack of cognitive foundations, poorly defined constructs, disconnection between assessment items and the target construct, and oversimplified scoring rubrics, to name a few. Incoherence in any of these components would affect the interpretation of the results, implications of the findings, and the validation of the assessments. The significance of this study lies in its illustration of how classroom-based assessments can be well designed and implemented to measure complex scientific understanding while remaining closely aligned with instruction. TELS assessments use the knowledge integration framework to guide the development of assessment items, the design of scoring rubrics, and the interpretation of findings. Compared with traditional standardized tests, TELS assessments ask students to establish connections between new knowledge and existing knowledge, to integrate knowledge gained from various sources, and to warrant these ideas with evidence. The results show that knowledge integration, as an emerging construct, can guide the design of valid items covering a variety of topics. Therefore, we encourage researchers undertaking similar tasks to clearly define the construct(s) before starting the process of item development. A clearly specified construct can help address questions such as "Is this what I want to measure?" "Can the items adequately represent the construct?" and "How can I rank or categorize student responses based on the construct?"

This article also calls attention to the critical role of effective scoring rubrics. Standardized test designers are making efforts to tap into complex thinking by increasing the proportion of open-ended items (e.g., NAEP, TIMSS). These items, however, do not necessarily warrant more sensitive or effective assessments of science inquiry if the "right-or-wrong" scoring scheme has not been substantially changed. The way the knowledge integration scoring rubric is constructed captures an important cognitive aspect, allows comparisons across different items, and at the same time increases accessibility for classroom practitioners to utilize open-ended questions. Many times the use of constructed response items is limited by flawed scoring rubrics that do not differentiate between responses or that include too many scoring categories. The five-level knowledge integration rubric used in the TELS assessments is distinct enough to represent a progression of competency in linking scientific ideas. It is also concise enough not to exhaust teachers.

In addition, this study sheds light on the process of validating nonstandardized tests. Classroom-based assessments have been challenged for their inadequate technical qualities, and the technical concerns have to be addressed appropriately before these assessments can play a more central role in accountability systems. This research draws on the theoretical strengths of IRT and extends this method to new cognitive goals. IRT has been applied extensively in the development and validation of large-scale standardized tests. Still, its functions have not been fully explored to facilitate assessments that are more aligned with instruction. The IRT modeling techniques help evaluate the empirical quality of the assessment items and inform further item refinement. The six tests described in this article, designed to measure student knowledge integration ability, in general displayed satisfactory psychometric properties in terms of validity and reliability. Items with a large misfit and low discrimination index were excluded from future use. The evidence collected from multiple quality indicators (i.e., fit statistics, Wright map, WLE, DIF) helped the TELS assessment team determine the appropriate depth, scope, overall difficulty, and scoring scheme of the items. Overall, the research explanation items performed better than multiple-choice items in differentiating different levels of knowledge integration application. Therefore, to measure student science inquiry ability in applying knowledge integration, it is crucial to include items that reward complex thinking and are sensitive to instruction.

The TELS approach is shaped by the view that assessment stimulates learning, rather than just indicating learning (Black & Wiliam, 1998). Assessment reinforces the learning goals of building coherence across the repertoire of ideas rather than focusing on discrete facts. Furthermore, it cultivates an atmosphere of complex thinking in the classroom and spurs teachers to promote critical thinking in their classroom practice. The TELS assessments provide examples of ways to design tests that are cognitively sound and at the same time meet technical standards. The variety of the content topics covered by the TELS assessments allows the results to be generalized to other science education domains.

In summary, we take advantage of national and international assessment items but improve their psychometric properties by using a more complex scoring rubric. We show that items scored for knowledge integration can distinguish between typical and inquiry science curricula. We expect that such items would also encourage teachers to emphasize inquiry instruction.

CONCLUSION

In conclusion, assessments designed to measure knowledge integration have demonstrated satisfactory evidence for validity. They measure student ability to sort, link, distinguish, and evaluate ideas and evidence. Comparisons between multiple-choice items and explanation items indicate that explanation items are better at differentiating student science performance. The knowledge integration scoring rubrics, when applied to polytomously scored explanation items, serve well to improve the sensitivity of the items.

The item response models provide empirical evidence for the knowledge integration measures. The alignment between measurement tools and sound cognitive theories suggests a viable approach as we strive to design quality science assessments that can help all students learn.


ACKNOWLEDGMENTS

Ou Lydia Liu is currently at Educational Testing Service, Princeton, New Jersey.

Funding for this article was provided by the National Science Foundation under Grants ES1-0334199 and ES1-0455877.

REFERENCES

Adams, R. J., & Khoo, S. (1996). ACER Quest: The interactive test analysis system. Camberwell, Australia: ACER Press.

Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL: Waveland Press.

Black, P., & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80, 139–148.

Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). (1999). How people learn: Brain, mind, experience, and school. Washington, DC: National Research Council.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

DeBarger, A. H., Quellmalz, E., Fried, R., & Fujii, R. (2006, April). Examining validities of science inquiry assessments with cognitive analyses. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

diSessa, A. A. (2000). Changing minds: Computers, learning and literacy. Cambridge, MA: MIT Press.

Edelson, D. C. (2001). Learning-for-use: A framework for the design of technology-supported inquiry activities. Journal of Research in Science Teaching, 38, 355–385.

Hempel, C. G. (1966). Philosophy of natural science. Englewood Cliffs, NJ: Prentice-Hall.

Holland, P., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

IEA. (1995). Third International Mathematics and Science Study/TIMSS science items: Released set for Population 2 (seventh and eighth grades). Retrieved from http://timss.bc.edu/timss1995i/TIMSSPDF/BSItems.pdf

Linn, M. C. (1995). Designing computer learning environments for engineering and computer science: The scaffolded knowledge integration framework. Journal of Science Education and Technology, 4, 103–126.

Linn, M. C. (2006). The knowledge integration perspective on learning and instruction. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (pp. 243–264). New York: Cambridge University Press.

Linn, M. C., Davis, E. A., & Bell, P. (Eds.). (2004). Internet environments for science education. Mahwah, NJ: Lawrence Erlbaum Associates.

Linn, M. C., & Eylon, B. S. (2006). Science education: Integrating views of learning and instruction. In P. A. Alexander & P. H. Winne (Eds.), Handbook of educational psychology (2nd ed., pp. 511–544). Mahwah, NJ: Lawrence Erlbaum Associates.

Linn, M. C., & Holmes, J. (2006). Establishing a design process for technology enhanced learning (TELS report). Berkeley: University of California, Graduate School of Education.

Linn, M. C., & Hsi, S. (2000). Computers, teachers, peers: Science learning partners. Mahwah, NJ: Lawrence Erlbaum Associates.

Linn, M. C., Husic, F., Slotta, J., & Tinker, R. (2006). Technology enhanced learning in science (TELS): Research programs. Educational Technology, 46(3), 54–68.

Linn, M. C., Lee, H. S., Tinker, R., Husic, F., & Chiu, J. L. (2006). Teaching and assessing knowledge integration in science. Science, 313, 1049–1050.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–173.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.

Mislevy, R. J. (2003). A brief introduction to evidence-centered design (Technical report). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.

National Center for Education Statistics. (2005). National assessment of educational progress: The nation's report card. Washington, DC: Author.

National Research Council. (1996). National science education standards. Washington, DC: National Academy Press.

National Research Council. (2000). National science education standards. Washington, DC: National Academy Press.

Organisation for Economic Co-operation and Development. (2003). The PISA 2003 assessment framework. Paris: Author.

Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.). (2001). Knowing what students know: The science and design of educational assessment. Washington, DC: National Research Council.

Quintana, C., Reiser, B. J., Davis, E. A., Krajcik, J., Fretz, E., Golan, R. D., et al. (2004). A scaffolding design framework for software to support science inquiry. Journal of the Learning Sciences, 13, 337–386.

Sandoval, W. A. (2003). Conceptual and epistemic aspects of students' scientific explanations. Journal of the Learning Sciences, 12, 5–51.

Songer, N. B., & Gotwals, A. W. (2006, April). The development and evaluation of an assessment system to chronicle non-mainstream students' learning progressions associated with deep thinking in science. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Vosniadou, S., & Ioannides, C. (1998). From conceptual development to science education: A psychological point of view. International Journal of Science Education, 20, 1213–1230.

Wang, J. (2006). An empirical study of gender difference in the relationship between self-concept and mathematics achievement in a cross-cultural context. Educational Psychology, 26, 689–706.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–445.

Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum Associates.

Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181–208.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wu, M., Adams, R., & Wilson, M. (1998). ConQuest: Generalized item response modeling software. Melbourne, Australia: ACER Press.

Yeh, S. S. (2006). Tests worth teaching to: Constructing state-mandated tests that emphasize critical thinking. Educational Researcher, 30, 12–17.
