
Running Head: TEST DIRECTIONS AND FAMILIARITY

Test Directions as a Critical Component of Test Design:

Best Practices and the Impact of Examinee Characteristics

Joni M. Lakin

Auburn University

To appear in Educational Assessment

Author Note

Joni M. Lakin, Department of Educational Foundations, Leadership, and Technology, Auburn

University.

This work was supported by a fellowship from the American Psychological Foundation. The

author gratefully acknowledges the helpful comments of David Lohman, Kurt Geisinger, Brent

Bridgeman, Randy Bennett, and Dan Eignor on earlier drafts of this article.

Correspondence concerning this article should be addressed to Joni Lakin, Department of

Educational Foundations, Leadership, and Technology, Auburn University, Auburn, AL 36849. Email:

[email protected]


Abstract

The purpose of test directions is to familiarize examinees with a test so that they respond to items in the

manner intended. However, changes in educational measurement as well as the U.S. student population

present new challenges to test directions and increase the impact that differential familiarity could have

on the validity of test score interpretations. This article reviews the literature on best practices for the

development of test directions and documents differences in test familiarity for culturally and

linguistically diverse students that could be addressed with test directions and practice. The literature

indicates that choice of practice items and feedback are critical in the design of test directions and that

more extensive practice opportunities may be required to reduce group differences in test familiarity. As

increasingly complex and rich item formats are introduced in next-generation assessments, test directions

become a critical part of test design and validity.


Test Directions as a Critical Component of Test Design: Best Practices and the Impact of Examinee

Characteristics

Introducing new or unfamiliar item formats to examinees poses particular challenges to test

developers because of the need for students to quickly and accurately understand what the test items

require (Haladyna & Downing, 2004; Lazer, 2009). Item formats are defined not only by their response

modes (e.g., multiple choice, constructed response), but also by the types of demands they are expected to

place on examinees’ knowledge and cognition (e.g., basic arithmetic, scientific reasoning; Martinez,

1999). When introducing new item formats, the critical challenge is how best to introduce the task so that

all students are able to respond to the format as intended by the test developers.

With innovations in achievement testing, especially through computer-based testing and

performance assessment, novel item formats are increasingly being introduced to achievement tests to

increase construct representation and measure more complex and authentic thinking skills (e.g., National

Center for Education Statistics [NCES], 2012; see also Bennett, Persky, Weiss, & Jenkins, 2010; Fuchs et

al., 2000; Haladyna & Downing, 2004; Lazer, 2009; Parshall, Davey, & Pashley, 2000). In these new

formats, simulations, computer-based tools, and items with complex scoring procedures are used for both

teaching and assessment (Bennett et al., 2010; NCES, 2012). Furthermore, the role of innovative formats

(and thus initially unfamiliar formats) will only increase in the near future because computer-based

assessments with more complex and authentic test items form a central component of the general

assessment consortia formed in response to Race to the Top (PARCC and SMARTER Balanced

Assessment Consortia; Center for K–12 Assessment & Performance Management, 2011; Lissitz & Jiao,

2012).

When item formats are unfamiliar, test directions and practice become a crucial, but often

overlooked, component of test development and a threat to valid score interpretation (Haladyna &

Downing, 2004). Test directions serve to explain unfamiliar tasks to test-takers and are often the only

means that test developers have of diminishing potential unfamiliarity or practice effects. Further, if the

test developers suspect there will be preexisting group differences in test familiarity, test directions are


their last opportunity to address this validity threat as well. If groups differ in their understanding of the

test task after the directions have been completed, bias can be introduced to the test scores, resulting in

score differences between groups that do not reflect real group differences in ability (Clemans,

1971; Cronbach, 1984). As James (1953) advocated, test directions should “give all candidates the

minimum amount of coaching which will enable them to do themselves full justice on the test. All

candidates are then given a flying start and not just some of them” (p. 159). The goal of this article is to

outline some key considerations in developing high quality test directions. This is particularly important

in the context of a culturally and linguistically diverse society where test familiarity is more common

among some cultural and linguistic groups and creates a barrier to educational opportunities for

individuals from other groups (Kopriva, 2008). These disparities are discussed below.

Importance of Developing Good Directions and Practice Activities

Test directions are an instructional tool whose purpose is to teach examinees the sometimes

complex task of how to solve a series of test items. Depending on the students and test purpose, these

directions may include general test-taking tips (e.g., “Don’t spend too long on any one item!”;

Millman, Bishop, & Ebel, 1965). Some directions rely heavily on examples accompanied by oral or

written explanations of the task, while others consist only of brief oral or written introductory statements.

The effectiveness of a set of directions depends not only on the characteristics of those directions, but also

on the characteristics of the examinees, including their ability to understand the written or spoken

directions and familiarity with the test tasks. Good test directions provide enough information to

familiarize examinees with the basic rules and process of answering test items so that test items are

answered in the manner intended by the test creator (AERA, APA, & NCME, 1999; Clemans, 1971).

With increasingly complex computer-based or interactive tasks, efficiency in answering the questions and

understanding the task (with minimal guidance from a proctor) is also critical. The Standards (AERA et

al., 1999) further emphasize that examinees must have equal understanding of the goals of the test, basic

solution strategies, and scoring processes for test fairness to be assured. The Standards suggest the use of

practice materials and example items as part of the test directions to accomplish this goal.


On some tests, such as classroom tests, traditional achievement tests, and even aptitude tests for

college applicants, it is assumed that understanding the task is trivial and so the directions are perfunctory

(Scarr, 1981). In these cases, the directions can be brief because the examinees have been exposed to

many similar tests (e.g., vocabulary or mathematical computation tests) throughout their education. Some

researchers even complain that examinees entirely ignore the test directions and skip straight to the test

items as soon as they are able (James, 1953; LeFevre & Dixon, 1986). In these cases, low-quality

directions are expected to have a negligible impact on test scores.

On the other hand, it has long been acknowledged that practice can significantly affect

performance on novel tasks and that differences in familiarity can lead to construct-irrelevant differences

in scores (Haladyna & Downing, 2004; Jensen, 1980; Thorndike, 1922). Several researchers exploring the

effects of training on item performance confirmed that more complex and unfamiliar formats showed the

greatest gains in response to training (Bergman, 1980; Evans & Pike, 1973; Glutting & McDermott, 1989;

Kulik et al., 1984; Jensen, 1980).

The size of practice effects likely has practical importance, but diminishes quickly after a few

exposures to a particular test (Kulik et al., 1984; te Nijenhuis et al., 2001). Practice effects on traditional

formats in parallel forms of a test have been estimated to be .23 SD for a single practice test (meta-

analysis; Kulik, Kulik, & Bangert, 1984). Jensen (1980) estimated this effect to be a gain of .33 SD on IQ tests

based on the literature up to 1980. A noteworthy point is that practice gains differ significantly by

examinee type and test design.
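
To give a rough sense of scale, an illustrative conversion (assuming an IQ-type metric with a standard deviation of 15 points, an assumption not stated in these sources) is:

$$0.23\,SD \times 15 \approx 3.5 \text{ points}, \qquad 0.33\,SD \times 15 \approx 5 \text{ points},$$

so a single prior exposure to a similar test could plausibly move an examinee several points on such a scale.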

Thus, on tests where the formats are less likely to be familiar, the demands of understanding the

task may not be trivial and the directions must not be perfunctory. In the absence of a full practice

test, directions with practice examples are test developers’ best chance to decrease practice gains on novel

tests. With the introduction of complex item formats into achievement tests, such as laboratory

simulations (NCES, 2012), the novelty of the items may be construct-irrelevant and detrimental to measuring the

desired construct in early applications of the format (Ebel, 1965; Fuchs et al., 2000).


These effects may have critical impacts on test use. For example, Fuchs et al. (2000) found that a

brief orientation session on a new mathematics performance assessment (an unfamiliar assessment format

for their sample of students) greatly improved the scores students received (by .54 to .75 SD, depending

on subtest). They concluded that unfamiliarity with the testing situation, which was particularly acute for

performance assessments where both the format and grading scheme were unfamiliar, was an obstacle to

the students’ ability to demonstrate their mathematical knowledge on the test. In their case, the

unfamiliarity not only changed the construct measured, but also created spurious growth effects that could

confound interpretations of the test scores longitudinally.

Importance of Directions in Culturally and Linguistically Diverse Examinee Populations

Over time and with additional test experience, students acquire important problem-solving skills

and knowledge relevant to the format (not only general test-wiseness [Millman et al., 1965] but format-

specific knowledge and strategies). Students with greater experience with the item format may use more

sophisticated or practiced strategies in completing the items compared to students with less experience

(Norris, Leighton, & Phillips, 2004; Oakland, 1972).

When all students are affected by item format or test unfamiliarity, the influence of unfamiliarity

can be detected through traditional validity methods such as retesting and experimental designs (such as

those used by Fuchs et al., 2000). Unfamiliarity in this case can also be addressed with a standard set

of directions. However, differential familiarity may, in fact, be an even greater threat than general

unfamiliarity to valid score interpretation (Fairbairn, 2007; Haladyna & Downing, 2004; Maspons &

Llabre, 1985; Ortar, 1960; Scarr, 1994). When familiarity only affects a subset of students or even

individual students, detecting its effects (termed individual score validity [ISV] by Kingsbury & Wise,

2012) becomes more difficult because many validation methodologies rely on accurately identifying

affected examinee subgroups, such as race/ethnicity. For example, differential item functioning methods

can only detect group differences in the functioning of particular items, not differences that affect the entire test

or an unidentified subset of students. Because of this difficulty, test developers should actively anticipate

instances where unfamiliarity could be a threat to valid and fair use of test scores for all students.
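
To make this limitation concrete, a common DIF screen regresses a single item response on overall ability and group membership. The sketch below is a minimal illustration in Python using statsmodels, with hypothetical variable names; it is not a procedure drawn from the sources cited here. It shows why such a check requires both a known subgroup and an item-level contrast: unfamiliarity that depresses performance on every item, or that affects an unidentified subset of examinees, leaves no item-specific group effect to flag.

```python
import numpy as np
import statsmodels.api as sm

def uniform_dif_screen(item_correct, total_score, group):
    """Screen one item for uniform DIF: does group membership predict the
    item response after conditioning on overall ability (total score)?

    item_correct : 1-D array of 0/1 responses to the studied item
    total_score  : 1-D array of total test scores (the matching variable)
    group        : 1-D array of 0/1 codes for a subgroup known in advance
    """
    predictors = sm.add_constant(np.column_stack([total_score, group]))
    fit = sm.Logit(item_correct, predictors).fit(disp=0)
    # fit.params[2] is the group effect for THIS item. Unfamiliarity that
    # shifts performance on every item equally, or that affects an
    # unlabeled subset of students, produces no item-specific contrast here.
    return fit.params[2], fit.pvalues[2]
```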


When new item formats are introduced to a testing application, if some students are able to gain

additional practice or test exposure, these differences could potentially contribute bias to test scores

(Haladyna & Downing, 2004). These differences in ability to profit from test practice are pervasive. Prior

experience with tests and practice activities provided by parents or extracurricular programs can provide

some students with greater knowledge of typical tests and solution strategies (Hessels & Hamers, 1993).

When tests are high-stakes, these influences are especially likely and worrisome (Haladyna & Downing,

2004). But extracurricular practice is not the only concern. For students who are new to U.S. schools or

do not come from families who are engaged with their school system, access to knowledge about tests and

test preparation activities may be restricted, which can also lead to differential effects of item format novelty

(Budoff, Gimon, & Corman, 1974).

The critical assumption that an item format is equally familiar to all students (a foundation of

valid score interpretation; AERA et al., 1999; Haladyna & Downing, 2004) seems especially problematic

among culturally and linguistically diverse students whose experiences with testing may vary

considerably (Anastasi, 1981; Fairbairn & Fox, 2009; Peña & Quinn, 1997; Sattler, 2001). Both students

and parents can vary in their understanding of the purposes of testing, motivation, knowledge of the tests

used, and access to test preparation opportunities (Kopriva, 2008). These differences are exacerbated

when tests have high stakes attached to them (Reese, Balzano, Gallimore, & Goldenberg, 1995; Solano-

Flores, 2008; Walpole et al., 2005).

The U.S. school population has always been diverse, but the challenges for educational

assessment are increasing with rising numbers of Latino students and English learners (EL; Federal

Interagency Forum on Child and Family Statistics, 2011; Passel, Cohn, & Lopez, 2011). Latino and EL

students have a number of social and cultural differences—in addition to the linguistic differences—that

may contribute to the widely documented differences in their test performance compared to White and

non-EL students. These differences are described in the sections that follow.

Test-wiseness. There are apparent cultural differences between mainstream middle-class White

families and families of other racial/ethnic backgrounds and SES levels with respect to their testing-related


parenting behaviors and beliefs about testing purposes. Although test-taking skills classified as test-

wiseness are usually not addressed by test directions, they could be and might contribute to students’

ability to profit from test directions that focus on the particulars of a given format. Cultural differences

seem to affect students’ test-wiseness (general knowledge and strategies for testing). Greenfield, Quiroz,

and Raeff (2000) and Peña and Quinn (1997) found differences in assessment-related behaviors for Latino

mother-child pairs compared to White mother-child pairs. Peña and Quinn argued that differences in the

interactions of mothers with their children, specifically the types of questions that mothers asked during

playtime, related to differences in the way that the children worked on picture-labeling (vocabulary) vs.

information-seeking (knowledge) tests. They observed that the Latino children had more difficulty

understanding vocabulary tasks than they did understanding information-seeking tasks that elicited

descriptions, functions, and explanations and argued that the latter task more closely resembled the way

that the Latino mothers interacted with their children.

At school entry, test-wiseness continues to differentially affect students from culturally and

linguistically diverse backgrounds. Dreisbach and Keogh (1982) found that test-wiseness training, devoid

of test-relevant content, improved the performance of Spanish-speaking children from low SES

backgrounds more than would be expected from the general population (d=.46 between trained and

untrained students; effect size calculated from their reported data). Even among college students,

differences may persist. Maspons and Llabre (1985) found that college-aged Latino students in the

Miami-Dade area benefited significantly when trained in test-taking skills necessary for U.S.

testing systems. In their experimental design, they found that the correlation between their knowledge test

and later performance in the class jumped from r=.39 for the control group to r=.67 for the

experimental group that received training in test-taking skills.
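
For readers who wish to reproduce standardized effects such as the d = .46 above from published descriptive statistics, the usual formula for a standardized mean difference between trained and untrained groups (offered as a general reminder, not as the exact computation applied to Dreisbach and Keogh's data) is:

$$ d = \frac{\bar{X}_{\text{trained}} - \bar{X}_{\text{untrained}}}{s_{\text{pooled}}}, \qquad s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}. $$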

Differences in cultural and educational backgrounds appear to affect EL students’ engagement

with tests. Budoff et al. (1974) argued that EL students “differ in familiarity and experience with

particular tasks, have a negative expectancy of success in test-taking, and are less effective in

spontaneously developing strategies appropriate to solving the often strange problems on a test” (p. 235).


The latter point makes high quality test directions especially likely to help EL students if they support

strategy development. Budoff et al. confirmed that their learning potential post-test (which allowed more

opportunities for feedback on test strategies) created substantial practice gains for EL students: d=0.68 for

primary students and d=0.40 for intermediate grade students. They also found that the post-tests

correlated more strongly with achievement for EL students compared to their pre-test scores and that

indicators of SES and English proficiency were correlated with pre-test but not post-test performance.

Perceptions. There are also important differences in perceptions of tests, as suggested by

Budoff et al. (1974). Students and parents new to the U.S. school system may differ in their understanding

of the purposes of testing and the basic strategies of completing tests (Kopriva, 2008; Reese et al., 1995).

For example, Walpole et al. (2005) found that students from low-SES backgrounds enrolled in low-

performing schools lacked information about the availability of test preparation materials and vastly

overestimated how much improvement could be expected from re-testing. Such differences are believed

to impact students’ motivation to perform well on these tasks and to exert effort (Kopriva, 2008; Schwarz,

1971). When the tests have high stakes (such as for gifted and talented identification or college

admissions), there are often large differences between White and minority students in the perceived

importance of preparing at home for school-administered tests (Briggs, 2001; Chrispeels & Rivero, 2001;

Kopriva, 2008; Solano-Flores, 2008; Walpole et al., 2005). Better advance information about tests and

their uses should therefore be targeted to these parents.

Summary. Although test unfamiliarity can affect all students when test formats are novel, the

literature strongly suggests that there is a possibility of differential familiarity for any test format when

there is diversity in the student population. As a result, test developers should take care with the

development of test directions and practice activities to address any potential unfamiliarity effects that

may influence test performance.

Best Practice in Designing Test Directions and Practice Opportunities

Several older measurement texts tout the importance of good directions to valid test use and offer

some advice on the features of good directions (e.g., Clemans, 1971; Cronbach, 1984; Traxler, 1951). The


attention to the construction of test directions appears to have diminished over time, as exhibited by the

content coverage of successive editions of Educational Measurement: Lindquist (1951), Thorndike

(1971), Linn (1989), and Brennan (2006a). The first two editions each contained a full chapter on test

administration that offered extensive advice on creating the test directions (Clemans, 1971; Traxler,

1951), whereas the latter two editions restricted their discussion to issues of standardization with little

specific advice on developing test directions (e.g., Bond, 1989; Cohen & Wollack, 2006; Millman &

Greene, 1989).

Even in these early guides, specific guidance for how to create directions was uncommon. To

some extent, this is because the best directions differ by test and item types. However, the research also

lacks systematic efforts to discover exactly what features make for good directions for a given format,

especially in recent years when test item formats have been evolving considerably.

Given the infrequent inclusion of guidance on the development of directions in recent

measurement texts, test directions may appear easy to construct. However, when actually trying to write

directions, it can be quite difficult to explain an unfamiliar task clearly and succinctly so that all

examinees fully understand. This is especially true when writing directions for children because even

basic concepts must be minimized (Kaufman, 1978). Children have less experience with tests and need

simple language and adequate practice. Furthermore, when tests are less familiar and more complex,

students of any age may depend more heavily on the test directions to understand the task. From a review

of the literature, the following guidelines for creating directions and practice items are suggested. These

guidelines are then expanded upon in the following sections.

1. Treat directions as an instructional activity and build in flexibility for additional practice or

explanations for students who do not immediately understand.

2. Use the simplest language needed, but no simpler.

3. Use more practice items with greater variety, dependent on examinee characteristics.

4. Provide elaborative feedback to make examples efficient and effective for all.


5. Gather evidence about the quality of the directions to support the intended validity inferences

for students who vary in test familiarity.

Treat Directions as an Instructional Activity

We know from the extensive work on practice effects (Jensen, 1980; Kulik et al., 1984;

Thorndike, 1922) that examinees improve their test scores when repeatedly administered similar tests.

What is the examinee learning that results in practice effects on later examinations even when the items

are not the same or no feedback is given? For examinees first encountering novel item formats, part of the

challenge is that such items require examinees to assemble a strategy and rely more

heavily on metacognitive monitoring skills rather than well-practiced and automatic strategies

(Marshalek, Lohman, & Snow, 1983; Prins, Veenman, & Elshout, 2006). The items may also require

specific knowledge about the response format or rubric (Fuchs et al., 2000). Test directions with a number

of practice items (or a separate practice opportunity) may be sufficient to help students overcome this

initial learning phase.

Depending on the age of the examinees and how much they vary in their experience with the test

format (or tests in general), the detail of the directions and practice may need to be adapted at the student

or class level so that students are neither bored nor overwhelmed by the length and detail of the directions

or practice. This would, in fact, be a sharp departure from typical standardized test directions.

The traditional understanding of standardization is that it refers to the behaviors of the test

administrator and the testing environment rather than to the readiness of the examinees to engage in the

test. In this behaviorist framework, fairness is achieved when every student hears the same directions and

sees the same practice items (cf. Cronbach, 1970; Geisinger, 1998; Norris et al., 2004). However, when

students vary greatly in their familiarity with testing, the typical administration procedures and test

directions provided may fail to homogenize the test experience of examinees. For complex and unfamiliar

formats, test developers would benefit from broadening their definition of standardization to include

homogeneity of the examinee experience and change the focus of directions from standardization to

effective instruction.


To accomplish this change, mainstream testing might benefit from incorporating elements of the

dynamic assessment and learning potential assessment tradition to adapt directions and practice

opportunities. Following these traditions, test direction activities would ensure that every student is on

equal standing with respect to understanding the task, motivation to do their best, and familiarity and test-

wiseness related to the format (Budoff, 1987; Caffrey, Fuchs, & Fuchs, 2008; Kaufman, 2010; Ortar,

1972; Rand & Kaniel, 1987; Tzurkiel & Haywood, 1992). In some circles, this is called the “Betty Hagen

rule”1: “Do not start the test until you are sure every child understands what he or she is supposed to do.”

Like Hagen, these interactive testing traditions recognize that fairness comes from adaptation to

students rather than standardization of administration and that, in fact, equal treatment can produce

inequality (Brennan, 2006b). Hattie, Jaeger, and Bond (1999) capture this tension between standardization

and adaptation as the key to fairness in item formats and administration:

“The manner under which tests are administered has become a persistent methodological

question, primarily because of the tensions between the desire to provide standardized conditions

under which the test or measure is administered (to ensure more comparability of scores) and the

desire for representativeness or authenticity in the testing condition (to ensure more meaning in

inferring from the score).” (p. 395)

When students vary widely in their familiarity with tests and their readiness to show their best work,

fairness is likely best achieved through increased flexibility in test administration (e.g., additional practice

for some, changes in directions, comprehension checks) rather than rigid adherence to standardized

procedures.

In treating test directions as an instructional activity, the directions must be adapted to examinee

characteristics and should address student motivation. Young students may benefit especially from basic test-

wiseness training, including self-pacing and tactics to reduce impulsive responding. In a study of test

1 Hagen was the co-author of several forms of the Cognitive Abilities Test (CogAT), among many other tests, and was well-versed in designing tests for children. This is hearsay wisdom from her CogAT Form 6 co-author, David Lohman.


directions for young students, Lakin (2010) found that multiple tactics and teacher interventions were

needed within the directions to slow students down and to encourage them to consider all of the response

options before selecting an answer. Even young students seem quick to assume they understand the task,

consistent with the minimal effect of written directions on older students in an experimental design

(LeFevre & Dixon, 1986).

For abstract item formats, such as those used on ability tests and critical thinking tasks, the

directions should also try to introduce a basic strategy for solving the early items so that students feel

successful and can adapt the strategy as items increase in difficulty (Budoff et al., 1974). Familiarity with

successful strategies is certainly an advantage for more experienced test-takers (Byrnes & Takahira, 1993;

Marshalek et al., 1983; Prins, Veenman, & Elshout, 2006), but may be minimized by teaching a basic

strategy. In this way, the first few test items (also called “teaching items”) should start easy and increase

in difficulty so as to lead the child to develop appropriate strategies by gradually introducing more

complex rules (Cronbach, 1984). For introducing complex simulations or tools, test developers should

similarly plan to introduce tools as needed and preferably one at a time so that the examinee is not

overwhelmed with new cognitive demands (Gee, 2003).

Use the Simplest Language Needed, But No Simpler

In writing test directions, the primary goal must be clarity in conveying information to examinees.

Directions that are too long lead to inattention and directions that are too vague or brief will not serve the

intended purpose. There is thus a fine line to walk when every student will receive the same directions.

The trend in abilities assessment seems to be towards using the shortest instructions possible (even to the

somewhat extreme “nonverbal directions” procedure; Lakin, 2010; McCallum et al., 2001) with the idea

that less language is more culture-fair. However, this may not be the best method for closing gaps in

understanding between culturally and linguistically diverse students. In fact, shorter directions—where

repetitions, explanations, and logical conjunctions are dropped—may actually increase the burden of

listening comprehension for examinees (Baker, Atwood, & Duffy, 1988; Davison & Kantor, 1982).


Furthermore, even directions that seem short and clear to the adult test developer may contain an

undesirable number of basic concepts when testing young students. In traditional test administration,

basic concepts are essential to explaining the task, especially when a practice item is presented. Telling

the examinees to “look at the top row of pictures” or asking, “What happens next?” both involve basic

concepts that young students may not know. As a result, the directions for most widely used intelligence

tests contain many more basic concepts than one would expect or desire on a test for young students

(Flanagan, Kaminer, Alfonso, & Rader, 1995; Kaufman, 1978).

From their work in cross-cultural testing, van de Vijver and Poortinga (1992) argue that

directions should “rely minimally on explanations in which the correct understanding of the verbalization

of a key idea is essential” (p. 20). In their view, visual examples and verbal descriptions must work

together with appropriate verbal repetition and feedback to support comprehension so that a single

moment of inattention or misunderstanding does not derail an examinee’s performance. Indeed, rather

than arguing for short directions, van de Vijver and Poortinga (1992) argue that increasing the amount of

instruction and practice can actually reduce the cultural loading of a test. More language decreases how

critical it is that students quickly grasp the rules of the task from a few examples. A liberal number of

examples and exercises, they say, can overcome relatively small group differences in familiarity.

Biesheuvel (1972) likewise argued that familiarization with test demands is essential before beginning a

test in a cross-cultural context: practitioners often fail “to recognize the desirability of prolonged pre-test

practice in dealing with the material which constitutes the content of the test and of introducing a learning

element in the test itself” (Biesheuvel, 1972, p. 48).

The best compromise therefore seems to be appropriately detailed explanations that are firmly

concretized by examples and visuals. Visuals are a way of reducing language load without reducing

information (Mayer, 2001). Another solution is to rely on computer animations: one important benefit of

computerized tests is the potential for directions to show rather than tell test-takers where to direct their

attention. This approach reduces the amount of explanation required and the corresponding language load.


Use More Practice Items

In treating the directions as an instructional activity, the test developer should pay special

attention to choosing the best practice items and tactics to teach the students how to respond to the task.

For example, the first few items should not provide opportunities to succeed using inappropriate

strategies. Misconceptions about the task introduced by the test directions and early test items are a

serious threat to validity. For example, Lohman and Al-Mahrzi (2003) found that the use of certain item

distractors (that is, the incorrect options to an item) unintentionally reinforced a misunderstanding of the

task that was only discovered through an analysis based on personal standard errors of measurement.

Likewise, Kaufman (2010) reported that the content of early items on the WISC Similarities tests

unintentionally steered students towards giving lower-point answers. Thus, it is clear that test directions,

practice items, and early test items must work together to prevent misunderstandings.

The choice and number of practice items is no small consideration in the design of good test

directions and practice. In fact, research shows that test developers should design directions and examples

keeping in mind that students will rely more heavily on the examples than on the verbal descriptions to get a

sense of the task. LeFevre and Dixon (1986) compared the performance of students on two item formats

when examples and instructions for the tests conflicted (e.g., the words described a figure series test but

the example showed a figure classification problem). No matter how the experimenters varied the

emphasis on the directions, the college-aged students used the example as their primary source of

information on how to do the items and virtually ignored the written directions.

However, LeFevre and Dixon (1986) warn that “the present conclusions probably would not

apply to completely novel tasks; in such cases a single example might be insufficient to induce any

procedure at all” (p. 28; see also Bergman, 1980). LeFevre and Dixon argue that examinees need both

examples to help concretize procedures for tests and verbal instructions to help abstract the important

features from the examples. Feist and Gentner (2007) concur, finding that verbal directions were essential

to drawing attention to important features of test items, teaching basic strategies, and giving feedback on

practice items. “In speaking, we are induced by the language we use to attend to certain aspects of the


world while disregarding or de-emphasizing others” (p. 283). Although examples by themselves may be

less useful to naïve test-takers, they clearly play an important role in helping examinees develop specific

strategies and procedures for solving test items (Anderson, Farrell, & Sauers, 1984).

An additional benefit of using more practice items is that they provide a more representative

sample of test items (Haslerud & Meyers, 1958; Kittell, 1957; van de Vijver & Poortinga, 1992).

Presenting a range of realistic examples is important. Examples that are too simplified do not expose

students to the full range of item types and may lead them to develop ineffective strategies for the task,

while a range of examples can introduce strategies and allow examinees to anticipate the range of

challenges in the test (Jacobs & Vandeventer, 1971; Morrisett & Hovland, 1959). Furthermore, several

studies showed that schemas are better built from multiple practice items than from long explanations

(Anderson, Fincham, & Douglass, 1997; Klausmeier & Feldman, 1975; Morrisett & Hovland, 1959; Ross

& Kennedy, 1990).

Morrisett and Hovland (1959) also found that presenting a range of different practice items was

essential because it taught examinees to discriminate between task-relevant and task-irrelevant problem

features. However, they also found limits to how much variety is beneficial. Although they found that

extensive practice on just one type of item led to negative transfer on diverse items, they also found that

sufficient practice on a set of three item types was more effective than brief practice on 24 item types.

Thus, the best approach in selecting practice items is to identify a few broad categories of items and help

examinees develop workable but general strategies that can be adapted to different items within these

categories.

The relative timing of directions and example items is also a consideration. Lakin (2010) found

that the preamble describing the test’s purpose and task that many tests use may not be ideal and that

good test directions should engage students (especially young students) in responding to practice items

almost immediately rather than expecting them to listen to extensive explanations first. In fact, some

research has shown that it may be better to let the students see examples before they are given any


description of the task at all in order to make the description more concrete (Sullivan & Skanes, 1971).

This puts the onus of explanation on the feedback to responses rather than directions.

Provide Elaborative Feedback

Practice or example items can play an important role in demonstrating the format rules to

students. However, practice alone does not lead to optimal performance when it is not paired with

appropriate feedback. The research indicates that (1) feedback makes practice more efficient for learning

and (2) elaborative feedback that goes beyond yes/no verification improves learning further.

Practice with feedback is more efficient. The importance of feedback was established early in

the history of psychology (Thorndike, 1927). Morrisett and Hovland (1959) showed that feedback

increased the amount of learning from each practice item, whereas providing no feedback resulted in the

need for many more practice items to produce the same amount of learning. Furthermore, practice without

feedback has the tendency to make most students faster but not better at the task (Thorndike, 1914, Ch. 14),

because such practice leads to automatization of whatever strategy the student is using, whether efficient

or not, and not necessarily to improvement in performance.

Several studies support the need to provide feedback for practice items on tests (e.g., Hessels &

Hamers, 1993; Parshall et al., 2000; Sullivan & Skanes, 1971; Whitely & Dawis, 1974). In one example,

Tunteler, Pronk, and Resing (2008) conducted a microgenetic study of first-grade students solving figure

analogies and showed that although all students improved their performance from practice alone, targeted

feedback in a dynamic training session greatly increased performance. Importantly, Sullivan and Skanes

(1971) found that feedback had a greater effect for low-ability examinees, who gained much less from

practice without feedback than high-ability examinees. This latter finding is consistent with research

showing that high-ability examinees learn more from practice without feedback because they can more

often correct their own errors (Shute, 2007).

The positive effect of feedback for culturally and linguistically diverse students has been

confirmed in cross-cultural research as well. Hessels and Hamers (1993) and Resing, Tunteler, de Jong,


and Bosma (2009) found that dynamic testing approaches with tailored feedback and practice

significantly increased the performance of language- and ethnic-minority students on an ability test.

Elaborative feedback increases learning from practice problems. Several researchers have

investigated what degree of feedback and explanations are optimal (e.g., Judd, 1908; Kittell, 1957;

Sullivan, 1964). In her review of the literature, Shute (2007) concluded that elaborative feedback

provided in tutoring systems resulted in larger gains in comprehension compared to simple verification of

the answer. Elaborative feedback could include both a brief explanation of why the chosen answer was

correct as well as information on why other attractive or plausible distractors are wrong.

Ideally, the person administering the test should also be aware of what feedback the examinees

are receiving so that the examiner can determine that the examinee has sufficiently understood the

directions (van de Vijver & Poortinga, 1992). This is rarely a component of traditional test directions,

but in many cases could be easily added to the administration process (e.g., directing the administrator to

walk around during the practice items). Note that examiners would also need to be provided with

information on how to provide additional feedback to students who are not completing the practice items

correctly. Such comprehension checks should allow the examiner to be confident that all students

understand before the test begins. Computer-based administration of tests or practice tests would also be

effective in tracking practice item performance and routing students to additional practice.
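
As one concrete illustration of such routing, the sketch below is a hypothetical design in Python; the item fields, callbacks, and thresholds are illustrative assumptions rather than features of any test discussed in this article. It administers practice items with elaborative feedback and repeats practice until the examinee demonstrates understanding or is flagged for examiner follow-up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class PracticeItem:
    prompt: str
    key: str                 # correct response
    why_correct: str         # explanation of why the keyed answer is right
    why_wrong: Dict[str, str] = field(default_factory=dict)  # distractor -> targeted explanation

def practice_round(items: List[PracticeItem],
                   get_response: Callable[[PracticeItem], str],
                   show_feedback: Callable[[str], None]) -> float:
    """One pass through the practice items with elaborative feedback; returns the proportion correct."""
    correct = 0
    for item in items:
        answer = get_response(item)
        if answer == item.key:
            correct += 1
            show_feedback(item.why_correct)
        else:
            # Elaborative feedback: explain the specific error, not just "incorrect."
            show_feedback(item.why_wrong.get(answer, item.why_correct))
    return correct / len(items)

def run_directions(items: List[PracticeItem],
                   get_response: Callable[[PracticeItem], str],
                   show_feedback: Callable[[str], None],
                   mastery: float = 0.8,
                   max_rounds: int = 3) -> bool:
    """Route the examinee to repeated practice until understanding is demonstrated."""
    for _ in range(max_rounds):
        if practice_round(items, get_response, show_feedback) >= mastery:
            return True   # comprehension check passed; the scored test can begin
    return False          # flag for examiner follow-up before testing begins
```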

Collect Evidence on the Quality of Directions

Test directions are a critical component of test design that influence both the construct measured

and the precision of that measurement (Clemans, 1971; Cronbach, 1970), yet they rarely receive

systematic evaluation to confirm that they are effective in reducing pre-existing differences in test

familiarity among examinees. Ideally, directions would be evaluated as a part of the validity argument for

a test to assure that they adequately familiarize all students with the task. Examples of appropriate

research methods for this purpose have been reviewed earlier in this article (LeFevre & Dixon, 1986;

Fuchs et al., 2000; Lohman & Al-Mahrzi, 2003; te Nijenhuis et al., 2007; see also Leighton & Gokiert,

2005). Likewise, researchers should evaluate the magnitude of practice effects (overall and for specific


students) as one check to determine how much students gain from changes in test familiarity alone,

because this can threaten score interpretation (particularly longitudinal growth estimates).

The age, test experience, and background characteristics of the examinees clearly matter in the

design of directions and warrant additional research. First, young students clearly require more

orientation and distributed practice opportunities2 to help them understand the test tasks (Dreisbach &

Keogh, 1982; Lakin, 2010), although finding the optimal method requires more research. In contrast, the

particular issue with older examinees seems to be getting them to attend to the directions (LeFevre &

Dixon, 1986), a challenge that has also not been addressed with empirical research.

Second, the background characteristics of the students, whether in terms of cultural differences

that influence test-wiseness (Biesheuvel, 1972) or linguistic differences (Dreisbach & Keogh, 1982; Ortiz

& Ochoa, 2005), may influence the type and amount of directions and practice that are needed. In schools

with large populations of culturally and linguistically diverse students, the adequacy of test directions to

create scores that reflect valid comparisons of student performance is an even greater concern and

requires additional fairness research evidence. Other background characteristics should also be

considered, when appropriate. One example is the Matthew effect, where gains from practice

opportunities differentially benefit the most able students, at least on cognitive ability tests (Jensen, 1980;

Sullivan, 1964; te Nijenhuis et al., 2007). One estimate of the Matthew effect comes from Kulik et al.’s

(1984) meta-analysis, which estimated the effects of practice to be .17 SD for low ability students, .40 for

middle ability, and .82 for high ability (note that this was only for identical tests—the differential gains

were much smaller for parallel forms, d=.07, .26, .27, respectively). Test developers should consider such

complications and avoid unintentionally exacerbating disparities.

2 It is important to note that teaching test-related skills does not necessarily have to happen in isolation or extend the time spent on test preparation. Integrated instruction/assessment systems, such as the CBAL assessment system (Bennett et al., 2010; see also DiCerbo & Behrens, 2012), embed instruction related to learning the assessment system and tools into meaningful content instruction.


Conclusions about the Design of Test Directions

Complex and novel item formats play an important role in both cognitive ability and achievement

testing. Introducing new item formats can yield large familiarity or practice effects when students first

encounter the format, which can influence evaluation at the individual or group level (Jensen, 1980; Fuchs

et al., 2000). When access to test preparation varies, particularly when tests have high individual stakes

attached to their results, differential novelty between students becomes a serious threat to fairness and the

validity of interpretations from scores as well. In the diverse populations many schools serve, equal

familiarity with testing and access to test preparation cannot be assumed.

Based on the literature review, several guidelines for creating directions were substantiated. It

was clear that directions should be treated as an instructional activity that engages students in the task, use

clear examples, provide elaborative feedback, and use age/level-appropriate language. The literature also

suggests that when students vary widely in their familiarity with tests, fairness is likely best achieved

through increased flexibility in test administration (e.g., additional practice for some, changes in

directions) rather than rigid adherence to standardized procedures.

A critical area for future research is the use of computer-based adaptive directions that could

provide the needed flexibility in practice and feedback. Even for paper-based tests, computer-based

practice activities could be beneficial in addressing the suggestions for directions and practice made in

this article. Computers can respond adaptively to what students can already do and provide more or less

practice as needed. Computer-based practice and directions could even provide basic test-wiseness

strategies for students who lack any familiarity with tests, training that would be inappropriate and boring

for other students. They can also permit the sort of semi-structured trial-and-solution strategy that

students commonly use when learning new computer games (Gee, 2003). Computer-based directions

could also target the feedback to provide only useful (and brief) corrections for errors made on practice

items and support this feedback with appropriate visual demonstrations. A further benefit of computer-

based practice would be that examinees could elect to hear the directions in one of several languages.
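
One way to picture this flexibility is a small configuration object for a directions module; the field names below are hypothetical illustrations of the options just described, not part of any existing assessment system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DirectionsConfig:
    languages: List[str] = field(default_factory=lambda: ["en", "es"])  # examinee-selectable audio
    offer_test_wiseness_module: bool = False   # shown only to students unfamiliar with testing
    min_practice_items: int = 3
    max_practice_items: int = 10               # adaptive: additional practice only where needed
    feedback_style: str = "elaborative"        # brief corrections paired with visual demonstrations
```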


Regardless of the format of the test, the most important lesson from the literature is that test

directions are a vital contributor to test validity, should be systematically studied and designed, and

should be evaluated as part of the process of compiling a validity argument. Unless test developers and

users are certain that all students have had equal opportunity to understand the task presented, bias may be

introduced through systematic differences in test familiarity.


References

American Educational Research Association, American Psychological Association, & National Council

on Measurement in Education. (1999). Standards for educational and psychological testing.

Washington, DC: American Psychological Association.

Anastasi, A. (1981). Coaching, test sophistication, and developed abilities. American Psychologist, 36,

1086-1093.

Anderson, J. R., Fincham, J. M., & Douglass, S. (1997). The role of examples and rules in the acquisition

of a cognitive skill. Journal of Experimental Psychology Learning Memory and Cognition, 23,

932-945.

Baker, E. L., Atwood, N. K., & Duffy, T. M. (1988). Cognitive approaches to assessing the readability of

text. In A. Davison, & G. M. Green (Eds.), Linguistic complexity and text comprehension:

Readability issues reconsidered (pp. 55-83). Hillsdale, NJ: Lawrence Erlbaum.

Bennett, R.E., Persky, H., Weiss, A., & Jenkins, F. (2010). Measuring problem solving with technology:

A demonstration study for NAEP. Journal of Technology, Learning, and Assessment, 8.

Retrieved July 23, 2010, from http://www.jtla.org

Bergman, I. B. (1980). The effects of test-taking instruction vs. practice without instruction on the

examination responses of college freshmen to multiple-choice, open-ended, and cloze type

questions. Dissertation Abstracts International: Section A. Humanities and Social Sciences, 41(1-

A), 176.

Biesheuvel, S. (1972). Adaptability: Its measurement and determinants. In L. J. Cronbach, & P. J. D.

Drenth (Eds.), Mental tests and cultural adaptation (pp. 47-62). The Hague: Mouton.

Bond, L. (1989). The effects of special preparation on measures of scholastic ability. In R. L. Linn (Ed.),

Educational measurement (3rd ed., pp. 429-444). New York, NY: American Council on

Education/Macmillan.

Brennan, R. L. (Ed.). (2006a). Educational measurement (4th ed.). Westport, CT: American Council on

Education/Praeger.


Brennan, R. L. (2006b). Perspectives on the evolution and future of educational measurement. In R. L.

Brennan (Ed.), Educational measurement, fourth edition (pp. 1-16). Westport, CT: American

Council on Education/Praeger.

Briggs, D. C. (2001). The effect of admissions test preparation: Evidence from NELS: 88. Chance, 14(1),

10-21.

Budoff, M. (1987). The validity of learning potential assessment. In C.S. Lidz (Ed.), Dynamic assessment

(pp. 52-81). New York, NY: Guilford Press.

Budoff, M., Gimon, A., & Corman, L. (1974). Learning potential measurement with Spanish-speaking

youth as an alternative to IQ tests: A first report. InterAmerican Journal of Psychology, 8, 233-

246.

Byrnes, J.P., & Takahira, S. (1993). Explaining gender differences on SAT-Math items. Developmental

Psychology, 29, 805-810.

Caffrey, E., Fuchs, D., & Fuchs, L.S. (2008). The predictive validity of dynamic assessment: A review.

Journal of Special Education, 41, 254-270.

Center for K–12 Assessment & Performance Management (2011, July). Coming together to

raise achievement: New assessments for the Common Core State Standards. Princeton,

NJ: Educational Testing Service.

Chrispeels, J. H., & Rivero, E. (2001). Engaging Latino families for student success: How parent

education can reshape parents' sense of place in the education of their children. Peabody Journal

of Education, 76(2), 119-169.

Clemans, W. V. (1971). Test administration. In R. L. Thorndike (Ed.), Educational measurement (2nd

ed., pp. 188-201). Washington, D.C.: American Council on Education.

Cohen, A. S., & Wollack, J. A. (2006). Test administration, security, scoring, and reporting. In R. L.

Brennan (Ed.), Educational measurement (4th ed., pp. 355-386). New York, NY: American

Council on Education/Macmillan.


Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York, NY: Harper and Row.

Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). New York, NY: Harper and Row.

Davison, A., & Kantor, R. (1982). On the failure of readability formulas to define readable texts: A case

study from adaptations. Reading Research Quarterly, 17, 187-209.

DiCerbo, K.E., & Behrens, J.T. (2012). Implications of the digital ocean on current and future assessment.

In R. W. Lissitz & H. Jiao, Computers and their impact on state assessments: Recent history and

predictions for the future (pp. 273-306). Charlotte, NC: Information Age Publishing, Inc.

Dreisbach, M., & Keogh, B. K. (1982). Testwiseness as a factor in readiness test performance of young

Mexican-American children. Journal of Educational Psychology, 74(2), 224-229.

Ebel, R. (1965). Measuring educational achievement. Englewood Cliffs, N. J: Prentice-Hall.

Evans, F. R., & Pike, L. W. (1973). The effects of instruction for three mathematics item formats. Journal

of Educational Measurement, 10(4), 257-271.

Fairbairn, S. (2007). Facilitating greater test success for English language learners. Practical Assessment,

Research, & Evaluation, 12(11). Retrieved from http://pareonline.net/getvn.asp?v=12&n=11

Fairbairn, S., & Fox, J. (2009). Inclusive achievement testing for linguistically and culturally diverse test

takers: Essential considerations for test developers and decision makers. Educational

Measurement: Issues & Practice, 28(1), 10-24. doi:10.1111/j.1745-3992.2009.01133.x

Federal Interagency Forum on Child and Family Statistics. (2011). America’s children: Key national

indicators of well-being. Washington, DC: U.S. Government Printing Office.

Feist, M. I., & Gentner, D. (2007). Spatial language influences memory for spatial scenes. Memory and

Cognition, 35(2), 283-296.

Flanagan, D. P., Kaminer, T., Alfonso, V. C., & Rader, D. E. (1995). Incidence of basic concepts in the

directions of new and recently revised American intelligence tests for preschool children. School

Psychology International, 16(4), 345-364.


Fuchs, L. S., Fuchs, D., Karns, K., Hamlett, C. L., Dutka, S., & Katzaroff, M. (2000). The importance of

providing background information on the structure and scoring of performance assessments.

Applied Measurement in Education, 13, 1-34.

Gee, J. P. (2003). What video games have to teach us about learning and literacy. ACM Computers

in Entertainment, 1(1), 1-4.

Geisinger, K.F. (1998). Psychometric issues in test interpretation. In J. Sandoval, C.L. Frisby, K.F.

Geisinger, J.D. Scheuneman, & J.R. Grenier, Test interpretation and diversity: Achieving equity

in assessment (pp. 17-30). Washington, DC: American Psychological Association.

Glutting, J. J., & McDermott, P. A. (1989). Using “teaching items” on ability tests: A nice idea, but does

it work? Educational and Psychological Measurement, 49, 257-268.

Greenfield, P. M., Quiroz, B., & Raeff, C. (2000). Cross-cultural conflict and harmony in the social

construction of the child. New Directions for Child and Adolescent Development, 2000 (87), 93-

108.

Haladyna, T.M., & Downing, S.M. (2004). Construct-irrelevant variance in high-stakes testing.

Educational Measurement: Issues and Practice, 23(1), 17-27.

Haslerud, G. M., & Meyers, S. (1958). The transfer value of given and individually derived principles.

Journal of Educational Psychology, 49(6), 293-298.

Hattie, J., Jaeger, R.M., & Bond, L. (1999). Persistent methodological questions in educational testing.

Review of Research in Education, 24, 393-446.

Hessels, M. G. P., & Hamers, J. H. M. (1993). The learning potential test for ethnic minorities. In J. H. M.

Hamers, K. Sijtsma & Ruijssenaars, A. J. J. M. (Eds.), Learning potential assessment:

Theoretical, methodological and practical issues (pp. 285-313). Amsterdam: Swets & Zeitlinger.

Jacobs, P. I., & Vandeventer, M. (1971). The learning and transfer of double-classification skills by first

graders. Society for Research in Child Development, 42, 149-159.

James, W. S. (1953). Symposium on the effects of coaching and practice in intelligence tests: Coaching for all recommended. British Journal of Educational Psychology, 23, 155-162.

Jensen, A.R. (1980). Bias in mental testing. New York, NY: The Free Press.

Judd, C. H. (1908). The relation of special training to general intelligence. Educational Review, 36, 28-42.

Kaufman, A. S. (1978). The importance of basic concepts in individual assessment of preschool children. Journal of School Psychology, 16, 207-211.

Kaufman, A.S. (2010). "In what way are apples and oranges alike?": A critique of Flynn's interpretation of the Flynn Effect. Journal of Psychoeducational Assessment. Advance online publication. doi:10.1177/0734282910373346

Kingsbury, G.G., & Wise, S.L. (2012). Turning the page: How smarter testing, vertical scales, and understanding of student engagement may improve our tests. In R.W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessments: Recent history and predictions for the future (pp. 245-269). Charlotte, NC: Information Age Publishing, Inc.

Kittell, J. E. (1957). An experimental study of the effect of external direction during learning on transfer and retention of principles. Journal of Educational Psychology, 48(6), 391-405.

Klausmeier, H. J., & Feldman, K. V. (1975). Effects of a definition and a varying number of examples and nonexamples on concept attainment. Journal of Educational Psychology, 67(2), 174-178.

Kopriva, R. J. (2008). Improving testing for English language learners. New York, NY: Routledge.

Kulik, J. A., Kulik, C. C., & Bangert, R. L. (1984). Effects of practice on aptitude and achievement test scores. American Educational Research Journal, 21, 435-447.

Lakin, J.M. (2010). Comparison of test directions for ability tests: Impact on young English-language learner and non-ELL students (Unpublished doctoral dissertation). The University of Iowa, Iowa City. Retrieved from http://ir.uiowa.edu/etd/536

Lazer, S. (2009, December 10). Technical challenges with innovative item types [Video from presentation at National Academies of Science]. Retrieved August 3, 2010, from http://www7.nationalacademies.org/bota/Steve%20Lazer.pdf

LeFevre, J. A., & Dixon, P. (1986). Do written instructions need examples? Cognition and Instruction, 3, 1-30.

Leighton, J.P., & Gokiert, R.J. (2005, April). The cognitive effects of test item features: Informing item generation by identifying construct irrelevant variance. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada. Retrieved from http://www2.education.ualberta.ca/.../NCME_Final_Leigh&Gokiert.pdf

Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.

Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York, NY: American Council on Education/Macmillan.

Lissitz, R.W., & Jiao, H. (Eds.). (2012). Computers and their impact on state assessments: Recent history and predictions for the future. Charlotte, NC: Information Age Publishing, Inc.

Lohman, D. F., & Al-Mahrzi, R. (2003). Personal standard errors of measurement. Retrieved from http://faculty.education.uiowa.edu/dlohman/pdf/personalSEMr5.pdf

Marshalek, B., Lohman, D. F., & Snow, R. E. (1983). The complexity continuum in the radex and hierarchical models of intelligence. Intelligence, 7, 107-128.

Martinez, M.E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218.

Maspons, M. M., & Llabre, M. M. (1985). The influence of training Hispanics in test taking on the psychometric properties of a test. Journal for Research in Mathematics Education, 16, 177-183.

Mayer, R. E. (2001). Multimedia learning. New York, NY: Cambridge University Press.

McCallum, R. S., Bracken, B. A., & Wasserman, J. D. (2001). Essentials of nonverbal assessment. Hoboken, NJ: Wiley.

Millman, J., Bishop, C.H., & Ebel, R. (1965). An analysis of test-wiseness. Educational and Psychological Measurement, 25(3), 707-726.

Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335-355). New York, NY: American Council on Education/Macmillan.

Morrisett, L., & Hovland, C. I. (1959). A comparison of three varieties of training in human problem solving. Journal of Experimental Psychology, 58(1), 52-55.

National Center for Education Statistics. (2012, June). Science in action: Hands-on and interactive computer tasks from the 2009 science assessment. Retrieved from https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2012468

te Nijenhuis, J., van Vianen, A.E.M., & van der Flier, H. (2007). Score gains on g-loaded tests: No g. Intelligence, 35, 283-300. doi:10.1016/j.intell.2006.07.006

te Nijenhuis, J., Voskuijl, O.F., & Schijve, N.B. (2001). Practice and coaching on IQ tests: Quite a lot of g. International Journal of Selection and Assessment, 9(4), 302-308.

Norris, S.P., Leighton, J.P., & Phillips, L.M. (2004). What is at stake in knowing the content and capabilities of children's minds? A case for basing high stakes tests on cognitive models. Theory and Research in Education, 2, 283-308. doi:10.1177/1477878504046524

Oakland, T. (1972). The effects of test-wiseness materials on standardized test performance of preschool disadvantaged children. Journal of School Psychology, 10, 355-360.

Ortar, G. R. (1960). Improving test validity by coaching. Educational Research, 2, 137-142.

Ortar, G. (1972). Some principles for adaptation of psychological tests. In L. J. Cronbach & P. J. D. Drenth (Eds.), Mental tests and cultural adaptation (pp. 111-120). The Hague: Mouton.

Ortiz, S. O., & Ochoa, S. H. (2005). Conceptual measurement and methodological issues in cognitive assessment of culturally and linguistically diverse individuals. In R. L. Rhodes, S. H. Ochoa, & S. O. Ortiz (Eds.), Assessing culturally and linguistically diverse students: A practical guide (pp. 153-167). New York, NY: Guilford Press.

Parshall, C.G., Davey, T., & Pashley, P.J. (2000). Innovative item types for computerized testing. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129-148). The Netherlands: Kluwer Academic Publishers.

Passel, J.S., Cohn, D., & Lopez, M.H. (2011). Hispanics account for more than half of nation's growth in past decade. Retrieved from http://pewresearch.org/pubs/1940/hispanic-united-states-population-growth-2010-census

Peña, E. D., & Quinn, R. (1997). Task familiarity: Effects on the test performance of Puerto Rican and African American children. Language, Speech, and Hearing Services in Schools, 28, 323-332.

Prins, F.J., Veenman, M.V.J., & Elshout, J.J. (2006). The impact of intellectual ability and metacognition on learning: New support for the threshold of problematicity theory. Learning and Instruction, 16, 374-387.

Rand, Y., & Kaniel, S. (1987). Group administration of the LPAD. In C.S. Lidz (Ed.), Dynamic assessment (pp. 196-214). New York, NY: Guilford Press.

Reese, L., Balzano, S., Gallimore, R., & Goldenberg, C. (1995). The concept of educación: Latino family values and American schooling. International Journal of Educational Research, 23, 57-81.

Resing, W.C.M., Tunteler, E., de Jong, F.M., & Bosma, T. (2009). Dynamic testing in indigenous and ethnic minority children. Learning and Individual Differences, 19, 445-450.

Ross, B. H., & Kennedy, P. T. (1990). Generalizing from the use of earlier examples in problem solving. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 42-55.

Sattler, J. (2001). Assessment of children: Cognitive applications. San Diego, CA: Jerome M. Sattler.

Scarr, S. (1981). Implicit messages: A review of “Bias in mental testing”. American Journal of Education, 89(3), 330-338.

Scarr, S. (1994). Culture-fair and culture-free tests. In R. J. Sternberg (Ed.), Encyclopedia of human intelligence (pp. 322-328). New York, NY: Macmillan.

Schwarz, P. A. (1971). Prediction instruments for educational outcomes. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 303-334). Washington, DC: American Council on Education.

Shute, V. J. (2007). Focus on formative feedback (Research report No. RR-07-11). Retrieved from Educational Testing Service website: http://www.ets.org/research/researcher/RR-07-11.html

Solano-Flores, G. (2008). Who is given tests in what language by whom, when, and where? The need for probabilistic views of language in the testing of English Language Learners. Educational Researcher, 37(4), 189-199.

Sullivan, A. M. (1964). The relation between intelligence and transfer (Doctoral thesis, McGill University, Montreal, Quebec).

Sullivan, A. M., & Skanes, G. R. (1971). Differential transfer of training in bright and dull subjects of the same mental age. British Journal of Educational Psychology, 41(3), 287-293.

Thorndike, E. L. (1914). Educational psychology: A briefer course. New York, NY: Teachers College, Columbia University.

Thorndike, E. L. (1922). Practice effects in intelligence tests. Journal of Experimental Psychology, 5(2), 101.

Thorndike, E. L. (1927). The law of effect. American Journal of Psychology, 39, 212-222.

Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Traxler, A. E. (1951). Administering and scoring the objective test. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 329-416). Washington, DC: American Council on Education.

Tunteler, E., Pronk, C., & Resing, W.C.M. (2008). Inter- and intra-individual variability in the process of change in the use of analogical strategies to solve geometric tasks in children: A microgenetic analysis. Learning and Individual Differences, 18, 44-60.

Tzuriel, D., & Haywood, H.C. (1992). The development of interactive-dynamic approaches to assessment of learning potential. In H.C. Haywood & D. Tzuriel (Eds.), Interactive assessment (pp. 3-37). New York, NY: Springer-Verlag.

van de Vijver, F., & Poortinga, Y. (1992). Testing in culturally heterogeneous populations: When are cultural loadings undesirable? European Journal of Psychological Assessment, 8(1), 17-24.

Walpole, M., McDonough, P. M., Bauer, C. J., Gibson, C., Kanyi, K., & Toliver, R. (2005). This test is unfair: Urban African American and Latino high school students' perceptions of standardized college admission tests. Urban Education, 40(3), 321-349.

Whitely, S. E., & Dawis, R. V. (1974). Effects of cognitive intervention on latent ability measured from analogy items. Journal of Educational Psychology, 66, 710-717.