
MUSIC TEACHERS’ ASSESSMENT LITERACY, BELIEFS, & PRACTICES:

AN INTERVENTION STUDY

by

JOCELYN WENONA ARMES

B.A., Salisbury University, 2012

M.M., Ithaca College, 2016

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirement for the degree of

Doctor of Philosophy

Department of Music Education

2020

Doctoral Committee:

James R. Austin, Chair

Margaret H. Berg

David A. Rickels

Tom Myer

Carolyn A. Haug

Armes, Jocelyn W. (Ph.D., Music Education)

Music Teachers’ Assessment Literacy, Beliefs, & Practices

Dissertation Directed by Dr. James R. Austin

Abstract

While shifting priorities in educational policy have increased demand that teachers be

proficient in classroom assessment, teacher preparation programs have responded slowly, despite

evidence that preservice training in assessment may increase teachers’ knowledge and valuation

of assessment. Professional development (PD) has been shown to change inservice

teachers’ knowledge and attitudes toward assessment, but researchers have not found evidence

that PD influences assessment practices. The purpose of this pretest-posttest control group study

was to examine the effect of an online PD on music teachers’ assessment literacy, beliefs, and

practices. Forty-three participants were randomly assigned to the control or intervention group and

completed the pretest and posttest. These questionnaires consisted of three measures: the

Classroom Assessment Literacy Inventory (CALI), the Music Teacher Assessment Implementation

Inventory (MTAII), and the Music Teacher Assessment Beliefs Inventory (MTABI). Intervention group

participants (n = 18) enrolled in a four-week online PD focused on increasing music teacher

assessment literacy.

Following the PD, intervention participants demonstrated a significant increase in

assessment literacy scores compared to their control group peers; assessment beliefs and practices

did not significantly change over time. I found several significant relationships between

participants’ assessment literacy and their self-reported practices, as well as significant

relationships between participants’ assessment beliefs and practices. Implications for music

teacher PD in assessment, preservice teachers, and music teacher educators are discussed.

Acknowledgements

Completing this dissertation, and this degree, would not have been possible without the

support of countless far-flung people in my life, for whom I am unspeakably grateful.

First, and always, thank you to my mother and grandmother -- Pamela Simson and Jackie

Fritch -- for being not only my staunchest supporters, but also for setting the standard for what true inner

strength, compassion, determination, patience, and a sense of justice can accomplish in service to

others. Without our family of brilliant and strong women, I do not know who I would be, or what

is possible. Mom, you are my inspiration as a teacher and human being. Grandmomsy Jackie,

even though you are no longer with us, I see you in every member of our family, and I am so

grateful for your influence and love. I love you both.

To my adviser, Dr. James “Papa Bear” Austin, I have no possible way of adequately

thanking you for your mentorship and support throughout not only this dissertation, but this

entire degree. It has been the privilege of my academic life to study with you, and to serve as

your editorial assistant for the Journal of Music Teacher Education. I will fondly recall our many

meetings, your kindness, humor, and encouragement. I am sure I drove you crazy at times with

my sense of scale and use of metaphors (e.g., yoga), but I suspect you enjoyed our conversations

as much as I did. I would not be the thinker, writer, researcher, or person I am today without you,

and I am undoubtedly better for it.

To my committee members -- Dr. Margaret Berg, Dr. David Rickels, Professor Tom

Myer, and Dr. Carolyn Haug -- thank you for your collective efforts and support throughout this

degree. I heard all your voices in my head throughout this process, and the final document is

more thoughtful and complete as a result. Thank you, Dr. Berg, for your personal and

administrative support of my research and success throughout this degree. While this project did not

end up using a qualitative design, our conversations still guided my thinking about participants’

experiences throughout the intervention; you have shown me how to think in Technicolor as a

researcher, and I am grateful. Thank you, Dr. Rickels, for your support during my time at CU. I

have enjoyed learning from you in different contexts -- from MSE to research methods -- and

have often found that “resistance is futile” when it comes to your enthusiasm for all things

spreadsheet-related. Thank you, Professor Myer, for serving on my committee and for your

support of my musicianship during my time at CU. I gleefully recall the first time we met as I

was touring campus; I accosted you in the hall and asked if we could discuss the Harbison

Sonata. You were gracious enough to spare me a few moments and have included me as an

honorary member of the studio ever since. To Dr. Haug, thank you for sharing your expertise in

assessment and education; your suggestions were some of the most insightful.

To my graduate cohort, your friendship was one of the most important aspects of my

experience during this degree. Our conversations always improved my thinking, and my spirits.

Special thanks to Seth Taft, Jacob Holster, and Ian Miller for also participating in our zany sax

quartet, and to Ellie Wolfe, Kate Bertelli-Wilinski, and Bryan Koerner for your friendship,

support, and wisdom from afar.

Finally, at risk of being verbose, thank you to the many colleagues, friends, and family

members, from music education to yoga, with whom I have crossed paths in this life. You

have taught me that nothing meaningful happens without community.

Table of Contents

Chapter

I. Introduction to the Study …………………………………………………………………...1

What Drives Teachers’ Assessment Practices? ............................................................2

Definitions & Purposes of Assessment ……………………………………………….9

Assessment Literacy ………………………………………………………………...14

Assessment Beliefs ………………………………………………………………….15

Teacher Preparation and Development ……………………………………………...17

Professional Development and Assessment Literacy………………………………..19

Study Need and Significance ………………………………………………………..22

Purpose and Research Questions ……………………………………………………25

Definitions …………………………………………………………………………..27

Delimitations ………………………………………………………………………..29

Assessment Literacy……………………………………………………………..29

Sampling ………………………………………………………………………...29

Measures ………………………………………………………………………...29

Researcher Interest …………………………………………………………………..31

II. Review of Related Literature ……………………………………………………………...34

Assessment Literacy ………………………………………………………………...35

...in Education ……………………………………………………………………35

...in Music Education ……………………………………………………………43

...in Summary …………………………………………………………………..44

Assessment Beliefs ………………………………………………………………….45

...of Teachers …………………………………………………………………….46

...of Music Teachers ……………………………………………………………..57

...in Summary ……………………………………………………………………60

Assessment Practices ………………………………………………………………..61

...of Teachers …………………………………………………………………….62

...of Music Teachers ……………………………………………………………..69

...in Summary ……………………………………………………………………78

III. Methodology ………………………………………………………………………………..80

Research Design and Intervention …………………………………………………..81

Research Design ...………………………………………………………………81

Intervention Design ……………………………………………………………...82

Course Elements and Organization …………...………………………………...83

Population and Sample ……………………………………………………………...87

Selection and Design of Research Measures ………………………………………..88

Data Collection Instruments ………………………………………………………...90

Prescreening Questionnaire and Informed Consent …………………………….90

Assessment Literacy ……………………………………………….…………….91

Assessment Practices ……………………………………………………………92

Assessment Beliefs ……………………………………………………….……...94

Intervention Group Posttest ……………………………………………………..95

Procedures …………………………………………………………………………...95

Pilot Testing ……………………………………………………………………..95

Data Collection ………………………………………………………………….96

Data Analysis ………………………………………………………………..…..97

IV. Results …………………………………………………………………………………..…100

Participant Demographics …………………………………………............………..….101

Reliability and Item Analysis …………………………...............................………..…104

CALI Reliability and Item Analysis …………………......................……..…....104

Correlations ……………………….................................................……105

Difficulty and Discrimination Indices ....................................................106

MTABI Reliability …………………………………………….................……..107

Descriptive Statistics for the CALI, MTAII, & MTABI ...................................................109

CALI Descriptives ...............................................................................................110

MTAII Descriptives .............................................................................................110

MTABI Descriptives ............................................................................................111

Research Questions .........................................................................................................112

Multivariate Analysis of Assessment Literacy and Beliefs .................................113

MANOVA Results ..................................................................................114

Nonparametric Analyses of Assessment Practices .............................................115

Mann-Whitney U Test Results ................................................................117

Spearman Rho Results ............................................................................118

Feedback from Intervention Participants ........................................................................121

Question One ......................................................................................................122

Question Two ......................................................................................................122

Question Three ....................................................................................................123

Question Four .....................................................................................................123

Question Five .....................................................................................................124

V. Summary and Conclusions ................................................................................................125

Summary of Findings .....................................................................................................126

Assessment Literacy ...........................................................................................126

Assessment Beliefs ..............................................................................................127

Assessment Practices ..........................................................................................127

Relationships Between Assessment Literacy, Beliefs, and Practices ..................127

Discussion........................................................................................................................128

Major Findings ..................................................................................................128

Music Teachers Lack Prior Assessment Training ..................................128

Online Professional Development Formatting ........................................130

Assessment Literacy can be Impacted Through Intervention .................131

Assessment Beliefs are Related to Assessment Literacy ........................132

Assessment Beliefs Appear Stable Across Time ....................................133

Music Teachers’ Assessment Practices Vary and are

Largely Informal .........................................................................134

Assessment Beliefs and Assessment Literacy are

Related to Assessment Practices .................................................135

Music Teachers’ Assessment Practices May Be

Impacted by Other Factors .........................................................135

Music Teachers’ Educational Decision Making .................................................136

Role of Other Factors ..............................................................................139

Internal Factors ...........................................................................139

External Factors ..........................................................................140

Tension .......................................................................................141

Classroom Realities ....................................................................142

Role of Socialization ...............................................................................143

Measuring Assessment Literacy, Beliefs, and Practices .....................................145

Calibration ...............................................................................................145

Dimensionality ........................................................................................146

Implications .....................................................................................................................149

...for Music Teacher Development ......................................................................149

...for Future Implementation of this Intervention ...............................................151

Study Limitations and Implementation Challenges ........................................................153

Sampling Procedure ............................................................................................153

Participant Attrition Between Stages ..................................................................156

History Effects Due to the COVID-19 Pandemic ...............................................156

Reliability of the CALI Instrument ......................................................................157

Recommendations for Future Researchers .....................................................................157

Conclusion ......................................................................................................................161

References .................................................................................................................................163

Appendices ................................................................................................................................180

A. IRB Approval Documentation ..............................................................................................180

B. Standards for Teacher Competence in Educational Assessment of Students (STCEAS)

Standards & Corresponding CALI Items ..............................................................................181

C. Original CALI by Mertler (2002) ..........................................................................................182

D. Music Teachers Assessment Workshop (MTAW) Design .....................................................192

E. Prescreening and Informed Consent Questionnaire & Informed Consent ............................199

F. Adapted Classroom Assessment Literacy Inventory (CALI) ................................................205

G. Music Teacher Assessment Implementation Inventory (MTAII) ...........................................210

H. Music Teacher Assessment Beliefs Inventory (MTABI) ........................................................212

I. Invitation to Participate .........................................................................................................214

J. Trigger Email Correspondence .............................................................................................215

K. CALI Distractor Item Analysis ..............................................................................................217

L. Intervention Participant Feedback .......................................................................................218

M. Teacher-Constructed Task Exemplar ....................................................................................220

N. Descriptive Data for the CALI, MTAII, and MTABI by Assigned Group .............................233

List of Tables

Table

1.1 The Standards for Teacher Competence in Educational Assessment of Students...............4

2.1 Dissertations about Music Teacher Assessment Practices ................................................70

3.1 Participant Tasks ...............................................................................................................84

3.2 Sample Items for Subscales of Assessment Literacy in the Original CALI ......................92

4.1 Participant Descriptive Statistics ....................................................................................103

4.2 Correlations of Item Scores with CALI Standards ..........................................................106

4.3 Difficulty & Discrimination Indices ...............................................................................107

4.4 CALI Pretest and Posttest Descriptive Statistics .............................................................110

4.5 MTAII Pretest and Posttest Descriptive Statistics ...........................................................111

4.6 MTABI Pretest and Posttest Descriptive Statistics ..........................................................112

4.7 Mann-Whitney U Test Results on Assessment Practice Mean Change Scores ..............119

4.8 Spearman’s Rho Results .................................................................................................120

List of Figures

Figure

1.1 Cyclical Response of National Organizations to Reform Efforts .......................................3

1.2 The Cyclic Nature of Assessment & Educational Decision Making in Instruction .........11

1.3 Teachers’ Classroom Assessment Decision Making ........................................................23

3.1 Conceptual Diagram of Study Procedures ........................................................................82

3.2 Perusall Discussion Board on an Assessment Text ..........................................................86

4.1 Pretest to Posttest Mean Literacy Scores ........................................................................116

4.2 Pretest to Posttest Mean Belief Scores ............................................................................117

5.1 Music Teachers’ Classroom Assessment Decision Making ...........................................137

5.2 Internal Factors ...............................................................................................................140

5.3 External Factors ..............................................................................................................141

5.4 Tensions ..........................................................................................................................142

5.5 Classroom Realities ........................................................................................................143

Chapter 1

Introduction to the Study

In his keynote address for the 2007 Florida Symposium on Assessment in Music

Education, Richard Colwell expressed the following sentiment:

“Music educators have two ideas about assessment: (1) we assess continually or (2) our

goals cannot be assessed. Murphy phrases this dichotomy as either “ardent passion or

blithe disregard.” Other than our fascination with aptitude tests, individual assessment

has been informal, and used in private instruction. Group assessment has been conducted

in classes, contests, festivals, and concerts. The professional literature in assessment is

focused on program assessment. However, music education, K-12, has no identifiable

programs.” (p. 3)

While perhaps harsh, Colwell’s commentary is relevant. Over the past thirty years — as national

organizations, policy makers, educational experts, and researchers have debated how to reform

education — the meaning of assessment has become distorted. Teachers now equate assessment

with standardized, high-stakes testing (Colwell, 2008; Heritage, 2007; Stiggins, 2014) and, as a

result, they assess learning in narrower or more calculated ways, and their efforts to improve how

they assess at the classroom level have been compromised.

Historically, educators have used assessments for the purposes of diagnosis,

accountability, and communication (Nierman & Colwell, 2019, p. 181). Music educators,

however, have narrower views and practices. As Colwell opined at the Florida Symposium,

music educators use informal assessments almost exclusively, or “believe that music is all

process or they believe that their classes/ensembles are continually being assessed in public

performances or by outside adjudicators in contests or festivals” (p. 181). Yet, there is an

argument to be made that “the music education profession has been about the business of

assessment since its inception” (Nierman & Colwell, 2019, p. 181); secondary music teachers, in

particular, claim to informally assess students’ performance through error detection. Here resides

the tension among music educators about the nature of assessment: they are both assessing

constantly, and not fully assessing in ways recognized by educational experts and stakeholders.

Since the inception of the national standards movement, obtaining recognition for music as a

legitimate school subject has been a coveted goal of the music education profession and the

organizations that have represented and advocated for it (Colwell, 2008). Abdicating

responsibility for assessing students -- whether due to “ardent passion or blithe disregard” -- is

not a viable path for music educators interested in legitimizing the discipline. There are societal

and political forces that compel music educators to embrace the multiplicity of assessment

purposes, both professionally and for the benefit of their students.

In this chapter, I will discuss the historical rationales for assessment, the definitions and

primary purposes of assessment, the research surrounding assessment literacy, beliefs, and

practices, and the potential of online professional development to increase music teacher

assessment literacy and enhance their use of assessment within the classroom. Then, I will situate

the need for the present study within the literature, articulate my purposes for this intervention

study and my research questions, provide definitions for pertinent constructs, and describe

delimitations of the study. Finally, I will share my personal interest in this topic.

What Drives Teachers’ Assessment Practices?

Historically, teachers have changed their teaching and assessment practices in response to

external pressures from district-, state-, and federal-level policies and other contextual demands

within their jobs. The time between policy issuance and teacher response tends to lag, resulting

in incrementally wider swings in policy. This iterative process is depicted in Figure 1.1. The

most recent wave of education reform efforts can be traced to the publication of A Nation at Risk

in 1983. Perhaps in tandem with increased public demand for accountability and transparency,

the standards movement unintentionally led teachers away “from evaluating student knowledge

and ability for a successful life to an evaluation of the different school systems, different

teachers, schools, and practices” (Colwell, 2008, p. 6). Nonetheless, national standards, and the

assessment measures designed from them, provided educational stakeholders with clear,

measurable objectives of what students should know and be able to do.

Figure 1.1

Cyclical Response of National Organizations to Reform Efforts

National teacher organizations also responded to demands for increased instructional

quality in the early 1990’s. The seven Standards for Teacher Competence in Educational

Assessment of Students (STCEAS) were developed by the American Federation of Teachers

(AFT), National Council on Measurement in Education (NCME), and the National Education

Association (NEA) to address inadequate assessment training in teacher preparation programs.

These standards outline the competencies required for teachers to be considered assessment

literate (Table 1.1) (AFT, NCME, & NEA, 1990). The goal of these standards was to provide “a

guide for teacher educators in their work with teacher education programs, a self-assessment

guide for teachers, a guide for workshop instructors, and an impetus for educational

measurement instructors to conceptualize student assessment more broadly than had been done

in the past” (Brookhart, 2011, p. 3).

Table 1.1. The Standards for Teacher Competence in Educational Assessment of Students

Teachers should be skilled in...

1. ...choosing assessment methods appropriate for instructional decisions.

2. ...developing assessment methods appropriate for instructional decisions.

3. ...administering, scoring and interpreting the results of both externally-produced and teacher-produced assessment methods.

4. ...using assessment results when making decisions about individual students, planning teaching, developing curriculum, and school improvement.

5. ...developing valid pupil grading procedures which use pupil assessments.

6. ...communicating assessment results to students, parents, other lay audiences, and other educators.

7. ...recognizing unethical, illegal, and otherwise inappropriate assessment methods and uses of assessment information.

The STCEAS also served as a blueprint for subsequent researchers to measure the

assessment literacy of preservice and inservice teachers, and design programming intended to

enhance their literacy. Beginning with the Teacher Assessment Literacy Questionnaire (TALQ)

(Impara et al., 1993), the Classroom Assessment Literacy Inventory (CALI) (Mertler, 2004), and

then the Assessment Literacy Inventory (ALI) (Mertler & Campbell, 2005), psychometricians and

education experts have measured assessment literacy almost exclusively through the

competencies described in STCEAS. This is not to say the STCEAS have been accepted without

criticism. A number of psychometricians and researchers have examined the internal structure of

the TALQ (Alkarusi, 2015), the CALI (Gotch & French, 2015; Ryan, 2019; Xu & Brown, 2016),

and the ALI (Hailaya et al. 2014). Critics have found literacy measures associated with these

inventories to have adequate reliability (i.e., internal consistency) for preservice teachers, but

insufficient reliability for inservice teachers. They also argue -- depending upon the analysis

technique employed -- that the factors (i.e., these inventories typically assign five items to each

standard) do not neatly correspond to the standards. Recently, scholars (DeLuca et al., 2016;

Gotch & French, 2014; Xu & Brown, 2016) have offered contemporary interpretations or

addendums to the original seven standards published thirty years ago. Brookhart (2011) and

Popham (2009) previously suggested modest updates to the standards, as well.
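To make the reliability critique above concrete, what follows is a minimal, hypothetical sketch of how internal consistency (Cronbach's alpha) can be computed for a dichotomously scored inventory such as the CALI; the score matrix is invented for illustration and drawn from none of the studies cited.

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1)      # variance of each item
        total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Hypothetical 0/1 (incorrect/correct) responses: six teachers, four items
    scores = np.array([
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
    ])
    print(f"alpha = {cronbach_alpha(scores):.2f}")

By convention, alpha values of roughly .70 or higher are treated as adequate for research purposes; the critiques above suggest these inventories clear that bar for preservice samples but not for inservice samples.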

Brookhart (2011) published an updated list of competencies, increasing the total number

to eleven, in order to address 21st century concerns, such as assessing diverse learners and using

technology appropriately to facilitate assessment. Assessment scholar James Popham (2009)

identified thirteen target skills and knowledge areas corresponding to assessment literacy,

including identifying, constructing, implementing, and interpreting assessments or assessment data.

Popham argued that to be assessment literate, teachers needed knowledge and skills relevant for

both classroom-level and large-scale assessments. These knowledge and skill targets reflect the

myriad functions of assessment in the current educational landscape, and extend the

competencies outlined by the STCEAS. Popham’s incorporation of the accountability functions

of assessment hold important implications for inservice teachers, who are increasingly asked to

improve student performance on high-stakes measures and evaluated, in part, on the basis of

those same measures.

In keeping with the iterative nature of educational reform and teacher response, the

arguably lax assessment environment of the early 1990’s (Brookhart, 2001) swung toward an

overemphasis on high-stakes testing during the 2000’s (Stiggins, 2002, p. 760). Reform in the

“Era of Accountability” has no doubt contributed to a widespread skepticism of assessment (i.e.,

its accountability purpose, and the high-stakes, standardized methods often used) by teachers

(Barnes et al., 2017). Currently, the pendulum of education reform appears poised to swing in the

direction of classroom-based efforts to improve student outcomes, including the use of teacher-

devised assessments (Brookhart, 2011). Researchers have suggested that classroom-based

assessment can increase student motivation and engagement (Earl & Katz, 2006). If history is

any indication, however, teacher practices are unlikely to evolve in such a manner that the

benefits of classroom-level assessment are fully realized; teachers’ general mistrust and

misapplication of assessment may be difficult to shake (Pishghadam et al., 2014; Steinberg,

2008).

These societal trends have undoubtedly affected the assessment beliefs and practices of

music teachers. Perhaps due to the contextual demands of their positions, music teachers appear

to lag behind general education peers with regards to adopting sound assessment principles

(Austin & Russell, 2017; Russell & Austin, 2010), despite the efforts of national organizations to

instill awareness and knowledge about assessment principles and practices. In the 1990's, MENC

responded to the standards-based reform movement by (a) developing national standards for

music teachers (1994); (b) publishing a corresponding handbook, Performance Standards for

Music (1996); and (c) assembling a collection of 31 articles replete with rubrics, strategies, and

other assessment best practices, Spotlight on Assessment in Music Education (2001).

Nearly ten years later, leaders within the National Association for Music Education

(NAfME) were still responding to policy pressures to improve music teacher assessment

knowledge and practices. In 2009, NAfME released an official position statement on

“Assessment in Music Education” wherein they asserted that “assessment, and the accountability

that stems from the public dissemination of the results of assessment, are key components in

building quality instructional programs.” NAfME also addressed the challenges imposed by

high-stakes assessment and logistical obstacles of designing, implementing, and analyzing

classroom assessments for music educators. However, the most strongly worded statement was

about the responsibility of the music educator:

While the forms and content of music assessment may appropriately vary, some form of

regular assessment of music programs should be adopted. The assessment should

measure student learning across a range of standards representative of quality,

balanced music curriculum, including not only responding to music but also

creating and performing music. This assessment should serve the goal of educational

accountability by providing data that can be included in the school- or district-level

“report card” disseminated to the public as required by law. [Bolded by NAfME]

Colwell and other experts have argued that efforts to reform teachers’ assessment

practices have had “minimal impact upon the classroom” with regards to “any effort to improve

teaching and learning in music” (Colwell, 2008, p. 7). While teachers may be generally more aware of

sound assessment principles, they remain unlikely to employ them in practice (Sears, 2002).

Researchers conducting studies after the standards-based reforms of the 1990’s, but prior to the

accountability-based reforms of No Child Left Behind (2001), found that the majority of

teachers’ assessment practices tended to be informal (i.e., observational, large-group feedback

during instruction) and that grades served accountability and motivational functions (i.e.,

behavioral or attitudinal criteria constituted a large proportion of grades) (Hanzlik, 2001; Hill,

1999; McClung, 1996; Simanton, 2001). In several studies, researchers found that innovative

assessment practices, such as student portfolios, accounted for less than 20% of music teachers’

assessment strategies (Hanzlik, 2001; Hill, 1999; Simanton, 2001). Thus, while national efforts

to enhance assessment practices among music teachers have taken hold to a certain extent,

changes have been slow.

More recently, researchers have found that music teachers tend to focus their classroom

assessment practices on the evaluation of student performance skills, but non-musical criteria are

still emphasized to a considerable extent when assigning grades (Austin & Russell, 2017;

Kancianic, 2006; LaCognata, 2010; Russell & Austin, 2010; Sherman, 2006; St. Pierre &

Wuttke, 2017). The disconnect between policy efforts to improve how music teachers assess and

grade their students, and their actual practices, is complex. Music teachers experience the same

policy pressures as their peer educators outside of music, but these pressures are compounded by

unique features within their jobs: community expectations for performances; program history

and enrollments; music teachers working with many of the same students over multiple years

(i.e., social consequences affecting and arising from assessment); and competition for resources

and time. These issues frame the beliefs music teachers may hold about assessment. Some

researchers have cited the importance of autonomy and its impact on teacher beliefs about

assessment (Box et al., 2015; Fulmer et al., 2014; Simanton, 2001), while others attribute

deficient practices to a lack of adequate training in assessment (Austin & Russell, 2016; Russell

& Austin, 2010; St. Pierre & Wuttke, 2017). Other researchers have cited participant

philosophies about the purpose of a music education (i.e., music classes should be focused on

non-academic outcomes: fun, enjoyment, community engagement, etc.) (LaCognata, 2010;

Richerme, 2016). This attitude is particularly common amongst secondary music educators, who

often self-identify as directors rather than music educators (Isbell, 2008), and have expressed

feelings that assessment is outside the purview of their role (Denis, 2018).

Definitions & Purposes of Assessment

There is little doubt that some of the confusion surrounding the purposes of assessment

can be attributed to the myriad definitions of assessment. In a 2009 chapter of Assessment

Policy: Making Sense of the Babel, Herman and Baker astutely observed that:

“Assessment, test, measure, instrument, [or] examination metric? Tests or assessments

are instruments used to collect and provide information; they are composed of measures

that can be numerically summarized...A metric is an indicator divided by some other

variable, such as time or cost. Although the term “test” often connotes more traditional

kinds of measures, and assessment a wider array of tasks and item types, we use the two

terms interchangeably.” (p. 176)

Herman and Baker commented upon but a few dimensions of assessment: type, form, and

method. When such terms are used interchangeably, it is no wonder confusion abounds. Indeed, a review of

music education research literature confirms there are numerous definitions for assessment.

General education researchers have fared no better in their efforts to clearly explain what

assessment means or entails. When defining assessment, some experts focus on the means by

which learning information may be collected, while others focus on the process or outcome. This

conceptual disparity over what assessment is and does reflects the confusion surrounding its

dimensions (e.g., structure, scale, interpretation, and consequences). In the most general sense,

assessment could be defined as any range of methods or processes applied to the pursuit of

evaluating learner performance (Popham, 2009). For the purposes of this study, assessment is

defined as the process, inclusive of any purpose or method, of evaluating student learning before,

during, or after the instructional cycle.

Upon reaching definitional consensus, identifying the desired function for an assessment

is critical. The same assessment tool could be used for multiple assessment purposes. Teachers

utilize assessments for a number of curricular and non-curricular purposes in instruction,

including diagnosing areas of improvement or need for students, placing students in instructional

groups or supplemental programs, assigning grades, providing feedback about progress to

students and parents, controlling student behavior, communicating achievement expectations,

and teaching concepts and skills to students (Airasian, 2004). Researchers and educational

experts have routinely criticized teachers for this “hodgepodge” of academic and nonacademic

purposes informing their classroom assessment practices (McMillan, 2001, 2003, p. 34; Schafer,

1993).

Assessment functions are partially determined by their location in the overall

instructional cycle (Figure 1.2). In a 2015 brief, the U.S. Department of Education defined the

instructional cycle as four recurring components: “selecting an instructional strategy,

implementing the strategy, collecting data on strategy implementation (i.e., assessment), and

analyzing the data and reflecting on the results” (p. 2). This cycle can take place over the course

of a lesson, and/or be nested within a larger curricular design. An assessment -- even the same

assessment tool -- placed at any point throughout this cycle would have a distinct purpose. In a

special issue of Music Educators Journal, Goolsby (1999) defined four primary purposes of

assessment: placement, summative, diagnostic, and formative. All four can be distinguished by

the time in the instructional cycle during which they occur, as well as the kind of educational

decisions for which such assessments are used.

Figure 1.2

The Cyclic Nature of Assessment & Educational Decision Making in Instruction

Placement assessments typically occur prior to instruction and provide the teacher with

information about students’ abilities needed to properly place them within groups. In a music

context, such assessments may include auditions, ensemble seating/part assignments, and seat

challenges. Summative assessments typically occur at the conclusion of a teaching cycle and

provide information about a group or individual’s mastery of the content. Concerts, festivals,

recitals, and other kinds of performances or musical products are classic examples of summative

assessments in music settings. Diagnostic assessments are used to determine where learning

difficulties exist for students, so that teachers can tailor instruction to remedy gaps in knowledge

or skill. For music teachers, this often requires using task analysis to determine which musical

element students are struggling with (i.e., rhythmic, melodic, harmonic, or technical elements of

the music), and devising exercises to strengthen deficiencies. Formative assessments are used

throughout the instructional cycle to provide feedback to students about their progress in meeting

a desired and explicitly stated learning outcome. In the music classroom this may look like

feedback from the teacher during rehearsals, short quizzes, exit tickets, or other checks for

understanding.

More recently, scholars and practitioners have reimagined the purposes of assessment

using the language assessment of learning, assessment for learning, and assessment as learning

(Scott, 2012). This shift in the language about the purposes of assessment captures the parallel

shift in curriculum design toward a student-centered model. In a 2006 guide developed by Earl

and Katz on behalf of the Canadian government, full chapters were devoted to unpacking the

definition of each term, the curricular implications, and assessment practices corresponding to

each purpose. According to Earl and Katz (2006), “assessment of learning refers to strategies

designed to confirm what students know, demonstrate whether or not they have met curriculum

outcomes or the goals of their individualized programs, or to certify proficiency and make

decisions about students’ future programs or placements” (p. 55). Thus, assessment of learning

encompasses the summative and placement purposes, as well as some of the accountability

functions of assessment (Stiggins, 2002). Often, assessment of learning is separated from the act

of teaching and learning within the instructional cycle. It is also inclusive of high-stakes forms of

assessment. With this type of assessment, teachers are responsible for collecting and analyzing

data and awarding grades; according to Scott (2012), such assessment is something that is “done

to students” (p. 32).

Assessment as learning “focuses on students and emphasizes assessment as a process of

metacognition (knowledge of one’s own thought processes) for students…[and] emerges from the

idea that learning is not just a matter of transferring ideas from someone who is knowledgeable

to someone who is not…” (Earl & Katz, 2006, p. 41). Thus, assessment as learning

acknowledges that learning is flexible and fluid rather than linear and rigid. The role of the

teacher is to facilitate student understanding by designing instruction that allows students to

think about and monitor their own learning (p. 42). Assessment as learning is rooted in

reflection-as-practice, and positions assessment as something that is “done by” students (Scott,

2016, p. 32). While on some level, this is the most aspirational form of assessment, the

consensus among experts, policymakers, and researchers is that assessment for learning

represents the most actionable means of changing teaching practice and enhancing student

learning outcomes.

Assessment for learning “occurs throughout the learning process...it is designed to make

each student’s understanding visible, so that teachers can decide what they can do to help

students progress” (Earl & Katz, 2006, p. 29). This definition encompasses the formative and

diagnostic purposes of assessment, and fully empowers the teacher to make educational

decisions about instruction based on ongoing assessment interpretations and feedback to

students. Assessment for learning has gained traction as a way to reimagine assessment within a

post-high-stakes assessment environment (Hansen, 2019; Stiggins, 2004, 2005; Wiggins &

McTighe, 2006). This orientation positions assessment as something that is “done for” students

to enhance instruction and learning (i.e., educative assessment), positions teachers as

collaborative facilitators, and embeds assessment practice directly in instruction, rather than as a

stand-alone event (Scott, 2016, p. 32).

Ideally, teachers utilize a variety of assessment strategies and purposes throughout the

course of instruction. No single purpose of assessment should supersede another; balance is

critical to meeting the needs of students, teachers, administrators, and other educational

stakeholders. Educational experts have established the motivational and engagement benefits of

assessment for learning (Earl & Katz, 2006; Stiggins, 2004, 2005). Assessment of learning has

been used to determine proficiency throughout a student’s K-12 and higher education experience

for decades, and to make important decisions about promotion, credentialing, and the quality of

instruction being provided to students. Assessment as learning represents a newer perspective --

one of empowering students as co-constructors of learning and understanding in the classroom

(Earl & Katz, 2006). Collectively and ideally, the well-balanced adoption of these varied

assessment functions meets the multifaceted needs of all educational stakeholders.

Assessment Literacy

Researchers have shown that basic knowledge of assessment is not sufficient to change

assessment practice. That is, teachers can be generally knowledgeable about what kinds of

assessments are available to them, but not make connections between that knowledge and the

instructional practices that follow. Understanding both the components and processes of

assessment constitutes procedural knowledge or assessment literacy. Assessment literacy has

been defined as “an understanding of the principles of sound assessment” (Crusan et al., 2016, p.

43). Stiggins stated that “assessment literates know the difference between sound and unsound

assessment...they are not intimidated by the sometimes mysterious and always daunting technical

world of assessment” (cited in Mertler, 2009, p. 102). Functionally, assessment literate “teachers

must not only be competent to develop and use high-quality authentic assessments and scoring

rubrics, but also be able to master evaluative skills to make sound judgments about student

performance” (Koh, 2011). Most importantly, assessment literacy is contextually situated; Willis

et al. (2013) explained:

“Assessment literacy is a dynamic context dependent social practice that involves

teachers articulating and negotiating classroom and cultural knowledges with one another

and learners, in the initiation, development, and practice of assessment to achieve the

learning goals of students” (p. 2).

Thus, to be assessment literate is to be both fluent and adaptable in the knowledge and uses of

assessment within a specific context.

Assessment literacy is a transferable knowledge base and set of skills that may be

developed by both preservice and inservice teachers. Ryan (2019) adapted the Classroom

Assessment Literacy Inventory (CALI) to include parallel measures of assessment literacy and

confidence. Ryan found significant relationships between preservice teacher GPA and

assessment knowledge, confidence, and edTPA assessment task ratings. This suggests that there

may be a link between assessment literacy and confidence in affecting teacher practice, as

preservice teachers are required to demonstrate assessment competencies through five rubric-

based evaluations associated with the edTPA assessment task. What remains unclear is the

degree to which confidence about assessment literacy reflects ongoing teacher learning about

assessment as opposed to existing practices or beliefs.

Assessment Beliefs

Some researchers posit that increasing assessment literacy in the teacher workforce is not

sufficient to effect change in their practices (Ludwig, 2013; Ryan, 2019; Xu & Brown, 2016).

Researchers have suggested that teacher beliefs about assessment may be as important as their

assessment knowledge (Barnes et al., 2017; Brookhart, 2011; Deenan & Brown, 2016). The ways

teachers conceptualize assessment (i.e., what assessments are, and their purposes), and their

feelings surrounding assessment (i.e., value judgements, past experiences, and preferences), can

directly impact their educational decision making (Deenan & Brown, 2016). To illustrate, a

teacher who is very knowledgeable about assessment methods, but who assigns a lower priority

to assessment than planning and instruction, may opt for the assessment methods that require the

least effort, even if they are not especially effective.

Assessment conceptions encompass “teachers’ general views about what assessment is

and its purposes in school and in society” (Fulmer et al., 2015, p. 479). Researchers have found

that teachers conceive of assessment in three broad ways: for instructional feedback (i.e.,

formative feedback to student, instructional decision making for the educator), for accountability

(i.e., of students, in grading practices, of their teaching, etc.), and as irrelevant (i.e., high-stakes

standardized tests that do not provide disaggregated data at the classroom level, testing that takes

away from instructional time, etc.) (Opre, 2015).

The terms beliefs and conceptions are used interchangeably within the literature;

however, beliefs are understood to be inclusive of conceptions (Fulmer et al., 2015). Assessment

beliefs comprise the values, conceptions, and attitudes teachers hold; they “merge affect and

concept” (William, 1979, as cited in Fulmer et al., 2015, p. 478). Teachers’ assessment beliefs

may also be linked to confidence in their knowledge and skills. Austin and Russell (2019)

collected data about preservice music teachers’ (N = 75) confidence in their ability to properly

assess K-12 music students. They found that preservice teachers with more assessment training

felt more confident in their knowledge and skills. Ryan (2018) collected information about

preservice teachers’ confidence in their assessment knowledge but found that those with the least

knowledge were typically the most confident. Ludwig (2013) collected data from 160 teachers

about their assessment confidence and beliefs. Ludwig also found that the most well-trained and

confident teachers tended to have the most positive conceptions of assessment. While

recognizing that confidence and positive conceptions may not translate into implementation of

sound assessment practices, Ludwig concluded that increasing assessment literacy in teachers,

and providing opportunities for teachers to collaborate and reflect on their conceptions, were

necessary to improve student involvement in assessment (2013, p. iv). Consequently, it may be

important to consider how teachers’ beliefs about assessment, in tandem with their assessment

literacy, inform assessment practices.

Teacher Preparation and Development

Despite the development of the Standards for Teacher Competence in Educational

Assessment of Students (STCEAS), a shift toward standards based-curriculum, and political

support for accountability-based educational reform -- all of which were intended to elevate

teachers’ assessment competencies -- education expert Stiggins argues that these efforts

“paradoxically, have been barriers to developing assessment competence in the classroom”

(2014). One such barrier is the slow pace of teacher preparation programs in making substantive

changes to assessment curriculum amidst competing space for courses tied to teacher

certification and licensing (Darling-Hammond et al., 2002; DeLuca & Klinger, 2011; DeLuca et

al., 2010; Gareis & Grant, 2015). Mertler noted, “ironically, in this age of increased emphasis on

testing and assessment, many colleges of education and state education agencies do not require

preservice teachers to complete specific coursework in classroom assessment” (2004, p. 50).

While teachers are pressured to increase student learning outcomes, and evidence suggests that

effective teacher-constructed assessments can contribute to significant gains on standardized

tests (Mertler, 2004), preservice teachers are not provided adequate training. It is little

wonder that “beginning teachers continue to feel unprepared to assess student learning”

(DeLuca & Bellara, 2013, p. 357, as cited in Gareis & Grant, 2015).

Some teacher preparation programs have responded by developing stand-alone courses

for minimal credit (Austin & Russell, 2016, 2019; Gareis & Grant, 2015). The effectiveness of

such additions to the teacher education curriculum has not been conclusively established. Gutierrez (2014)

found that “the amount of teacher training explain[ed] 17% of the variability in classroom

assessment practices, while teachers’ assessment knowledge explain[ed] 38% of such variability

in assessment practices…the amount of teacher training did not significantly predict teachers’

assessment knowledge” (p. 4). Clearly, beyond offering units or entire courses focused on

assessment to preservice teachers, it is important that the instruction associated with those

offerings be of a high enough quality to enhance both assessment knowledge and practical

application of that knowledge in assessing learning. Undergraduate music education curricula

are characteristically dense, with few programs offering formal training in assessment principles

(May et al., 2017). Austin and Russell (2019) found that preservice music teachers who received

a greater amount of assessment education valued assessment more and were more confident in

their assessment abilities, but still anticipated using assessment, at least in part, to target and

document non-musical outcomes (e.g., behavioral compliance, rehearsal attendance, punctuality,

citizenship, etc.). Few students had experience implementing assessments in their field

placements, however, which may have limited their ability to conceive of or appreciate how

assessments might be used to target and promote music learning.
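As a concrete aside on the variance-explained figures reported by Gutierrez (2014) above: “variance explained” is the R-squared statistic from a regression of practices on a predictor. The sketch below uses simulated data, invented purely for illustration and not drawn from any study, to show how such percentages are derived.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data for 50 teachers (illustrative only, not study data):
    # hours of assessment training, a knowledge score, and a practices index
    training = rng.uniform(0, 40, 50)
    knowledge = 0.5 * training + rng.normal(0, 8, 50)
    practices = 0.2 * training + 0.6 * knowledge + rng.normal(0, 10, 50)

    def r_squared(predictor: np.ndarray, outcome: np.ndarray) -> float:
        """Proportion of outcome variance explained by one predictor (R^2)."""
        X = np.column_stack([np.ones_like(predictor), predictor])
        coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
        residuals = outcome - X @ coef
        return 1 - residuals.var() / outcome.var()

    print(f"R^2, training -> practices:  {r_squared(training, practices):.2f}")
    print(f"R^2, knowledge -> practices: {r_squared(knowledge, practices):.2f}")

Read against Gutierrez’s figures, values of .17 and .38 for these two quantities would mean that assessment knowledge accounts for roughly twice the variability in classroom practices that training alone does.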

Given that undergraduate music education majors typically report that few (if any) class

sessions are devoted to assessment throughout their teacher preparation course work, graduate

study might be considered a stopgap or compensatory means of improving music teacher

assessment literacy, confidence, and practices (Austin & Russell, 2019). Yet, there are no

guarantees that music teachers engaged in graduate education will have access to high-quality

assessment courses. Austin & Russell (2016) estimated that fewer than half of all students

pursuing master’s degrees in music would complete an entire course on assessment; only 58% of

institutions reported offering stand-alone courses, and only 72% of those that offered an

assessment required graduate students to take the course. Thus, it may be important to consider

how professional development programming could be used to make assessment education

accessible to more music teachers, and whether this instructional format (which would

necessarily be less comprehensive and more condensed than a dedicated course) might be

suitable for enhancing assessment literacy, beliefs, and practices. Additionally, education reform

efforts and trends tend to shift faster than teacher preparation and graduate education programs can respond. While providing a foundational understanding of effective assessment

principles should be an attainable objective for teacher preparation programs, ongoing education

is necessary to keep assessment principles and practices current and relevant to inservice

teachers.

Professional Development and Assessment Literacy

One possible avenue for addressing teachers’ assessment literacy, beliefs, and practices is

through inservice teacher professional development. Professional development, like assessment

itself, has proven difficult to define given the “broad-based assumption that teachers already

know what professional development is” (vanOostveen et al., 2019, p. 1876). Generally,

researchers ascribe desired outcomes for professional development, or compare it to other

professions’ development activities. Because teachers are continually developing instructional

competencies, some researchers have shifted from the language “professional development” to

“job-embedded professional learning” (Zepeda, 2019, p. 3). Regardless of terminology,


professional development can be defined as any range of learning opportunities provided for

teachers to improve instructional practice in the service of student learning outcomes.

Professional development has traditionally been delivered in a face-to-face format

(McConnell et al., 2013). Teachers now utilize novel online information and communication

technology (ICT) for professional development opportunities (Wasserman & Migdal, 2019).

Whereas traditional formats often require all participants to meet for a short period of time in a

prescribed location -- which may present logistical obstacles for many teachers -- online formats

provide teachers from different districts and schools the chance to engage in collaborative

professional learning communities (PLCs) at a distance. Further, the kinds of activities that

teachers can engage in through online professional development are often unique in comparison

to face-to-face formats. For example, vanOostveen et al. (2019) investigated the efficacy of

Professional Development Learning Environments (PDLEs; a series of learning tasks and video-

based case studies) in online professional development and found evidence of “some effect on

beliefs about personal theories of learning” (p. 1864). Boling et al. (2011) found that online

professional development was most effective when designers eschewed traditional (i.e., face-to-

face) activities, and fully embraced the pedagogical opportunities of digital formats.

Online professional development is also a potentially viable option for alleviating teacher

concerns of relevancy and effectiveness. One of teachers’ primary concerns is that professional

development may not address their unique contextual factors (i.e., school- or district-level factors,

content area, and experience level of the teachers) (Guskey, 2003, 2009). Often, face-to-face

formats (e.g., workshops, conferences, coaching, professional communities, etc.) necessitate

presenting educational information in a general manner that is incompatible with teachers’

personal needs (Cook et al., 2017). Online professional development holds the potential for more


individualized and longer-term engagement with teachers than periodic or short-term face-to-

face formats.

During development of the STCEAS, the collaborating organizations emphasized that

teacher assessment literacy must be cultivated during preservice and inservice teacher education

(AFT et al., 1990). Wang et al. (2008) described four models for developing teacher assessment

literacy ranging from face-to-face training to graduate coursework for credit. As a test of one

model, they implemented an online training system for preservice teachers to improve

assessment literacy and found that those in the treatment group improved their assessment

knowledge and conceptions (changes in actual assessment practices were not considered). In

2001, staff in Lincoln (Nebraska) Public Schools sought assistance in developing ongoing

professional development in assessment literacy for teachers (Lukin et al., 2004). The

subsequent program, the Assessment Literacy Learning Team (ALLT), was implemented with

approximately five percent of teachers. This program was unique in that teachers’ literacy,

confidence, and the quality of their subsequent assessments were appraised, as well as student

attitudes about themselves as learners. Researchers reported that teacher assessment literacy, confidence, and the quality of classroom assessments improved, as did student beliefs and attitudes.

Huai et al. (2006) conducted a quasi-experimental study with 55 teachers from Arizona,

South Carolina, and Wisconsin to evaluate the effectiveness of an online professional

development program, Assessing One and All, in increasing teachers’ assessment literacy and

practices. Teachers completed the three-month online course, maintained journals, and kept

records of any other professional development they attended for the duration of the intervention.

Measures of assessment literacy and knowledge of assessment practices, obtained prior to and

following the course, were compared. Huai et al. found that “the multimedia, web-based AOA


course was effective in improving participants’ knowledge and self-efficacy with regard to

general and inclusive educational assessments” (p. 257). This study was unique in that it

evaluated an exclusively online professional development program and required teachers to

maintain journals of their experiences.

Long-term or intensive interventions may be key to changing teacher assessment

practices. Mertler (2009) designed a two-week intervention for inservice teachers in a

mixed methods study. He found that the intensive was effective in increasing assessment literacy

scores, as well as improving teacher perceptions of assessment as reported in their journals. Like

Wang et al., Mertler did not examine whether the intervention changed teacher assessment

practices in the long term. Koh (2011) investigated whether ongoing and sustained professional

development activities or short-term, one-shot workshops were more effective in cultivating

increased assessment literacy among teachers. Koh directed professional development for two

groups of teachers randomly assigned to each condition. While teachers receiving one-shot

workshops demonstrated improved performance on assessment tasks in the near-term, teachers in

the sustained professional development group had higher levels of assessment literacy one year

later.

Study Need and Significance

Teachers face increasing pressure from political, societal, and local forces to provide

evidence that students are learning at an appropriate level (Colwell, 2008; Mertler, 2009;

Nierman & Colwell, 2019; Stiggins, 2014). Past emphasis on the accountability purposes of

assessment, however, coupled with inadequate assessment training, have left teachers skeptical

and uninformed as to how assessment might be used to improve teaching and learning at the

classroom level (Stiggins, 2014). Teachers also use assessments for purposes other than


providing feedback or measuring achievement (McMillan, 2003), which undermines the integrity

of the assessment process and diminishes trust in teachers and the information they share about

student learning. Teachers’ educational decision making is a complex process that encompasses

their knowledge, beliefs, expectations, values, their classroom contexts, and external factors

(e.g., local and state policies, parents, and demands from administration; see Figure 1.3). Their

subsequent assessment practices reflect the intersecting – and potentially conflicting – forces at

play. To date, researchers in music education have primarily focused on music teachers’

assessment practices, and devoted nominal attention to music teachers’ assessment literacy and

beliefs. General education researchers have yet to explore the interaction of assessment literacy,

beliefs, and practices, although it is an often-cited implication for future research efforts (Fan et al., 2011; Mertler, 2009; Quilter & Gallini, 2000).

Figure 1.3

Teachers' Classroom Assessment Decision Making

Adapted from McMillan (2003)


Possible approaches to increasing assessment literacy among teachers include adding

assessment coursework to teacher preparation programs (DeLuca & Klinger, 2010), or enhancing

inservice teacher assessment literacy through professional development (Huai et al., 2006; Koh,

2011; Mertler, 2009; Wang et al., 2008). In research addressing music teacher perceptions of

preservice coursework, participants have reported poor connections (i.e., a decontextualized

experience) between theory and practice (Conway, 2002, 2012). General education researchers

have also noted that the “combination of theoretical and practical study is a particularly

important change from the traditional approach, which front-loads theory, does not enable

applications, and therefore does not support grounded analysis of teaching and learning”

(Darling-Hammond, 2006, p. 154). The same criticism can be leveled toward poorly constructed

professional development. If professional development is to enhance teacher assessment literacy, integrating theory with practical application in teachers’ classrooms is critical to grounding teachers’ confidence and beliefs.

To date, efforts to expand inservice teacher assessment literacy through professional

development have consisted mainly of short-term face-to-face interventions (Lukin et al., 2004;

Mertler, 2009), online interventions (Koh, 2011), or a mixture of the two (Wang et al., 2008). All

have reported varying levels of success in enhancing teachers’ assessment literacy, while only

some have attempted to measure teachers’ confidence in their assessment abilities, beliefs about

assessment, or changes in assessment practices. Given that teachers’ assessment practices are a

complex sum of teachers’ knowledge, beliefs, and contexts, providing professional development

that addresses all these components appears vital to changing educational decision-making

processes in the long term. Such an endeavor will not be without challenge. Teachers’ educational

decision-making processes are impacted by the complex and recursive interaction of their


philosophical beliefs, logistical challenges and realities, and external social and political

pressures (McMillan, 2003). Yet, providing professional development targeted toward enhancing

— even incrementally — music teachers’ assessment literacy and practices, while engaging them

in reflection about their beliefs, may have an impact on their subsequent educational decision

making.

This study was unique and significant in several ways. First, I investigated the

intersection of assessment literacy, beliefs, and practices; as previously mentioned, researchers in

general and music education have yet to simultaneously explore these concepts. I also adapted a

measure of assessment literacy — the Classroom Assessment Literacy Inventory by Mertler

(2000) — to a music education context, within an experimental research design that allowed me

to make statistical comparisons across conditions and time. Further, the use of an intervention to

enhance music teachers’ assessment literacy, beliefs, and practices is a novel approach that has

not previously been employed in music education research.

Purpose & Research Questions

The primary purpose of this study was to examine the impact of an online professional

development intervention on music teachers’ assessment literacy, beliefs, and practices. The

intervention was designed to improve music teachers’ classroom assessment literacy, beliefs, and

practices via a four-week online, module-based course. Assessment literacy was measured using

an adapted version of Mertler’s (2000) Classroom Assessment Literacy Inventory (CALI). To

measure assessment practices, as represented by the forms and functions of assessment most

frequently implemented in the classroom, participants responded to a researcher-devised measure

(the Music Teacher Assessment Implementation Inventory). Finally, to evaluate music teacher


beliefs, participants indicated their level of agreement with 17 statements comprising the Music

Teacher Assessment Beliefs Inventory (MTABI).

The research questions for this investigation were:

1. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment literacy?

2. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment beliefs?

3. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment practices?

4. Are there significant relationships between music teachers’ assessment literacy, beliefs,

and practices?

Hypotheses

Based upon the literature, I formed directional hypotheses about how these variables

would evolve throughout the study. With regard to assessment literacy, I anticipated that music

teachers would score poorly across groups (i.e., control and intervention) on the pretest, and that

those in the intervention group would show growth; Mertler (2002, 2009) demonstrated that

teacher knowledge is relatively malleable with direct and/or sustained intervention. I also

hypothesized that music teachers’ beliefs about assessment would reflect the findings of Austin

and Russell (2017); those with less knowledge or experience with assessment would tend to

value it less than peers with more knowledge or experience. Researchers have continually

reported -- and lamented -- widespread use of informal assessment, or behavioral and

accountability uses of assessment, within the music teaching profession (Hanzlik, 2001; Hill,

1999; Kancianic, 2006; LaCognata, 2010; McClung, 1996; Sears, 2002; Simanton, 2001).


Consequently, I anticipated that music teachers’ pretest responses would mirror these prior findings. I hoped that the intervention group would show increased usage of formal

formative assessments (e.g., written assignments, projects, portfolios, individual playing

assessments), and less reliance on behavioral measures such as attendance and participation. I

also hoped that those in the intervention group would see the utility of assessment for serving a

wider variety of functions in the music classroom, such as placement and diagnostic functions,

rather than a reliance on summative and extramusical functions. Collectively, I anticipated that

these findings would lend credence to the conceptual model McMillan (2003) designed by demonstrating that changes to music teachers’ knowledge and beliefs surrounding assessment could be leveraged to change practices.

Definitions

For the purposes of this study, I defined the most salient terms and constructs; these

definitions reflect their use within education and assessment research literature. Assessment

refers to the process of gathering or eliciting information about student learning, and any specific

methods for doing so. This definition accounts for the various dimensions of assessment: its

structure (e.g., informal, formal, standardized), the scale (e.g., classroom-level, state-level,

national-level), format (e.g., traditional or alternative), purpose (e.g., diagnostic, placement,

formative, or summative), interpretation criteria (i.e., in relation to prior achievement, standards,

or norms), and the consequences or outcomes attached to results (i.e., no stakes, low-stakes, or

high-stakes).

Assessment literacy is the adaptable knowledge of processes and methods used to

evaluate student learning, as well as the practices best suited for specific learning contexts. It is

not enough to know the processes and methods associated with assessment; teachers must also be


able to use them to make sound educational decisions in a variety of contexts. Assessment

literacy “involves the understanding and appropriate use of assessment practices along with the

knowledge of the theoretical and philosophical underpinnings in the measurement of students’

learning” (DeLuca & Klinger, 2010, p. 420). This distinction -- between knowing and using -- is

a core tenet of modern cognitive theory (Wiggins & McTighe, 2005).

Assessment beliefs are the conceptions and values teachers hold about assessment.

Conceptions include teachers’ knowledge about what assessment is and personal views about

how it should be used. Teachers’ beliefs about the fundamental purposes and goals of assessment

may influence their practices, as well as their responses to efforts designed to alter those

practices. Examining teachers’ assessment conceptions in tandem with their assessment literacy

may prove integral to changing teacher assessment practices in the long-term.

Assessment practices encompass the specific forms and functions for which music

educators gather information about student learning. Researchers in general education and music

education have examined both the specific forms (e.g., written assessments, individual

performance tasks, group performance tasks, etc.) and the purposes for which teachers employed

assessments (e.g., formative, summative, accountability, etc.). Because there is not definitional

consensus, I elected to collect information about both the forms of assessment music teachers use and the functions those assessments serve in their classrooms.

Professional development encompasses any range of activities and methods geared

toward improving teacher practice in the service of improving student learning outcomes. Thus,

professional development can occur in face-to-face or online formats. For the purposes of this

study, professional development consisted of an online intervention delivered over the course of

four weeks. There were four modules, each one week in length, corresponding to the first four


STCEAS competencies. These standards state that teachers should be skilled in (a) choosing

appropriate assessment methods, (b) developing appropriate assessment methods, (c)

administering, scoring, and interpreting externally produced and teacher-produced assessments,

and (d) using assessment results for educational decision making.

Delimitations

The results of this investigation were examined with the following delimitations in mind.

Assessment Literacy

The Standards for Teacher Competence in Educational Assessment of Students (STCEAS)

encompass seven unique competencies that teachers must possess to be considered assessment

literate. However, it is evident from the general and music education literature that many

teachers struggle with the first four standards, and that these are the standards most closely

associated with classroom-level assessment practices that may have a meaningful impact on

student outcomes. Thus, the first four standards were the only ones addressed in the online

professional development course and adapted CALI measure. The CALI was also selected

because it presents questions in the form of realistic vignettes that participants evaluate; rather

than measuring inert knowledge about assessment, respondents must utilize procedural and

applicable knowledge about assessment. This most closely embodies the skillset associated with

the STCEAS and assessment-literate teachers.

Sampling

Via email, I solicited participants from a nationwide population of approximately 20,000

music teachers holding NAfME membership, who presumably were technologically capable

enough to access and complete the study. I recognized that NAfME membership is not

necessarily representative of the entire music teacher profession. However, I solicited


participants from a nationwide sample in hopes that the analyses would be adequately powered

for the pretest-posttest control group design and number of variables. This study was conducted

during the COVID-19 pandemic in the spring of 2020; this may have been a significant factor in participants’ decisions not to engage with the study, or to drop out. My final participant sample comprised 43 music educators: 18 in the intervention group and 25 in the control group.

Measures

I chose to measure assessment literacy using a pre-existing quantitative questionnaire

adapted to align with the first four STCEAS and music teacher contexts. While there were other

measures and strategies available, I selected the Classroom Assessment Literacy Inventory

(CALI) specifically because of its (a) alignment to the STCEAS, to which I also aligned the

modules of the intervention; (b) use of vignettes to frame questions, thus requiring respondents

to authentically apply their knowledge; and (c) status as the measure most widely utilized by

researchers who have examined assessment literacy in inservice and preservice teacher

populations. After reviewing the literature, I also decided to construct my own instrument to

measure the frequency with which music teachers self-reported utilizing assessments in specific

forms, and for specific functions. While there are some noted flaws with collecting self-report

data about assessment practices, direct observation and computation of music teachers’ assessment practices were not viable. Finally, I utilized a pre-existing and validated measure, the

MTABI, from Austin and Russell (2017) to measure music teachers’ beliefs about the purposes

and value of assessment. This measure did impose an a priori conceptualization of assessment

beliefs, in the interest of maintaining a reasonable amount of work for intervention participants; I

did not utilize more exploratory strategies (e.g., interviews and journaling prompts).


Researcher Interest

I would be remiss if I failed to express my own interests in conducting this study. I

pursued a career in education due to the influence of my mother. She is a middle school librarian,

STEM teacher, and state assessment coordinator. She started teaching over twenty years ago,

halfway through my childhood. I assisted her during many summers: setting up her classroom,

moving desks into nonsymmetrical (to my chagrin) learning centers, alphabetizing student

folders, and making copies. Little did I know I was watching her explore the possibilities of

Deweyian learning concepts in her own teaching. She fully embraced project-based learning

early in her career, and incorporated assessment-as-learning principles into her instruction. She

was truly a model educator for assessment. And I, unknowingly, was learning from a master

teacher.

When I began my undergraduate career, Mom and I had many conversations about the

role of the teacher in directing (my words) and facilitating (her words) instruction. My decision

to become a music teacher brought me, unexpectedly, to a precipice. I had spent 15 years

watching my mother teach and assess in a student-centered way. Yet, by choosing to study

music, I was confronted with the ways that I had witnessed my former and current teachers teach

and assess, and experienced cognitive dissonance. Rehearsals had always been teacher centered.

Assessment, if it happened at all, appeared to only exist as a box to check at the end of a marking

period or semester, and usually to appease administrators. “We assess constantly,” my professors and cooperating teachers told me. True, the informal and formative strategies that I learned to name during my studies (i.e., error detection, pedagogy) are a type of assessment. But I had seen

what a master teacher can do with assessment, and how powerful it could be for student learning.

My mother was able to elevate second graders and middle schoolers.


During my first few years of teaching I, like many young teachers, reverted to the

comfortable routines of what I knew from my experiences as a music student. Rehearsals

resembled those that I had participated in as a middle school, high school, and university student.

While my students did get better at their instruments -- singing in tune, marching in step, and playing piano with appropriately raised and relaxed wrists -- I didn’t see evidence that they could carry these skills outside of my classroom without being directed. I had taught them to rely on me in order to be musical. They were terrific musicians, but their musicianship was inert.

Over the next few years I continued talking to my mother about teaching. I attended

professional development. I read about teaching music from experts, including an article on

assessment from my future advisor and mentor, James Austin. I started adapting the principles I

had witnessed in my mother’s classroom. I grew less afraid of losing control of my classroom

and began using student voice and choice to frame my rehearsals. I had students devise criteria for

evaluating performances, self-evaluate, and demonstrate growth (not just proficiency!) in their

musicianship. I used project-based learning to encourage students to arrange their own music;

some of it we even performed in concert. Student performance on my traditional paper-and-

pencil summative assessment even increased. I found students slowly taking their skills out of

the classroom. I know they had musical lives before, but now they could use the skills they had

acquired in class to enhance their musical lives out of class. By expanding my assessment palette

from the default informal strategies that I had witnessed and been taught, my classroom became

a colorful space for transformational learning.

I believe in the power of effective assessment. I know music teachers lead unappreciated

and thanklessly busy professional lives. I know that high-stakes testing and assessment have left a

sour taste in the mouths of all teachers, especially music teachers who have been robbed of


already scarce instructional and planning time. I know that professional development can often

feel like an administrative hoop to jump through, and that trends in education often change faster

than administrators change schools. I still believe in the power of effective assessment. The key

to changing teacher practice is a combination of building knowledge, building confidence, and

giving teachers space to experiment and collaborate. I was fortunate to have my mother down the

hall for the first few years of my career. I was lucky to have a master teacher as my mother. So, I

believe that I can help other music teachers see its value, too.


Chapter 2

Review of Related Literature

Music teachers’ assessment practices often reflect a lack of awareness regarding

assessment principles designed to promote and document learning in an effective manner at the

classroom level. This problem may be due, in part, to teachers’ negative experiences -- both as

educators and former students -- and their association of assessment with high-stakes,

standardized testing. Many music teachers also report having received inadequate training in

assessment as part of their teacher preparation programs, resulting in gaps in their knowledge

base and skill set (i.e., lack of assessment literacy). Researchers have found that teachers’

assessment practices may be influenced by their beliefs about assessment (Harris & Brown,

2009), including the major purposes that assessment should serve, whether assessments provide a

trustworthy basis for making educational decisions (Olsen & Buchanan, 2019), and whether

assessment is even appropriate where music making and learning are concerned (Denis, 2018).

Professional development may prove a viable avenue to educate inservice teachers about

assessment; online professional development can be highly efficacious, in part, due to the unique

delivery formats available in digital platforms. In this study, I provided a four-week online

professional development intervention to a voluntary sample of music educators. Through the

use of an intervention design, I was uniquely situated to explore the impact of the intervention on

music teachers’ assessment literacy, beliefs, and practices.

Assessment is a vast topic that scholars have examined from numerous vantage points.

To include the sum of assessment research literature in this chapter would be unfeasible.

Therefore, for the purposes of this study, I delimited the range of literature to scholarly, peer-

reviewed articles published after 1990. In addition, I delimited research in general education to


articles addressing the assessment literacy, beliefs, and practices of teachers broadly (i.e., not

content area-specific). Comparatively few music education scholars have examined the

assessment literacy, beliefs, and practices of inservice music teacher populations; therefore, I

have occasionally included articles that were topically relevant, but that described preservice

teacher populations. The body of research reviewed in this chapter was organized into three main

sections according to major research outcomes for this study: (a) assessment literacy; (b)

assessment beliefs; and (c) assessment practices. Each section includes subsections for

assessment research focused on teachers in general education and research focused on teachers in

music education, as well as a section summary.

Assessment Literacy

Since the publication of the STCEAS in 1990, researchers have developed numerous tools

to measure assessment literacy in inservice teacher populations. The STCEAS have been used as

a benchmark for establishing the content validity of a number of assessment literacy measures,

including those developed by Impara et al. (1993) and Mertler (2001, 2004, 2009), and

assessment conceptualization measures by Brown and colleagues (2004, 2006, 2011, 2012,

2015). Critics of these measures caution that the STCEAS are not inclusive of all the factors that

inform teachers’ assessment practices (Alkharusi, 2015; Brookhart, 2011; Hailaya et al., 2014;

Stiggins, 1999). In the following section, I will chronologically explore the development,

validation, and use of assessment literacy measures in both general education and music

education contexts.

Assessment Literacy of Teachers.

Impara et al. (1993) were amongst the first educational researchers to examine teachers’

assessment literacy. They developed the Teacher Assessment Literacy Questionnaire (TALQ),


and asked teachers to respond to items about both preservice and inservice assessment literacy

training, their assessment practices, their comfort level in interpreting standardized test

information, assessment preferences, and interest in increasing their assessment literacy. Using a

national sample of 555 teachers from 42 states (47% response rate), they found that teachers

harbored positive feelings about the role of classroom assessments, but less positive feelings

about standardized instruments. Impara et al. reported that 46% of teachers did not feel comfortable

interpreting information from standardized tests. While 70% of teachers reported having some

training in testing and measurement, 30% reported no training at all. Further, 59% of teachers

reported a preference for training to occur as part of their inservice experience. Perhaps most

interestingly, the researchers found that teachers who were least interested in becoming

assessment literate were also the least confident in their abilities to assess (p. 116). This study

provided researchers -- such as McMillan -- with a foundational understanding of the concepts

and issues surrounding assessment literacy, and teachers’ general beliefs about their own skills.

Using the STCEAS as a basis for measuring assessment literacy competencies, Mertler

(2004) devised and administered the Classroom Assessment Literacy Inventory (CALI) to a

sample of 67 preservice and 10 inservice teachers. The CALI consists of 35 items (5 items per

STCEAS standard, presented via vignettes in consecutive order), and was adapted from Impara et

al.’s (1993) TALQ. Mertler conducted descriptive analyses and t-test comparisons of preservice

and inservice teachers’ mean scores for each of the seven subscales, as well as the total score for

the instrument. In all cases, where there were significant differences in scores between inservice

and preservice teachers, inservice teachers scored higher. Both groups scored an average of 22

items out of 35 correctly. Mertler reflected that “traditional teacher preparation courses in

classroom assessment are not well matched with what teachers need to know for classroom


practice” (p. 60). He acknowledged that there were limitations with regard to sampling and

generalizability, and the internal consistency of the assessment literacy measure.

Building upon the 2004 study, Mertler and Campbell (2005) revised the CALI and

renamed it the Assessment Literacy Inventory (ALI), intending to evaluate preservice teacher

populations. This instrument was also based upon the STCEAS, but was organized differently

than the CALI in that there were only five vignettes (i.e., one item per standard, per vignette).

This investigation included a two-stage pilot of the ALI with 152 preservice teachers in the fall of

2003, and 249 preservice teachers in the spring of 2004. Mertler and Campbell found increased

internal consistency (α = .74) in comparison to the CALI for this population. As in the 2001

study, Mertler and Campbell found that preservice teachers answered an average of 23 out of 35

items correctly; they considered this result puzzling – and lower than expected – given preservice

teachers’ recent coursework. They concluded that “because the ALI is specifically designed to

measure the real-world application of assessment concepts and competencies outlined in The

Standards, limited familiarity and experience with the day-to-day realities of the classroom may

have precluded preservice teachers from making necessary connections” (p. 13). They also

concluded that researchers should evaluate the psychometric properties of the ALI with inservice

teacher populations.

Four years later, Mertler (2009) used the ALI with a small sample of inservice teachers. In a mixed-methods intervention study, Mertler administered the ALI as a pre- and posttest

measure to seven elementary school teachers as part of an intensive two-week professional

development inservice intended to increase assessment literacy. In addition to the ALI, Mertler

used journal prompts to corroborate quantitative results regarding perceived assessment literacy

and growth throughout the intervention. Topics in the two-week intensive included “norm- and


criterion-referenced measurements, validity and reliability of assessments, the integration of

teaching and assessment, construction and use of traditional assessments, construction and use of

authentic assessments, grading, and the interpretation of standardized test results” (p. 104).

Participants completed nine assessment tasks in the two-week intensive. Mertler found that

teachers’ ALI scores increased from pretest to posttest, but did not test whether such changes were statistically significant, given the small size of the sample (N = 7). The largest changes were for

standards 5 (“Teachers should be skilled in developing valid pupil grading procedures which use

pupil assessments”; + 2.00 out of five) and 2 (“Teachers should be skilled in developing

assessment methods appropriate for instructional decisions”; +1.86 out of five). Through the

journal prompts, teachers revealed that their assessment knowledge was initially limited, but that

the intervention was “highly beneficial to their work” (p. 111). Teachers also found that the intensive

format of the professional development was helpful because it required them to put aside time to

deliberately consider their current understanding and practices. Mertler did acknowledge the

limitations of this intervention study, including the limited sample size, and generalizability of

the results to different groups of teachers. However, Mertler concluded that “performance-based

inservice teacher training sessions, which focus on applied assessment decision-making, could

prove to be beneficial to a majority of classroom teachers” (p. 112).

One of Mertler’s recommendations for future researchers was to validate the use of the

ALI with inservice teacher populations. Hailaya et al. (2014) surveyed 582 inservice teachers

with the ALI in order to validate the instrument, using Rasch and confirmatory factor analysis

techniques. The researchers made slight adjustments to the vignettes used in the ALI, mainly to

suit prospective Tawi-Tawi and Philippine respondents and their context. They piloted the

measure with 45 elementary and secondary teachers and found acceptable measures of internal


consistency for the subscales (α = .75). Then, they administered the instrument to the selected

sample of 582 teachers (100% response rate). After removing a single item for violating

assumptions required for the Rasch model, they conducted a confirmatory factor analysis. They

found that the model had appropriate item-level fit and that “all items are appropriate in

measuring teacher assessment literacy and reflect the unitary dimension of the scale pertaining to

assessment literacy” (p. 305). With regard to how the items loaded onto the subscales (n.b., each

subscale corresponding to a standard in the STCEAS), both the Rasch and CFA results indicated

poor fit due to overlapping items. Hailaya et al. had three primary conclusions: (a) there was an

absence of hierarchy among items and factors; (b) factors and standards could not be used

interchangeably in interpreting the validity of the instrument; and (c) the ALI may have additional

factorial or structural variance across cultures. They asserted that the STCEAS standards upon

which the ALI is based may not be “sufficiently comprehensive” to assess teachers’

understanding of assessment as it relates to the “realities of the classroom” (p. 312). They also

noted that modern statistical techniques may not be appropriate for interpreting the validity of an

instrument created using classical testing theory.
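
For reference, the dichotomous Rasch model that underlies these item-level analyses expresses the probability of a correct response as a function of person ability and item difficulty. This is the standard formulation, not an equation reported by Hailaya et al.:

\[
P(x_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
\]

where \theta_i is the ability of teacher i and b_j is the difficulty of item j. Items are flagged for misfit (such as the single item Hailaya et al. removed) when their observed response patterns depart substantially from the probabilities this model predicts.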

Similarly, in 2015 Alkharusi examined the psychometric properties of the Teacher

Assessment Literacy Questionnaire (TALQ), upon which the CALI and ALI are based. Alkharusi

administered the TALQ to 259 preservice teachers enrolled in an assessment course in Oman

after translating the instrument into Arabic. He found that the TALQ items demonstrated

“acceptable levels of difficulty, discrimination, reliability, and validity” (p. 1), measured a

unitary construct of assessment literacy, and correlated positively with course scores. Construct

validity (unitary model fit) was measured using a CFA, χ²(329) = 990.762, RMSEA = .08, CFI =

.89. Alkharusi found the instrument to have high internal consistency (α = .84). He concluded


that the TALQ was a viable tool for instructional and assessment purposes in preservice teacher

populations but cautioned that it should also be validated in other countries.

Some researchers have targeted specific skills within assessment literacy in their

development of measures. Donovan, in her 2015 dissertation and subsequent 2018 article,

developed the Teachers Knowledge and Use of Data and Assessments (tKUDA) measure. The

measure specifically captures teachers’ knowledge and use of assessments and assessment data

in educational decision making. Both her dissertation and article described the psychometric

development and validation of the instrument (30 items; 15 for knowledge, and 15 for use).

Donovan used two samples to calibrate (n = 201) and validate (n = 164) the instrument. She

identified assessment knowledge and assessment use as separate constructs and subsequently

conducted Rasch analyses on each set of items separately. She found excellent model fit and

unidimensionality for both constructs. However, at the item level, several items across both

constructs did not fit the model. She did find strong evidence of internal consistency for both

constructs (knowledge, α = .95; use, α = .96). She concluded that further calibration of the items

was warranted, but that results confirmed other researchers’ findings about teachers’ ability to

analyze data from assessments, and use such information for educational decision making.

Ryan (2018) sought to create and evaluate an instrument that would measure the

assessment literacy of preservice teachers. She also investigated the relationship between

assessment literacy and confidence scores and scores on the edTPA (i.e., to assess the convergent validity of the edTPA as an indicator of preservice teachers’ assessment literacy). She adapted the Classroom

Assessment Literacy Inventory (CALI) by adding a confidence scale item after every literacy item

on the inventory. After piloting the instrument with 165 sophomores and juniors within one

teacher preparation program in the Midwest, she modified it. This initial sample


was used to evaluate the internal structure and validity of the instrument. As discussed above, the

CALI was designed in alignment with the STCEAS. Ryan used Rasch and CFA to evaluate the

internal structure and factor loading of individual items onto the standards (used as subscales for

the measure). She did not “draw any definitive conclusions in support of one internal structure,

but [the CFA] results from this study at least demonstrate that the ‘clean’ and ‘tidy’ Standards-

based conceptualization of assessment knowledge is questionable, and perfect alignment [of the

items] with the seven standards is highly improbable regardless of the sample used” (p. 244).

Ryan suggested researchers may be inclined to use parts of the CALI to assess dimensions of

assessment knowledge based upon other theoretical understandings of assessment literacy.

A subsequent sample was used to investigate the relationship between the CALI and a

teacher performance measure (i.e., the edTPA). The second sample (n = 112 seniors) was

obtained from the same program. Ryan found numerous significant relationships between preservice teachers’ cumulative GPA and their assessment knowledge, assessment confidence, and edTPA scores. Specifically, seniors’ assessment knowledge scores were moderately and positively correlated with their edTPA scores (r = .257, p = .011). Further, “the significant relationships

between cumulative GPA and edTPA total (r = .381, p < .001) and edTPA Assessment (r = .395)

[ratings] were positive and moderate to strong” (p. 232). When controlling for cumulative GPA,

the significant relationship between assessment knowledge and edTPA performance became

“nonsignificant” (p. 235). She concluded that cumulative GPA was an important predictor of

performance on “external assessment literacy tests and performance-based exams” (p. 235). She

also found that confidence had a moderating, but not predictive, effect on assessment knowledge.

Students in the secondary program reported higher confidence than early childhood program

students, but scored lower on assessment literacy items. She advised that teacher preparation


programs with high academic standards may be able to rely on their curriculum and grading

procedures to demonstrate preservice teacher assessment literacy, rather than using the edTPA.
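
Ryan’s finding that the knowledge-edTPA relationship became nonsignificant when controlling for GPA can be understood through the first-order partial correlation. The identity below is standard; Ryan’s exact value cannot be reproduced here because the knowledge-GPA correlation is not reported above:

\[
r_{KE \cdot G} = \frac{r_{KE} - r_{KG}\, r_{EG}}{\sqrt{(1 - r_{KG}^{2})(1 - r_{EG}^{2})}}
\]

where K denotes assessment knowledge, E denotes edTPA performance, and G denotes cumulative GPA. Because the GPA-edTPA correlation (r_{EG} = .381) was sizable, even a moderate positive knowledge-GPA correlation would shrink the observed r_{KE} = .257 considerably once GPA is partialed out.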

Researchers have explored relationships between teachers’ assessment beliefs (or

attitudes about assessments) and their assessment literacy (McMillan, 2000; McMillan & Nash,

2003). Quilter and Gallini (2000) devised their own instrument to measure this relationship for a

sample of 117 inservice teachers in southeastern Michigan. Like other assessment literacy

measures (e.g., TALQ, CALI, ALI, tKUDA), this instrument was designed to cover the seven

STCEAS. Quilter and Gallini adapted the TALQ, reducing it to 21 items (three per standard). They found that the average item difficulty for this sample of teachers was .68, and the

average item discrimination was .28. Thus, the measure was appropriately challenging and

discriminated well at the item level. The measure demonstrated weaker internal consistency (α =

.50) than the original TALQ, perhaps due to reducing the number of items per standard. Quilter

and Gallini found that their sample (N = 117) scored similarly to the original TALQ sample

studied by Impara et al. (1993); 91% of the sample answered correctly for the items

corresponding to the third standard, and 60% of the sample answered correctly for the items

corresponding to the second standard. Teachers’ past experiences with standardized testing and

classroom assessment were positively correlated with their current attitudes, while current

attitudes toward classroom assessment were negatively related to current attitudes toward

alternative assessments. Overall, Quilter and Gallini found that personal experiences and

attitudes did play a more important role than professional training, and that “teachers’

current attitudes toward educational practice result from a mix of affective and cognitive

variables, with more emphasis on affective variables” (p. 128).
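
For readers less familiar with the classical test theory indices reported above, conventional definitions of item difficulty, the upper-lower discrimination index, and coefficient alpha are given below. These are standard formulas, not computations drawn from Quilter and Gallini, who do not specify which discrimination index they used:

\[
p_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \qquad D_j = p_j^{U} - p_j^{L}, \qquad \alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{j=1}^{k} \sigma_j^{2}}{\sigma_X^{2}}\right)
\]

where x_{ij} is the scored (0 or 1) response of teacher i to item j, p_j^{U} and p_j^{L} are the proportions answering item j correctly within the highest- and lowest-scoring groups of respondents, k is the number of items, \sigma_j^{2} is the variance of item j, and \sigma_X^{2} is the variance of total scores. The weaker internal consistency (α = .50) relative to the full TALQ is consistent with alpha’s dependence on test length; all else being equal, a measure with fewer items will yield a lower alpha.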


Recognizing that prior affective experiences may have a greater impact than training on

teachers’ educational decision making, Fan et al. (2011) investigated the effectiveness of a web-

based assessment literacy program in enhancing secondary inservice teachers’ assessment

knowledge and perspectives. A sample of 47 secondary math and science teachers participated in

a six-week summer program. The program was unique in its ability to deliver individualized,

situated professional development to the participants. Fan et al. administered a researcher-

devised instrument, the 40-item Assessment Knowledge Test (AKT), to evaluate teachers’

understanding of assessment principles (e.g., construction of multiple-choice items, reliability

and validity, item discrimination, etc.) as described in the STCEAS standards (p. 1734). They

also administered a Survey of Assessment Perspectives (SAP) to evaluate teachers’ perspectives

toward assessment functions and procedures. They found that teacher participation in the web-

based program improved their assessment knowledge and perspectives, “especially those in the

low-level prior knowledge group” (p. 1738). They concluded that future research was needed “to

explore whether other factors might have an impact on assessment literacy for inservice teachers”

(p. 1739), particularly whether teachers’ background and instructional experiences moderate

their knowledge and use.

Assessment Literacy of Music Teachers.

One of the first investigations of assessment training for inservice music teachers was

conducted by Austin and Russell (2016). Because many states require teachers to obtain master’s

degrees for licensure after a probationary license, Austin and Russell hypothesized that graduate

programs may offer a potential access point for increasing teacher literacy. Using a researcher-

developed instrument, they surveyed faculty from 69 music schools bearing National Association

for Schools of Music (NASM) accreditation (33% response rate). They found that graduate


courses specifically focused on assessment were offered at only 58% of institutions. According

to respondents, when stand-alone assessment courses were not offered, it typically was because of

a perception that (a) such material was already adequately covered throughout other courses, (b)

a lack of instructional time and/or limited program enrollment prohibited offering an entire

course focused on assessment, and/or (c) program philosophy did not support assessment as a

major curricular strand. Within institutions that offered an assessment course, only 72% of them

required master’s students to take the course, and only 33% required doctoral students to do so.

Such courses were typically delivered in a face-to-face format, rather than an online or hybrid

format. The most important learning outcomes for assessment courses were developing rubrics,

aligning assessments to objectives, and using formative assessments and feedback to improve

student learning. While Austin and Russell examined graduate courses that may lead to increased

assessment literacy, they did not measure teachers’ assessment literacy directly.

Assessment Literacy in Summary.

Assessment literacy measurement is attributable primarily to the work of Impara et al.

(1993), Mertler (2000), and Mertler and Campbell (2005) in general education. Nearly all assessment literacy inventories (e.g., TALQ, CALI, ALI, ALICE) have been based upon the

competencies outlined in the seven STCEAS. Mertler developed two of the most frequently used

and/or adapted instruments, the CALI and ALI. In both of these measures, Mertler used vignettes

to evaluate teachers’ assessment literacy via 35 items corresponding to the STCEAS standards

(five items per standard). In the CALI, the vignettes (and items) address each STCEAS standard in order

(i.e., items 1-5 represent the first standard, 6-10 the second, and so on). In the ALI, there are five

vignettes, each with seven questions; each question represents one item from the seven

standards. Researchers examining the reliability of these measures with inservice teacher


populations have often found that literacy scores lack adequate internal consistency, and that the

evidence of construct validity corresponding to the seven standards (i.e., the STCEAS) is not

always satisfactory or does not exhibit good fit (Alkharusi, 2015; Ryan, 2018). However, the

CALI remains the most comprehensive measure of teachers’ assessment literacy available.

General education researchers are beginning to investigate the intersection of assessment

literacy and assessment beliefs, because there is evidence that teacher practice may be influenced

by teachers’ personal experiences, conceptions, confidence in their ability to execute

professional judgement related to assessments, and their valuation of assessment. Similar to

McMillan and Nash’s (2001) conceptualization of assessment practice, teachers’ assessment

literacy appears to moderate, in some fashion, their educational decision-making process and

subsequent assessment practices. Many teachers appear to know which assessments should be used and how, but are reluctant to use them in their instructional practices because of

conflicting notions about why they should assess, how assessment fits into their broader

philosophical beliefs, or the negative consequences of providing students, parents, administrators,

or other educational stakeholders assessment information that may be disappointing.

Assessment Beliefs

As previously discussed, education researchers generally use the terms beliefs, conceptions, and values interchangeably, with the understanding that beliefs encompass affective (i.e., values) and objective (i.e., conceptions of what assessment is and how it functions)

dimensions (Fulmer et al., 2015; Opre, 2015). Thus, dimensions of belief include personal

meanings derived from experience, abstract mental images, feelings, forms of knowledge, rules,

and preferences (Box et al., 2015; Opre, 2015). Researchers who have studied assessment

literacy and teacher assessment practices generally have found that internally constructed beliefs


about assessment may moderate the influence of assessment literacy on teacher practices (Brown

& Michaelides, 2011; Deneen & Brown, 2016; Ludwig, 2013; Nyberg, 2016). Further,

researchers have found that teacher beliefs about assessment tend to be reinforced by personal

beliefs about other educational matters (e.g., curriculum, roles of the students and teachers,

content area, etc.) and assessment usage. That is, teachers’ assessment beliefs become reified

over time by their unique contexts and experiences, and reinforce practices that reflect such

beliefs (Box et al., 2015). Most researchers have only addressed teacher beliefs within one

dimension of assessment (i.e., formative or summative, informal or formal). In this section, I will

first review the literature in general education surrounding teacher assessment beliefs. Then, I

will describe emerging research in music education on this topic. Finally, I will summarize the

most significant findings within this body of research.

Assessment Beliefs of Teachers.

In 2000, McMillan and Nash conducted a qualitative inquiry with 24 elementary and

secondary English and mathematics teachers selected for maximum variation. The purpose of

this study was to explore the reasons teachers give for their assessment and grading practices, as

well as the factors that influenced such decisions. Participant interviews were transcribed and

coded eclectically (i.e., deductively and inductively). The final emergent themes comprised

McMillan’s (2003) conceptual model of teachers’ educational decision making (Figure 1.3, p.

41): this model included teacher beliefs and values, classroom realities, external factors, decision

making rationale, and assessment and grading practices. McMillan and Nash found that “the

most salient internal factor that appears to influence teacher decision making concerning

classroom assessment and grading practices is the teacher’s philosophy of teaching and

learning,” explained as “assessment and grading practices are whatever will best serve the


purposes that are linked to a larger, more encompassing philosophy of education” (p. 10).

Scholars, to date, have not expanded upon the role teachers’ beliefs may play in assessment and

grading practices.

Gavin Brown, in partnership with other researchers, has established an extensive line of

research addressing teachers’ beliefs and their influence on subsequent assessment behaviors.

Brown (2006) devised the Teachers’ Conceptions of Assessment (TCoA) measure as part of his

dissertation. A subsequent article summarized the psychometric properties of an abridged

instrument. The TCoA - III consisted of 27 (instead of 50) statements using a positively-packed

response scale derived from a previous iteration of the TCoA (n.b., two negative options, and

four positive agreement options). The statements addressed the four main purposes of

assessment; (a) assessment makes schools accountable, (b) assessment makes students

accountable, (c) assessment improves education, and (d) assessment is irrelevant. These were

derived from earlier multi-level and multifactorial models of teacher conceptions as developed

by Brown. The measure was administered to a sample of 692 teachers from Queensland. Brown

used CFA to establish construct validity (χ²(311) = 1492.61, p < .001, RMSEA = .074, TLI = .80). Items loaded well on each of the first- and second-order factors in the sample. He

concluded that the abridged version of the TCoA was more efficient than the original, while

providing similar quality information.
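For readers unfamiliar with these indices, the RMSEA value Brown reported can be recovered from the chi-square statistic under one common formula (a worked illustration using the figures above, not a computation Brown himself reports):

\[
\textit{RMSEA} = \sqrt{\frac{\max(\chi^2 - df,\; 0)}{df\,(N - 1)}} = \sqrt{\frac{1492.61 - 311}{311 \times 691}} \approx .074
\]

By conventional benchmarks (values at or below roughly .06 suggesting good fit), an RMSEA of .074 and a TLI of .80 indicate only modest fit.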

In 2009, Harris & Brown used phenomenographic approaches to explore the purposes 26

New Zealand teachers ascribed to assessment. Despite Brown’s prior research surrounding this

topic, and development of the TCoA and its related forms, Harris and Brown argued that the

instrument may be limited because it only accounts for four possible purposes of assessment.

They utilized a phenomenographic approach because “it is based on the assumption that people


hold multiple, and at times, contradictory conceptions within their frame of reference, making it

impossible to claim that any particular participant ‘holds’ just one specific conception” (p. 367).

Participants were selected for maximum variation and interviewed with a semi-structured

protocol. Data were analyzed in multiple steps. First, preconceived ideas were bracketed or

excluded to reduce researcher subjectivity. Data were coded inductively, systematically and

iteratively within and across cases. Categories were formed by grouping statements based upon their frequency, position, and pregnancy (i.e., how meaning-laden a statement was). Subsequently, passages were grouped to create “pools

of meaning” (p. 368). Harris and Brown identified seven major purposes of assessment: (1)

compliance, (2) external reporting, (3) reporting to parents, (4) extrinsically motivating students,

(5) facilitating group instruction, (6) teacher use for individualizing learning, (7) joint teacher

and student use for individualized learning. These findings align with other researchers’ reports

that teachers’ educational decision-making is a complex interaction of internal and external

factors.

Remesal (2011) also conducted a qualitative inquiry into teachers’ assessment beliefs to

construct a new model of conceptions of assessment. She used two sequential interview

techniques with 50 primary and secondary math teachers in Spain. First, participants were

interviewed with a semi-structured protocol. Then, one month later, participants were asked to

provide examples of typical classroom assessment material and interviewed using a “critical

event recall” protocol. In this way, the artifacts would serve as a point of triangulation for the

teachers as they referenced their rationale for constructing, implementing, scoring, and

interpreting data. Remesal identified four purposes for assessments spanning a continuum from

pedagogical concerns to accountability concerns. The four purposes that operated along this

continuum were the effect of assessment on (a) learning, (b) teaching, (c) the certification of


learning, and (d) the accountability of teaching. Remesal argued that her findings demonstrated

the complexity of school assessment, and the limitations of current analyses (e.g. dichotomous

distinctions between purposes, or static categorizations) in understanding teachers’ rationales.

Segers and Tillema (2011) used Brown’s abridged instrument, the TCOA-III, to evaluate

Dutch secondary school teachers’ (n = 351) and students’ (n = 712) conceptions of the purposes

of assessment. Maximum likelihood factor analysis (varimax rotation) of the teacher data initially yielded nine factors accounting for 48.8% of the variance, but subsequent models had inappropriate cross-loadings of items between factors. The final four-factor solution explained 34%

of the variance, and included formative purposes for assessment (19.5%), school accountability

purposes (6.3%), perceptions that assessments are irrelevant or inaccurate (4.6%), and

perceptions that assessments have reliability and validity (3.6%). Segers and Tillema also used

Maximum Likelihood factor analysis (varimax rotation) with the student data, and initially

reported a seven-factor solution accounting for 55.4% of the variance. This model had similarly inappropriate cross-loadings of items between factors. The final five-factor solution accounted

for 46.2% of the variance. The factors included “supports learning” (18.6%), “student

accountability” (8.8%), “the experience of assessment as enjoyable” (6.9%), “the positive effect

of assessment on the supportive and collaborative climate in the class” (6.5%), and “school

accountability” purposes (5.3%) (p. 51). Segers and Tillema framed their conclusions within the

larger political environment of Dutch education; that is, policy and education experts had been

endorsing a shift from using assessment for summative and accountability purposes (i.e.,

assessment of learning) toward formative and instructional decision-making purposes (i.e.,

assessment for learning). They concluded that, in this sample, students were closer to embracing

this perspective than teachers.
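As a point of reference for the variance figures reported here and elsewhere in this section: under the usual convention that each standardized item contributes one unit of variance, the proportion of variance a factor explains is its eigenvalue divided by the number of items,

\[
\text{proportion}_j = \frac{\lambda_j}{p}
\]

so, as an illustrative back-calculation (not a figure Segers and Tillema report), a factor explaining 19.5% of the variance in the 27-item TCoA-III would correspond to an eigenvalue of roughly .195 × 27 ≈ 5.3.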


In 2011, Brown et al. developed a new self-reporting inventory to examine Hong Kong

and Chinese teachers’ assessment beliefs. This measure, the Chinese-Teachers’ Conceptions of

Assessment (C-TCoA), introduced two new constructs (“development” and “control”), and was

translated into Cantonese and Putonghua. Modifications to each form were made to obtain a

natural and appropriate flow in each language. Participants were purposively sampled, as random

sampling has been shown to be ineffective in Chinese contexts (p. 310). In a Cantonese region of

China, the researchers obtained 1014 survey responses (69% response rate), and in Guangzhou,

they obtained 898 survey responses (80%). They used EFA and CFA to develop a well-fitting

model that explained Chinese teacher responses to the instrument. Brown et al. found a seven-factor model with acceptable fit (χ²(414) = 3479.15, RMSEA = .062, CFI = .87). Upon closer inspection of factor intercorrelations, however, the researchers determined the model might require a second-order

structure. A second, hierarchical model with three intercorrelated factors had worse, but still acceptable, fit (χ²(426) = 3856.94, p < .001, RMSEA = .065, CFI = .85). They also tested

the invariance of the model with both groups (i.e., Hong Kong and Guangzhou), and found

statistically significant differences in the way the two groups answered the instrument. They

concluded that the new inventory — adapted specifically for Chinese contexts — was a valid

tool to measure Chinese teachers’ assessment beliefs. The role of accountability and control was

found to be important to this population.

In another investigation, Brown et al. (2011) administered the TCoA - IIIA to a sample of

1525 teachers (47.3% response rate) in Queensland, Australia. This instrument differs from the TCoA-III in that two of the purposes are modeled hierarchically: “improvement” contains four sub-factors, each with three items, and “irrelevant” has

three sub-factors each with three items. Brown et al. found that the CFA model for Queensland


teachers was inadmissible; that is, it did not fit the data. Because the model, like all CFA models, was constructed from a priori theoretical hypotheses, the research team speculated that it might

be necessary to consider alternative models with different paths. They also theorized that teacher

characteristics (e.g. primary or secondary) may not be invariant. Subsequently, they specified

models for primary and secondary teachers. They found, after also introducing two new paths

between first-order factors, that the respecified model had considerably improved fit (χ²(309) = 2741.56, p < .04, RMSEA = .05, CFI = .846); that is, primary and secondary teachers responded to the instrument differently. The researchers used MANOVA to investigate whether this

difference was significant, and found that the groups were significantly different, but the effect

size was small to moderate. Brown et al. concluded that the quality of the instrument was “such

that differences between groups were most likely due to real population differences rather than

chance artefacts in responding to the questionnaire” (p. 218). This was important, because

subsequent research by Brown and others would be conducted on the basis that differences

between cultural groups needed to be measured. With regards to policy and teachers’ assessment

conceptions, Brown et al. argued their results demonstrate that “policy makers, professional

developers, teacher educators, and administrators may have failed to persuade teachers that the

currently available assessment systems provide informative, valid, and improving techniques” (p.

218).

Brown et al. (2012) sought to determine if teachers’ understandings of feedback (n.b.,

“understandings” as the analog for beliefs, feedback as the analog for formative assessment)

influenced the type and quality of feedback provided to students. Using a self-devised

instrument, the Teachers’ Conceptions of Feedback Inventory (TCoF), the researchers solicited a

sample of 518 New Zealand teachers to respond to six-point, positively packed agreement scales


corresponding to assessment purpose and task statements. The researchers found, using structural

equation modeling, that teacher beliefs about formative feedback were influenced by “feedback

practices that they control (i.e., teacher-centric) [that] are used for the explicit purpose of

improving the quality of student learning outcomes” (p. 974). They also found that “teachers’

understanding of feedback as a set of practices in which they are not involved was predicated by

an understanding that feedback requires self- and peer-interaction” (p. 947); teachers appeared to

believe assessment was something done to students rather than something done with students.

Allal (2013) investigated Swiss teachers’ assessment beliefs through the lens of

socially-situated practice by conducting two interviews with each of 10 sixth-grade teachers.

Prior to each interview, teachers were asked to select two students “for whom the teacher had

hesitated when completing the students’ report card between the grades of 3 (‘objectives nearly

attained’) and 4 (‘objectives obtained’)” (p. 25). Allal’s rationale for focusing on this scenario

was that teachers would best be able to display evidence of professional judgement, and that

requiring teachers to bring materials to the interview would help them re-enact their practice and

reasoning in authentic ways. Data were coded inductively. The major themes were “professional

judgement as a cognitive act” and “professional judgement as a socially situated practice” (p.

27). Allal found that teachers relied on informal observation, more often than on any other appraisal technique, as the basis for making final decisions in these scenarios. Teacher participants frequently articulated that they made these decisions using the “sum” of students’ efforts within classes but were not often able to point to specific examples of student work. Allal concluded

that future efforts to enhance teachers’ practices and professional judgement should focus on their ability to document and communicate, coherently and transparently, the rationale for their grading procedures.


Azis (2015) used an explanatory mixed methods design to investigate teachers’

assessment beliefs and how such beliefs related to their practices. Using a sample of 107 middle

school English teachers, Azis administered Brown et al.'s (2010) Teacher Conceptions of

Assessment (TCoA). This measure consisted of several subscales measuring teacher agreement,

indicated via 5-point scales, with items representing three dimensions of assessment function (as

cited in Brown et al., 2009) -- improvement, accountability, and irrelevance. Azis found the

highest levels of agreement with items related to the improvement dimension, closely followed

by accountability. In the follow-up qualitative phase, Azis interviewed four teachers who agreed

strongly with the improvement dimension. She found that teachers in this group “believed the

main purpose of assessment was to inform teaching” (p. 138), and that, subsequently, they

favored teacher-constructed tools and authentic tasks. The participants also felt external or

standardized assessments reduced their autonomy, had a negative impact on the equity of their

students, and were not credible (p. 143). In total, Azis found that teachers’ patterns of TCoA responses were inconsistent with their practices; that is, teachers’ beliefs about the purposes of

assessment contradicted their actual practices.

Hidri (2015) explored assessment beliefs in Tunisian secondary school (n = 336) and university (n = 336) teachers using Brown’s (2006) TCoA-III measure. Subsequent analyses were

carried out in four phases: EFA, PCA, dimension analysis, and CFA. Hidri used EFA to

investigate the possible factor structure. She used Monte Carlo PCA parallel analysis -- with the simulated sample set at 100 cases, the number of components set at two, and the desired percentile set at 95% -- to determine the statistically significant eigenvalues of each factor based on random data

generation. Then, she used dimension analysis to estimate the correct number of factors; this

produced different results from phase one to phase two. Finally, Hidri used CFA to investigate


the relationship paths between variables, to test data fit, and to check indicators’ moderating influence on factors. Hidri found discrepancies among the methods of analysis. Data were not

separated (i.e., secondary and university teachers), which Hidri acknowledged may have created

issues for fitting models to (what may have been) data that were not invariant. With regards to

the results of the TCoA-III, Hidri concluded that “conflicting conceptions of assessment might also impact teachers’ practices,” and that, despite finding that her models diverged from

previous studies by Brown, there was a strong relationship between Tunisian teachers’

conceptions of assessment purposes for accountability and for improvement.
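The Monte Carlo step Hidri describes is a form of Horn's parallel analysis: factors are retained only when their observed eigenvalues exceed a chosen percentile of eigenvalues computed from random data of the same dimensions. A minimal sketch of that logic in Python (illustrative function and parameter defaults; not the software Hidri actually used):

import numpy as np

def parallel_analysis(data, n_iter=1000, percentile=95, seed=0):
    # data: cases-by-items array of responses. Keep components whose
    # observed eigenvalue exceeds the chosen percentile of eigenvalues
    # from random normal data of the same shape.
    rng = np.random.default_rng(seed)
    n_cases, n_items = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, n_items))
    for i in range(n_iter):
        noise = rng.standard_normal((n_cases, n_items))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    return int(np.sum(observed > threshold))  # number of factors to retain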

Ludwig (2013) investigated the relationship between 160 upstate New York public

school teachers’ conceptions of assessment and their confidence in their assessment knowledge.

She used Brown’s (2006) TCoA-III to evaluate teachers’ assessment beliefs, and Arter and Busick’s (2001) Classroom Assessment Confidence Questionnaire (CACQ) to evaluate their confidence. She found that teachers’ collective agreement was greatest for the student accountability purpose of assessment, but participants tended to rate themselves as most confident in effectively

communicating assessment results. Ludwig found a statistically significant difference in

confidence scores for teachers with varying types of assessment training. She concluded that

teachers should be provided with opportunities to (a) reflect on their assessment beliefs, (b)

collaborate with peers, and (c) enhance their assessment literacy, especially with regards to

improving student involvement in assessment.

Brown et al. (2015) examined teachers’ assessment beliefs in an Indian context, because

teachers are the main agents of educational reform in Indian schools. The researchers

hypothesized that teachers’ beliefs about the purposes of assessment would predict their

assessment practices, and that responses would differ between external (i.e. government or


publisher created) and internal (i.e. teacher constructed) assessments, with preference given to

formative and diagnostic functions. Teachers were assigned a modified TCoA-III instrument

and answered questions about either internal (n = 603) or external (n = 649) assessments. All

teachers also took the Practices of Assessment Inventory (PrAI), a 32-item inventory developed in

Hong Kong to identify teacher agreement with assessment practices for specific purposes. As in

prior studies, Brown et al. used CFA to test the fit of a set of pathways within and among factors,

and to establish construct validity. The first nine-factor model was rejected because of negative

error variance in three first-order factors. The final four-factor model included the following

factors: improvement, irrelevance, control, and school quality accountability (χ²(293) = 2254.88,

RMSEA = .06, CFI = .82); groups were structurally invariant. For the PrAI, the researchers found

a four-factor solution using 29 of the 32 items with good fit (χ²(269) = 1887.16, RMSEA = .06,

CFI = .88). With regard to teachers’ conceptions and beliefs about the purposes of assessment,

researchers found that “regardless of internal or external conditions [they] still see assessment

predominantly around improving student learning by teaching for exams” (p. 59). The

researchers recommended that new resources and programs be developed to support teachers’

formative assessment practices. They lamented that “simply put, it seems that teachers in quite

diverse contexts believe a good school’s effect is seen on better examination performance”

(p. 60).
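For readers who wish to see what such model testing looks like in practice, a four-factor CFA of the kind Brown et al. describe might be specified as follows. This is a rough workflow sketch in Python's open-source semopy package, which uses lavaan-style syntax; the item names and data file are hypothetical placeholders, and the original analyses were not necessarily conducted with this software:

import pandas as pd
import semopy

# Hypothetical four-factor measurement model mirroring the final
# solution described above; imp1...acc3 are placeholder item names.
MODEL_DESC = """
improvement =~ imp1 + imp2 + imp3
irrelevance =~ irr1 + irr2 + irr3
control =~ con1 + con2 + con3
accountability =~ acc1 + acc2 + acc3
"""

data = pd.read_csv("tcoa_item_responses.csv")  # one column per item
model = semopy.Model(MODEL_DESC)
model.fit(data)
# calc_stats reports fit indices (chi-square, RMSEA, CFI, TLI, etc.)
print(semopy.calc_stats(model).T)

A rejected model, such as the nine-factor solution with negative error variances, would surface at this stage as an improper solution rather than merely as poor fit indices.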

Barnes et al. (2017) are among the few, and the most recent, researchers to examine U.S. teachers’

assessment beliefs. Using a sample of 179 participants from the northeast region of the United

States, researchers administered Brown’s abridged (2006) TCoA - III. They conducted EFA and

found that Brown’s four-factor model did not fit their data. They then used principal axis

factoring with promax rotation to identify a factor structure that best fit the data. They found a


three-factor solution that accounted for 51.54% of the variance. They called the three factors

“assessment as valid for accountability” (10 items), “assessment improves teaching and learning”

(6 items), and “assessment as irrelevant” (8 items); thus, they merged the formerly separate

accountability factors from the school and teacher level. Barnes et al. theorized that “the items

considered to tap assessment for student accountability may be seen by teachers as a middle-

ground between assessment for teaching and learning and the more extreme accountability end

of assessment that is present in many U.S. teaching contexts” (p. 114). Like many researchers

before them, they also urged future researchers to “tease out how these conceptions of

assessment influence practice” (p. 115).

Most recently, Olsen and Buchanan (2019) conducted an inductive, multiple case study to

explore secondary teachers’ grading beliefs and practices. After reviewing many of the studies

discussed previously in this chapter, Olsen and Buchanan argued that modern approaches to

grading (e.g. standards-based, four-point systems, software packages or other technology)

necessitated an inductive examination of teachers’ grading practices. They collected data from

15 teachers and 2 principals in two New York schools undergoing a year-long professional

development program targeting grading practices. Data consisted of observations of professional

development meetings, semi-structured interviews with individual teachers throughout the year,

and grading documents teachers produced.

Olsen and Buchanan (2019) found evidence that teachers held seemingly contradictory

beliefs about the purposes of grading (i.e., a hodgepodge of academic and nonacademic factors),

reported a lack of professional development around grading for teachers (i.e., limited preservice

and inservice training about sound grading practices), expressed conflicting messages about the

purposes of schooling, and frequently adapted grading approaches to fit their classroom realities.


In interviews, teachers also noted the influence of their own personal experiences with grades as

students and teachers, and their desire to see students be successful (i.e., teachers felt guilty when

students demonstrated low achievement). Many teachers commented that grading practices

endorsed by the professional development program failed to capture the myriad ways in which

students learn or demonstrate growth. That is, they felt using strictly academic criteria was unfair

to a segment of students who put forth appropriate effort, were compliant, or showed growth but

not proficiency in a given skillset. Overall, Olsen and Buchanan noted that teachers’ grading

practices evolved slowly and recursively. They concluded that in order for such professional

development to be effective, schools or districts needed complete teacher buy-in.

...of Music Teachers.

Music education researchers have only recently begun examining assessment beliefs,

largely to determine if they have an impact upon future preservice and inservice teacher

assessment practices. Leong (2014) examined Singaporean music teachers’ conceptions of

classroom assessment in an attempt to “unravel the multiplicity of significant relationships of

concepts within the specific context of music teachers’ classroom decision-making

environments” (p. 454). Through the theoretical lens of Alexander’s (1992) conceptions of music

classroom assessment, Leong examined the complex intersection of music teachers’ values,

beliefs, needs, required professional duties, and knowledge about assessment. Leong utilized Q-methodology, a textual analysis conducted with computer programs such as PQMethod and PCQ, to distinguish the components of 30 teachers’ conceptions about assessment. These programs organize text into factors, much as continuous data can be organized by an EFA. The teachers were from primary, secondary, university, and department of education

contexts. The textual data were extracted and organized into core thematic statements and loaded


onto four factors: efficient, evolving, embedded, and empirical. Several statements were

represented across all factors (e.g. “I believe in making an effort to select/design the most

appropriate assessment task for my students” and “I believe there is always room for improvement when it comes to assessment”) (p. 461). The factors did not load uniformly

across participant characteristics; that is, one factor was generated from exclusively primary

teachers, while another was generated by only university professors and department of education

staff. Leong concluded that:

“classroom assessment, like many aspects of classroom teaching and learning is not a

stable entity. Rather it is highly variable, contested, and irreducibly situated in a specific

context. The different conceptions of what classroom assessment practice entails suggest

there are many, often conflicting mediating influences with which teachers need to

grapple” (p. 464).

Leong’s conclusion gives credence to the notion that assessment beliefs are a complex

interaction of teachers’ feelings, experiences, needs, and professional obligations.

Nyberg (2016) used participatory action research as a method to explore secondary music

teachers' conceptualizations of musical knowledge, learning, and communication (i.e.,

communicating learning outcomes) in a Swedish school. Nyberg, two administrative music

teachers, and five classroom music teachers formed a core group at a secondary school with the

intention to “increas[e] students’ goal-related achievements” and “develop practices through a

municipal drive for funding of research and development projects...on assessment of musical

knowledge and learning” (p. 244). In such a qualitative approach the participants are also

collaborators in investigating a problem. In this study, the core group met eight times to

determine what would be assessed and how, and to work through their conceptions of what


assessment should look like. The meetings were recorded and converted into transcriptions and field notes. Additionally, participants maintained journals, which were subsequently used to

begin discussions. Nyberg found that teachers’ conceptions of assessment were, in part,

influenced by how they conceived of learning; that is, it was typically more difficult to create

assessments for knowledge or skills that teachers perceived as “holistic.” While many of

Nyberg’s findings reflect the characteristics and context of Swedish schools, stakeholder demands for accountability and transparency of assessment data are reminiscent of the United States in the early 1990s.

Austin and Russell (2017) examined the role assessment beliefs and occupational identity

may play in accounting for inservice music teachers’ assessment practices. They surveyed over

9000 secondary music teachers in the United States, with 423 providing complete and usable

data (6% response rate; sampling error of +/- 5%). The researcher-developed questionnaire

included a mix of selection-type and rating scale items that respondents used to describe their

assessment and grading criteria and practices, and a separate section of rating scale items from Brown’s (2006) instrument, which teachers used to report their beliefs about assessment. Items addressing beliefs corresponded to functions of assessment (formative, 7 items, α = .87; summative, 11 items, α = .87; accountability, 7 items, α = .91) and beliefs about the value of assessment (positive valence, 9 items, α = .91; negative valence, 8 items, α = .88). To measure role identity, Austin and Russell

used 6-point scales for respondents to self-report their degree of agreement with six items

corresponding to teacher (2 items) or performer (4 items) identities. Finally, respondents were also asked to

report the weighting of grading criteria representing performance skills, attitude, attendance, and

musical knowledge. Austin and Russell found that teachers who valued assessment more were

more likely to target musicianship outcomes in their grading practices and identify with the


teacher occupational identity, while teachers who devalued assessment were more likely to target

extramusical (i.e., behavioral) outcomes.
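The α values reported above are Cronbach's alpha coefficients, a standard internal-consistency index rather than anything particular to Austin and Russell's instrument; for a scale of k items,

\[
\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
\]

where \(\sigma_i^2\) is the variance of item i and \(\sigma_X^2\) is the variance of the total score. Values in the high .80s and low .90s, as reported here, are conventionally read as strong internal consistency.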

In 2019 Austin & Russell investigated preservice teachers’ conceptions of assessment

and projected assessment practices. Using a researcher-developed instrument, they surveyed 75

music education majors from eight institutions across four regions in the U.S. They found

respondents favored the formative functions of assessment over summative and accountability

functions. Students who reported receiving greater amounts of training in assessment valued

assessment more than less-trained peers, and were more confident in their assessment abilities.

Most participants reported two or fewer music education class sessions devoted to assessment

topics, yet one-third felt “very or extremely” confident in their ability to assess future students.

Austin and Russell’s investigation thus explored both the function and value conceptions of assessment.

...in Summary.

Assessment beliefs are a complex, intersectional sum of teachers’ experiences as teachers

and students, philosophical beliefs about teaching and learning, and beliefs about the purposes of

assessment. General education researchers have been strongly influenced by the work of Brown

(2006) and his colleagues, especially given his prolific testing of the TCoA inventory in contexts

worldwide, from Tunisia to India and China. The TCoA, and its many iterations, are based upon

Brown’s (2006) finding that teachers’ assessment beliefs were centered around (a) accountability

at the school level, (b) accountability at the classroom level, (c) formative feedback, and/or (d)

perceptions of irrelevance. The TCoA measures use this a priori classification of beliefs, which

may not be appropriate for all teaching populations or contexts (Brown & Gao, 2015). Few

researchers have explored teachers’ assessment beliefs in the United States. Barnes et al. (2017)


recently used the TCoA with a population of U.S. teachers, but found that Brown’s a priori

classification condensed to three factors, and demonstrated poor fit. As a result, some researchers

have used exploratory or inductive methodologies to investigate teachers’ assessment beliefs

(Leong, 2014; Remesal, 2011). Music education researchers have only begun to examine music

teachers’ assessment beliefs and how they may intersect with assessment practice. Little is

known about music teachers’ assessment beliefs, especially in the United States; thus, future

investigations of the dimensions of assessment belief, and development of a reliable instrument

to capture music teachers’ beliefs, are warranted.

Assessment Practices

The decisions teachers make about when and how to employ assessment in their

classrooms comprise only a small portion of the professional judgements made in a given day;

planning and preparing learning activities, implementing lessons, managing student behavior,

communicating with parents, other teachers, and administrators, and providing additional support

or assistance to students are the bulk of the activities in which teachers engage. As discussed in

Chapter 1, teachers’ assessment practices appear to lag behind best practices associated with

reliable, valid, fair, and useful assessment, but those practices also reflect broader societal and

policy trends related to education reform. In the following section, I discuss specific literature

pertaining to assessment practices employed by teachers (i.e., classroom specialists, not music

teachers) and teachers’ perceptions associated with specific practices, and then review

assessment practice research involving music educators; within each section, studies will be

presented chronologically based on year of publication.


...of Teachers.

In 1995, Oosterhof et al. conducted an observational study of 15 Floridian public-school

teachers to explore current classroom assessment practices. Two teachers were from elementary

schools, seven from middle schools, and six from high schools. Teachers were selected for

maximum variation. Each teacher was observed for five consecutive teaching days in order to

“gain insight into a number of teacher and student behaviors, perceptions, and strategies that

perhaps go unnoticed when shorter periods are involved” (p. 2). Observations were documented

as 10-minute interval estimates of the amount of time that teachers and students engaged in (a)

formal assessment, (b) informal assessment, (c) integrated assessment and instruction, (d) other

on-task activity, and (e) off-task activity. Observations were also recorded in field notes.

Oosterhof et al. found -- consistently across classrooms -- that teachers primarily engaged in

informal assessment and integrated assessment and instruction techniques (e.g., “show of hands,” questioning techniques) with students. Oosterhof et al. reported that “one of the more surprising

findings” was the “limited or non-existent time teachers have for developing assessment skills”

(p. 12). They concluded that “if practicing teachers' abilities with critical measurement skills are

less than acceptable, perhaps we need to give careful consideration to selecting a subset of skills

with which we will train teachers well” (p. 12).

Building upon earlier work, Zhang (1996) administered his Assessment Practices

Inventory (API) to 311 inservice teachers to determine the hierarchy of teacher assessment

competencies -- as outlined in the STCEAS standards -- using Rasch analysis. The API consisted

of 67 items describing specific classroom assessment practices, to which teachers responded

using a 5-point scale from “not at all skilled” to “highly skilled.” Zhang used principal component analysis and determined that teachers’ responses were organized into six components.


Subsequently, the logit of the items belonging to each component was calculated. The result of

these analyses was a list of the assessment practices perceived to be most difficult. Of the six

categories -- relative to the STCEAS standards -- interpreting standardized test results,

conducting classroom statistics, and using assessment results in decision making were perceived

to be the most difficult practices by inservice teachers. Communicating assessment results was

perceived to be the easiest. Non-academic grading practices (i.e., participation, attendance, effort) had the second-highest logit score, meaning that teachers perceived these practices to be

relatively easy. Zhang concluded that these findings confirmed other researchers’ assertions that

teachers were unskilled in developing, implementing, and interpreting assessment data.
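For context, the logit metric in Zhang's Rasch analysis places teachers and assessment practices on a common log-odds scale. In the basic dichotomous form of the model (a standard formulation; Zhang's analysis used polytomous self-ratings), the probability that a teacher of ability θ endorses an item of location b is

\[
P(X = 1 \mid \theta, b) = \frac{e^{\theta - b}}{1 + e^{\theta - b}}
\]

so the item logits Zhang reports order the assessment practices by how readily teachers endorsed feeling skilled in them.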

As standardized testing became a regular feature in classrooms of the late 1990s,

researchers began measuring teachers’ perceptions of standardized testing and the relevance of

standardized testing to their assessment and instructional practice. Goldberg and Roswell (1998)

gathered instructional materials from Maryland teachers who had previously scored the

Maryland School Performance Assessment Program (MSPAP) state tests, and subsequently

surveyed 50 Charles County teachers about the impact of scoring MSPAP on their teaching and

perceptions of how the MSPAP “integrated into their own and their colleagues' instructional and

classroom assessment practices” (p. 5). They also interviewed 12 Charles County teachers with

more than one year of scoring experience. They found that teachers believed the MSPAP scoring

experience had either improved or had the potential to improve their teaching and assessment

practices. The two significant themes from the interviews with teachers corroborated the

questionnaire results; namely, “scoring was such a valuable experience that it would be ideal if

every teacher and administrator could score”, and “scoring gives you the ‘big picture’ and serves

as a ‘wake up call’” (p. 14). Goldberg and Roswell also found that teachers’ instructional materials


became grounded in “plausible, real-life situations, problems, issues, or decisions, and are

comprised of a series for which the purposes are clear and authentic” (p. 20) after scoring

experience. Teachers also became more effective in aligning instructional materials to state

standards and learning indicators. Goldberg and Roswell surmised that professional development

approximating the experiences teachers had scoring the MSPAP was imperative to enhancing

teacher assessment practice.

As education reform efforts turned toward accountability, state departments of education

and policy organizations began developing rubrics and other methods of evaluating teacher

assessment practices. Aschbacher (1999) developed an evaluation framework and rubrics based

upon the participation of 24 third- and seventh-grade teachers. Teachers were asked to submit

exemplars of typical classroom assignments and student work in a binder. Collectively, the

teacher participants contributed 136 teacher-constructed assignments, with four pieces of student

work for each assignment. Aschbacher rated the assignments using five 4-point scales addressing the cognitive demand of the task, the clarity of grading, the alignment of the task with

learning goals, the alignment of grading criteria with learning goals, and the overall task quality.

Aschbacher found that the “vast majority of assignments collected for this study at both

elementary and middle schools made relatively low-level cognitive demands on students” (p.

26). She also found that students encountered “coherent assignments less than half the time

based on assignments submitted,” particularly at the middle school level (p. 30). That is, they

were not typically aligned to teachers’ goals, nor were grading practices. She also found that

students received no feedback on over one-third of the assignments, and that teachers’

perceptions about what constitutes high- or low-quality work had a “low to moderate”

correlation to the raters’ appraisal of the student work examples. Aschbacher concluded that her


“approach to measuring classroom practice through ratings of a sample of assignments shows

promise in its capacity to describe several important aspects of the classroom learning

environment… and in suggesting areas for administrative attention, professional development,

and teacher reflection” (p. 41). A notable implication of Aschbacher’s study is the potential value

of amassing teacher-constructed materials and student work exemplars in appraisals of teachers’

instructional effectiveness and the integrity of their assessment practices.

In 2000, Mertler conducted a study of teachers' assessment practices and literacy that would subsequently inform this line of research. The purpose of this descriptive study was to

explore the assessment practices of teachers in Ohio; specifically, the methods used to ensure

construct validity and reliability of classroom assessments. Using a stratified random sample of

K-12 teachers, Mertler developed and administered the Ohio Teacher Assessment Practices

Survey to 625 teachers. Respondents were asked to list specific steps used to ensure that their

assessments aligned to objectives (i.e., validity), and how often they used these procedures on a

five-point scale (1 = never to 5 = always). Respondents were asked the same question about

steps used to ensure assessments yielded consistent student scores (i.e., reliability). Responses

were coded into six categories. More than half of the responses involved teacher-constructed tests. The remaining categories included “compare to objectives,” “analysis of test data,” “I don’t determine validity,” “asking for student feedback,” and miscellaneous (p. 32). With regard to preparation, only 13% of respondents indicated that they felt “well prepared” to assess student

learning (p. 34). Mertler concluded that his findings confirmed those of earlier research implying

that teachers were unprepared to conduct and interpret assessment data.

McMillan (2001) surveyed 1,483 teachers (65% response rate) in seven urban Virginia school districts

about their assessment and grading practices. To document various assessment and grading


practices, McMillan used closed-item 6-point frequency scales. Using principal component

analysis, McMillan found that teachers utilized a “hodgepodge of factors” to determine grades:

academic achievement, “academic enablers” (i.e., non-academic factors such as effort, ability,

attendance, and participation), use of external benchmarks, and the use of extra credit

assignments or tasks (p. 28). He also found significant differences between teachers of different

subjects; math teachers, for example, were less likely to use academic enablers than English or

social studies teachers. Perhaps alluding to his work with Nash in developing a conceptual model

of educational decision making, McMillan concluded that teachers’ assessment practices are

inexorably connected to other demands influencing teachers’ educational decision making,

including their desire to document learning and motivate students.

McMillan et al. (2002) subsequently examined the assessment and grading practices of

921 elementary teachers in Richmond, Virginia public schools (58% response rate). Mirroring

McMillan’s (2001) study of secondary teachers’ assessment and grading practices, the

researchers developed a similar instrument. This instrument, however, also incorporated

questions “to emphasize actual teacher behaviors in relation to a specific class of students, rather

than more global teacher beliefs; teacher responded to all items once for language arts and once

for mathematics” (p. 206). In this way, the researchers could approximate elementary school

teachers’ unique context; they validated the instrument with the assistance of 15 elementary

school teachers. As McMillan did in 2001, the researchers used principal component analysis to

organize teachers’ practices. This analysis resulted in three components: teacher-constructed

performance tasks (e.g. constructed-response assignments, essays, projects, etc.), publisher- or

district-provided performance tasks, and teacher-constructed achievement tasks (i.e., major

exams and tests). Except for the frequency with which tasks were used, the researchers found no


significant differences between mathematics and English teachers’ assessment and grading

practices. The researchers also found confirmatory evidence suggesting “within-school variance

is greater than between-school variance; individual teacher preferences are more important than

are differences between schools in determining grading practices” (p. 212). This strengthens

McMillan’s earlier findings that teachers’ educational decisions involve reconciling a complex assortment of competing internal and external needs (McMillan & Nash, 2000; McMillan, 2001).

Building upon McMillan’s earlier work, Frey and Schmitt (2010) investigated the

assessment practices of 3rd- through 12th-grade teachers in Kansas. The researchers devised

their own survey instrument. Respondents were asked to use provided definitions for terms (n.b.,

traditional tests, performance tests) and answer questions about six aspects of classroom

assessment, including usage as measured by the “estimated percentage of the time they use

various types of assessments” (p. 109). Respondents were also asked for demographic data such

as their gender, subject area, and years of teaching experience. Frey and Schmitt found no

relationship between years of teaching experience and the frequency with which teachers used

teacher-constructed or publisher-constructed tests. They did, however, find a relationship

between teaching experience and the tendency to use short-answer formats as well as

performance-based (i.e. authentic, alternative) formats – both of which were more common

among less experienced teachers. They also found significant differences in assessment practices based on teacher gender, subject, and level. Female teachers were more likely to

use performance-based assessments than male teachers, and less likely to use teacher-constructed

assessments. Elementary teachers were less likely to use teacher-constructed tests. Frey and

Schmitt bemoaned that “a generation after the call for improved assessment practices . . . the


research focus is overwhelmingly on large-scale test development with little emphasis on

assisting teachers in developing high-quality classroom measures” (p. 116).

Researchers other than Zhang (1996) have used the STCEAS as a framework to assess

teachers' assessment knowledge and practices. In her 2014 dissertation, Gutierrez administered

the Assessment Literacy Inventory for Classroom Educators (A.L.I.C.E.) to a population of 94

Midwestern suburban teachers from three separate middle schools. She found that most teachers

in the sample reported never taking assessment coursework in their undergraduate or graduate

preparation. Additionally, despite reporting greater than ten hours of professional development,

as well as access to instructional coaches and regular administrative meetings, teachers felt they

lacked a sound understanding of assessment principles. For this population, Gutierrez found that

teacher training “explains 17% of the variability in classroom assessment practices, while

teachers’ assessment knowledge explains 38% of such variability in assessment practices” (p. 4).

She concluded that the findings supported previous research in that many teachers fail to receive

formal assessment coursework, but that training in assessment significantly increases the

likelihood of using a wider variety of assessment practices.

Researchers have also displayed a renewed interest in investigating the personal and

contextual factors associated with teachers’ use of assessment to make educational decisions.

Box et al. (2015) used the Personal Practice Assessment Theories (PPAT) framework as a lens to

investigate the educational decision making and assessment practices of three science teachers in

a west Texas suburban community. Like McMillan and Nash’s (2000) conceptual model of

assessment, this framework incorporates teachers’ contextual circumstances into the rationale for

educational decisions and assessment practices. Using ethnographic methods and a multiple case

study approach, Box et al. collected interview and artifact data. They found distinct differences


in assessment practices among the three teachers, confirming prior research that individual teachers’ assessment practices vary greatly with the contextual demands of their classrooms, schools, and districts.

Fulmer et al. (2015) explored the contextual factors affecting teachers' assessment

practices. In their conceptual model, they reviewed research that may explain teacher assessment practices from micro-, meso-, and macro-level perspectives. They found that most research addresses

teacher assessment practices at the micro- (i.e., local, classroom based) level, and that there are

fewer researchers conducting research at the meso- (i.e., school) level, or seeking to connect

micro- and macro- (i.e., national, policy level) contexts. They divided their discussion of

literature by each level and relevant topics. For the micro-level, they discussed research

surrounding teachers’ conceptions, beliefs, knowledge, and value of assessment, in addition to

specific teacher background variables. For example, they reported that teachers’ roles and

experiences have been shown to influence their assessment practices; teachers with greater

managerial responsibilities tended to value and utilize sound assessment principles with greater

frequency (p. 483). They also reported that secondary school teachers of science and humanities

have wider gaps between their values and practice than do teachers of language and creative arts

(p. 483). They concluded that researchers should strive to investigate teacher practices, values,

and knowledge of assessment at each of the three levels; untangling the complex interaction of

contextual (i.e., micro-) and circumstantial (i.e., meso- and macro-) variables is key to

understanding how to improve teacher practice.

...of Music Teachers.

Music teachers’ uses of assessments are varied. Researchers have consistently found that

when music educators assess student learning, most employ informal and formative assessments


such as “in class, down-the-line” performance assessments that are designed to provide checks of

skill or accuracy (Hill, 1999; Kancianic, 2006; LaCognata, 2011; McClung, 1996; McCoy, 1988,

1991; McQuarrie & Sherwin, 2013; Russell & Austin, 2010; Simanton, 2000). But, the bulk of

student appraisals are accounted for by “non-academic criteria” such as attendance and attitude

(Russell & Austin, 2010). Researchers have ascribed this phenomenon to several potential

causes: circumstantial or logistical challenges facing teachers (McClung, 1996), a lack of

administrative oversight and support when assigning grades (Russell & Austin, 2010), the

common conception amongst music educators that assessment is not appropriate for the

“subjective” experience of music (Denis, 2018), and a lack of teacher preparation in assessment

knowledge and use (Austin & Russell, 2019).

Most music education researchers have examined assessment practices through a

dissertation project; eight dissertations by music educators were published in the 1990s and early 2000s. The dissertation authors typically described music educator assessment practices

(n.b., Table 2.1). All but two dissertations were written by music educators with an interest in

exploring secondary -- particularly high school -- band teacher practices.

Table 2.1. Dissertations about Music Teacher Assessment Practices

Date | Author | Institution | Title
1996 | McClung | Florida State University | A descriptive study of learning assessment and grading practices in the high school choral music performance classroom
1999 | Hill | University of Southern Mississippi | A descriptive study of assessment procedures, assessment attitudes, and grading policies in selected public high school band performance classrooms in Mississippi
2001 | Hanzlik | University of Nebraska | An examination of Iowa high school instrumental band directors' assessment practices and attitudes toward assessment
2001 | Simanton | University of North Dakota | Assessment and grading practices among high school band teachers in the United States: A descriptive study
2002 | Sears | University of Massachusetts Lowell | Assessment in the instrumental music classroom: Middle school methods & materials
2006 | Kancianic | University of Maryland - College Park | Classroom assessment in U.S. high school band programs: methods, purposes, & influences
2006 | Sherman | Teachers College, Columbia University | A study of current strategies and practices in the assessment of individuals in high school bands
2010 | LaCognata | University of Florida | Current student assessment practices of high school band directors

McClung (1996) described learning assessment and grading practices in high school

choral contexts. Using three samples, a student sample from the 1995 Georgia High All-State

Chorus (n = 615; 100% return rate), a high school choral teacher sample (n = 160; 80% return

rate), and a high school principal sample (n = 150; 78% return rate), McClung surveyed

participants about the purposes and practices of assessment. He found that all groups perceived

grades to be an important component of the choral experience, and that the process of assigning

grades could potentially affect the public’s perception of the legitimacy and value of high school

choral music. Most notably, McClung found strong support for use of extra-musical criteria in

grading and assessment practices from all samples, including participation, attitudinal criteria,

and attendance. McClung speculated that these extra-musical criteria may provide opportunities

for students to demonstrate “achievement in the affective domain,” or for teachers to credit less

musically talented students for growth. His findings were mirrored in subsequent dissertations on

music teacher assessment practices.

In his dissertation, Hill (1999) examined the assessment practices, procedures, and

policies in high school band contexts in Mississippi. Like McClung, Hill surveyed a population


of students (n = 327; from the Mississippi Bandmasters’ Association State Band Clinic), teachers

(n = 93; Mississippi Bandmasters Association), and principals (n = 38; randomly selected public-

school administrators). Hill found that grades were considered an important part of the

instrumental classroom. He also found that extramusical (i.e., non-academic) criteria were

important and significantly weighted in grading practices. These criteria included attendance,

participation, and attitude. Hill also asked participants about their use of a range of assessment formats, from portfolios to traditional paper-and-pencil tests, and found that fewer than one out of

four teachers employed any academic achievement criteria in their assessment and grading

practices. Hill’s findings reflect the acute issues that professional organizations in the 1990s

sought to allay.

Hanzlik’s (2001) dissertation marks the first efforts of a music education researcher to

examine assessment practices and beliefs in tandem, as well as an effort to document differences in assessment beliefs based on teacher characteristics. Hanzlik developed an instrument -- the

Survey of Band Directors Attitude Toward Assessment (SBDAA) -- and administered it to a

sample of 200 randomly selected band directors in Iowa; 154 surveys were returned (77%

response rate). Contrary to Hill and McClung’s findings, Hanzlik found that Iowa band directors

tended to use playing or performance task assessments the most, followed by attendance,

“teacher observation”, participation, and sight-reading (p. 6). Hanzlik concluded that “the

emphasis of the instructional process in Iowa band rooms seems to be clearly on performance

learning and not on cognitive or affective learning” (p. 6). He also found that band teachers

generally held positive attitudes toward assessments. Using ANOVA and hierarchical regression,

Hanzlik found only one significant group difference, with band directors in Class A schools

exhibiting more positive attitudes toward assessment than band directors in larger Class 1A


schools. Additional analyses revealed a significant curvilinear relationship between high school

teaching experience and attitude toward assessment, with directors with the most and least

experience having more positive attitude scores than directors with 10-25 years of teaching

experience. These findings may have reflected the unique teaching context of Iowa at that

particular time.

Simanton (2001) conducted his dissertation the same year as Hanzlik. Like Hanzlik,

Simanton was concerned with the growing emphasis on academic reform and accountability, and

how these emphases might play out in music teacher assessment practices and beliefs. Using a

researcher-developed questionnaire, Simanton surveyed 202 high school band directors via a

regionally stratified sample (based upon six regions defined by the Music Educators National

Conference [MENC]). He found that few band directors -- nationally -- employed sound

assessment practices, but that they were highly satisfied with their current practices. Simanton

also found that directors of smaller bands, or bands led by directors with graduate degrees,

tended to use sound assessment practices with greater frequency. Further, there appeared to be

regional differences for director use of grading criteria and time spent on assessing students. Like

McClung, Simanton cited workload and contextual factors as significant reasons for directors’

deficiencies in assessment practice.

Sears’ (2002) master’s thesis was unique in that she sampled middle school instrumental

music instructors (i.e., band and orchestra). The purpose of the study was to describe the types of

assessments in use by middle school instrumental instructors in Massachusetts. She sent

questionnaires to 80 schools across southeastern Massachusetts, and a total of 42 responded

(52.5% response rate). Like McClung and Hill, Sears found that the majority of directors used

non-musical criteria in their grading practices, and that nearly all directors considered student


attendance and participation as appropriate and important measures. Other assessment practices

included the use of performance tasks (as found by Hanzlik in 2001), practice cards, and teacher-

devised rubrics. Overall, Sears found that most respondents’ schools required teachers to

consider existing state frameworks in assessment planning and usage.

In 2005, Kotora examined assessment strategies used by high school choral music

teachers and taught by college methods professors in Ohio. After surveying 246 high school

choral teachers (43% return rate), he found that non-musical criteria were frequently utilized;

attendance and participation were used as assessment criteria by 85% of choral teachers, with

attitude used by 74% of choral teachers. College methods professors (n = 20; 53% return rate)

also used non-musical criteria, but to a lesser degree (attendance, 55%; participation, 45%;

attitude, 35%). Musical criteria, including concert performances, singing and written tests,

audiotape recordings, and individual performances, were all used by at least 68% of the

participants. Kotora also collected information about whether the decision to use various

assessment practices was impacted by district, state, or national mandates, or personal choice.

Most participants reported that personal choice guided their decision to use specific assessment

practices. Kotora concluded that the profession should seek to provide clarity in the assessment

options available to teachers, increase the availability of technology, and conduct research to

explore the factors that inform teachers’ decisions to use specific assessment strategies.

Drawing from a random national sample, Kancianic (2006) surveyed 2,000 high school

band directors about their assessment methods, purposes, and classroom characteristics. He

received 634 (31.7%) completed questionnaires from participants. The independent variables

comprised personal (11 items) and school (11 items) characteristics. The stated dependent

variables were 23 assessment methods, 19 assessment purposes, and 23 factors influencing the


use of classroom assessment (e.g., logistics, performance expectations, administrative demands). Kancianic found that directors tended to employ performance task assessments, but fewer

academic assessment practices. Directors cited logistical reasons as the primary impediment to

utilizing other assessment practices, echoing findings reported by McClung (1996), Hill (1999),

and Simanton (2001). Kancianic reported that none of the MANOVA analyses revealed significant group differences in assessment methods, purposes, or influencing factors, likely due to the large number of variables involved.

Sherman (2006) investigated high school band directors’ assessment and grading

practices, the current tools employed in conducting assessments, and attitudes toward such

practices. Utilizing a random sample of 500 high school band directors from the NAfME Eastern

Region, Sherman surveyed instructors about the kinds of assessments used, including

performance tasks, writing samples, and student self-evaluation. The second phase of this

dissertation involved interviews with directors who participated in the survey. In the interviews,

participants described their use of assessments. Many did employ performance tasks, rubrics, and

portfolios, but used them infrequently, citing time and workload challenges; “with 160 students

to grade, it is impossible to perform any type of assessment more than once as I need to give up a

full week of teaching to do them” (p. 60). More often, Sherman found, teachers incorporated

non-musical criteria in grading practices, specifically, class participation, effort, and class

preparation. However, teachers also reported feeling that they needed to justify their practices;

“If I don’t have a strong background and supporting data for my strategy, I will be eaten alive in

this community. Any hint of ambiguity will be challenged. I have no option but to CYA at every

corner. That is what dictates my grading procedure, nothing else” (p. 61). Sherman’s study


confirmed prior researchers’ findings that classroom realities often impede teachers’ abilities to

formally assess students with frequency.

LaCognata (2010) studied current high school band directors’ assessment practices and

beliefs about assessment. Using a survey, LaCognata sampled 5,000 directors who held NAfME

(formerly MENC) membership; a total of 454 completed the survey (10% of the usable sample,

after undeliverable emails were removed). Overall, LaCognata found that “the main purpose of

student assessment for high school band directors centered on providing their students and

themselves with feedback concerning the instructional process in the classroom” (p. 10). This

represents a fundamental shift in thought from the earliest dissertations by McClung (1996), Hill

(1999), and Simanton (2001), in which the researchers found that the summative purposes of

education (and related types of assessments) were cited more frequently. However, for the

directors in LaCognata’s study, nonmusical criteria still represented a significant component of

teachers’ grading practices.

In a 2010 study, Russell and Austin surveyed 352 secondary music teachers in the

NAfME Southwest region about specific assessment and grading practices as well as any

contextual or individual difference variables that may influence their practices. They found, like

LaCognata, that while the use and weighting of non-musical criteria in assessing and grading

was less pronounced when compared to studies conducted in the 1990s or early 2000s, such

criteria still accounted for the preponderance of students’ grades. Specifically, non-musical

criteria (i.e., attendance, attitude, practice, participation) accounted for an average weighting of

60% of student grades. Within non-musical criteria, Russell and Austin found that directors used

a combination of “subjective impressions and objective documentation to assess” factors (p. 44).

They also found that teachers were most likely to use traditional (i.e., written), in-class


approaches for assessing knowledge, and “down-the-line” performance tasks to assess skills (p.

46). Using MANOVA analysis, Russell and Austin also found that middle school choral

directors gave more weight to written assessments than their instrumental colleagues but found

no such difference between vocal and instrumental high school directors. High school directors

gave greater weight to attendance than did middle school directors, and choral high school

directors gave greater weight to attitudinal factors than did their instrumental colleagues. Russell

and Austin concluded that music teachers may face “the greatest challenge in moving their

assessment paradigm out of the 20th century” if their assessment practices continue to lag behind

assessment principles espoused by experts and framed in policy.

As previously discussed, most researchers have examined music teachers’ assessment

practices and beliefs in secondary and/or instrumental contexts. Fewer have considered other

levels or specializations (e.g., middle school, elementary, vocal, or general contexts). McQuarrie

and Sherwin (2013) recently explored the relationship between elementary music teachers’

assessment practices and assessment topics in literature aimed toward them. First, they collected

data from 100 elementary general music teachers in the Northwestern United States about their

assessment practices. Then, they reviewed ten years (1999-2009) of the national publications

Teaching Music and Music Educators Journal for articles related to classroom music assessment.

Finally, they ranked the reported classroom assessment techniques, and those found in the

literature, by frequency. They found a discrepancy between the assessment strategies elementary music teachers reported using most frequently and those espoused in the literature, matching the findings of Russell and Austin (2010).

Austin and Russell (2017) recently explored the role that teacher occupational identity

and assessment conceptions may play in assessment practices. Similar to their 2010 study, they


asked 423 secondary music teachers in the United States about assessment practices. In

alignment with their previous findings, they found that the most heavily weighted grading

criteria were performance skills (35%), followed by non-musical criteria (attitude, 24%;

attendance, 14%). By exploring assessment conceptions (used interchangeably with beliefs and

valuing throughout assessment literature), Austin and Russell identified that music teachers who

valued assessment were more likely to target musicianship outcomes and utilize a mixture of

formative and summative assessments in their practice than music teachers who did not value

assessment. However, they did not find that teacher identity was strongly correlated with assessment conceptions. Their findings warrant further examination of the role of assessment

beliefs in impacting music teacher assessment practice.

In Summary

Researchers in general education and music education have conceptualized and measured

teachers’ assessment and grading practices in a number of ways. Some have described specific

kinds of assessments (e.g., teacher constructed tests, externally produced tests, rubrics, etc.) that

teachers may use, such as Aschbacher (1999). Other researchers have described teachers’

assessment practices by the cognitive or affective outcomes that may be achieved (McCoy,

1988). Music education researchers have described the purposes for which teachers employ

assessments (e.g., achievement, extramusical). There does not appear to be consensus amongst

researchers about the most effective way to have teachers self-report their assessment practices.

Some researchers, like Austin and Russell (2010, 2017), have asked teachers to estimate the

weight of specific assessments in a grading scheme. Frey and Schmitt (2010), however, asked

teachers to self-report the frequency with which they employed specific assessments. Asking

teachers to report the purposes for which they use assessments may be necessary to


understanding the complex interaction of external and internal factors that inform their

assessment practices, as McMillan and Nash (2000) suggested.

Both general education teachers’ and music teachers’ assessment practices have been

described as lacking (Colwell, 2008; Stiggins, 2014). Informal methods of assessment (e.g.,

questioning techniques, “fist to five”, or other participation-based methods of engaging students)

are favored by general classroom and music teachers, as opposed to formal and alternative

methods of assessing (e.g., exit tickets, performance assessments). Music teachers appear to use

nonacademic criteria (e.g., attendance, attitudinal criteria) as the basis for grades to a greater

extent than their general education peers. Both sets of teachers also appear to account for non-academic criteria in their grading and assessment practices, in part to motivate students or to account for effort and growth in students who perform below proficiency. Finally, researchers of

both populations appear interested in examining the role of assessment beliefs in teachers’

educational decision-making processes and assessment practices. McMillan and Nash (2000)

devised a conceptual model that may prove useful in determining how teachers resolve tensions arising from their own beliefs and literacy, external mandates, and classroom realities, and which of those factors, if any, influence assessment practices.


Chapter 3

Methodology

In this investigation I used a pretest-posttest control group design to examine the

effectiveness of an online professional development intervention for music educators in changing

assessment literacy, beliefs, and practices. The intervention consisted of a four-week online,

module-based course. A secondary purpose of this investigation was to explore relationships

among music teachers’ assessment literacy, beliefs, and practices.

My research questions were:

1. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment literacy?

2. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment beliefs?

3. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment practices?

4. Are there significant relationships between music teachers’ assessment literacy, beliefs,

and practices?

The independent variable had two levels corresponding to the group or condition to

which participants were randomly assigned (i.e., the control or intervention group). The

dependent variables were assessment literacy, beliefs, and practices. In this chapter, I describe

the research design and intervention, participant sampling, the instruments used to collect data,

and procedures related to data collection and analysis.


Research Design and Intervention

Research Design

I used a pretest-posttest control group design. This design allowed me to determine

whether the intervention was effective in changing music educators’ assessment literacy, beliefs,

and/or practices. Drawing upon a national sample of music educators, I administered a

prescreening questionnaire to determine respondents’ eligibility to participate, requested that

they complete the pretest (including separate measures of assessment literacy, beliefs, and

practices), and then randomly assigned participants to either a control or intervention group

(Figure 3.1). The control group was a true control, in that its participants did not receive professional development or any form of communication from me prior to my administering the posttest. The pretest-posttest control group design minimizes many threats to

internal validity, including history effects, selection effects, and testing effects, because the

influence of environmental factors can be minimized or controlled for across groups (Adams &

Lawrence, 2019; Campbell & Stanley, 1963).


Figure 3.1

Conceptual Diagram of Study Procedures

Intervention Design

I implemented the intervention, a four-week online professional development program, via Google Classroom. Google Classroom is a commercial learning management system, akin to programs such as Canvas, Blackboard, or Moodle. Participants accessed the course,

hereafter called Music Teacher Assessment Workshop (MTAW), using a personal email


associated with Google mail. Google Classroom is organized by course. Email addresses associated with professional or institutional accounts may not be authorized to use Google Classroom; thus, it was important that participants use a personal email account. Within the platform, I created the course and its modules, and invited intervention group participants to enroll via email. Participants used a class code to enroll in the course.

Course Organization & Elements

The course was titled Music Teacher Assessment Workshop (MTAW). All courses in

Google Classroom are organized around a home page, called the “Stream.” At the top of the

page were four tabs, including the homepage: Stream, Classwork, People, and Grades (Appendix

D). The Stream is a running collection of updates the instructor has made to the course or

announcements from the instructor. The Classwork tab is the repository for the modules. The

People tab is a directory of the course participants. The Grades tab is a gradebook for the

instructor and an individualized report card for participants. Participants did not receive formal

grades for this course, but I did provide notification after they completed each module. Further,

every module was given a categorical weight of 25%, which helped communicate progress to

participants throughout the workshop.

The Classwork tab is organized by Topic. Google Classroom uses Topics in the way that Canvas, Blackboard, or Moodle use modules; that is, each Topic serves as an anchor point for

assignments and materials. For the MTAW, there were five topics: Week 1, Week 2, Week 3,

Week 4, and Questions & Comments. Topics were listed in the order that participants completed

them. Additionally, the Classwork tab included a link to a Google Calendar for the workshop.

Participants could sync this calendar to their personal accounts, or reference it within the class to

see when each module, or Topic, should be completed. Participants were provided access to each


Topic on a weekly basis, beginning on March 10th and ending on April 19, 2020. Access was cumulative: participants could access the first Topic as soon as they registered, the first and second modules during the second week, the first through third modules during the third week, and all four modules during the fourth week.

Each Topic was aligned to a specific STCEAS standard; thus, the assignments and

materials associated with each Topic were also aligned to the standard. There were three

assignments per Topic: an Overview, a Discussion, and a Teacher-Constructed Task. A full

listing of the tasks teachers completed each week can be found in Table 3.1.

Table 3.1. Participant Tasks

Pretest (Qualtrics; 25 minutes total): CALI questionnaire (15 minutes), MTAII questionnaire (5 minutes), MTABI (5 minutes)

Week 1 Module (Google Classroom MTAW Course; 2.5-5 hours total): Review of Materials (1-2 hours), Discussion Board (0.5-1 hour), Teacher-Constructed Task (0.5 hour)

Week 2 Module (Google Classroom MTAW Course; 3-5 hours total): Review of Materials (1-2 hours), Discussion Board (0.5-1 hour), Teacher-Constructed Task (1 hour)

Week 3 Module (Google Classroom MTAW Course; 3-5.5 hours total): Review of Materials (1-2 hours), Discussion Board (0.5-1 hour), Teacher-Constructed Task (1-2 hours)

Week 4 Module (Google Classroom MTAW Course; 3-5.5 hours total): Review of Materials (1-2 hours), Discussion Board (0.5-1 hour), Teacher-Constructed Task (1-2 hours)

Posttest (Qualtrics; 25 minutes total): CALI questionnaire (15 minutes), MTAII questionnaire (5 minutes), MTABI (5 minutes)


Within the Overview assignment, participants could access a collection of five to six

articles and print resources. I asked participants to read all of the materials, and encouraged them

to download articles and materials for their personal use. There was no specific accountability

measure to ensure that participants accessed and read all of the materials. However, the

discussion board each week was based upon one of the readings from each collection. This was

an intentional part of the workshop design. Requiring participants to access all of the materials

could potentially discourage engagement in the workshop, and I hoped to provide differentiation

through levels of engagement.

Each Discussion assignment description directed participants toward the web-based

application Perusall. Perusall is a discussion board hosted directly on a text (Figure 3.2). Each

week, participants responded to discussion questions corresponding to one of the required

materials, and comments from their peers. This application allows participants to engage in

discussion directly on a text, rather than in a discussion board one step removed from a text.

Participants can add emoji, pictures, video, hyperlinks, or most other media to their comments;

this platform is highly participatory and authentic to digital learning contexts, which researchers

have found to be more effective than traditional formats (Boling et al., 2012; Cook et al., 2017).

Further, providing opportunities for teachers to work together has been shown to increase

satisfaction with online professional development, and the likelihood that teachers will complete

online programs (Yurkofsky et al., 2019). While this discussion activity through Perusall was

interactive, participants’ comments were anonymous to one another, but visible to me as the

instructor. This precaution was taken to reduce any possible biased or negative interaction

between participants, or the unlikely (given the use of a national sample) possibility of

participants knowing one another professionally or personally.


Figure 3.2

Perusall Discussion Board on an Assessment Text

The Teacher-Constructed Task was directly tied to a corresponding STCEAS standard

each week. Over the course of the professional development, teachers selected a cohort of

students for whom they could design an assessment, developed an assessment (informed by

sound assessment principles), interpreted their assessment data, and used the results for future

educational decision-making. Each week, teachers submitted either a written account of how

they completed the task, a physical copy of the assessment they designed, or both. Therefore, this

assignment cumulatively kept the course grounded in participants’ classroom practice, and

connected theory to practice (Darling-Hammond, 2002). Due to the COVID-19 pandemic, which

resulted in nation-wide closures of schools and a rapid transition to online and distance learning

formats (UNESCO, 2020), I altered the task requirements for the third and fourth tasks. I

provided all teachers with a dataset for purposes of interpretation given that they could not

feasibly collect actual student data; all related materials can be found in Appendix D, and an

exemplar of a completed teacher-constructed task is in Appendix M. Then, using the data

participants were provided, they reflected upon what their next steps would be instructionally,

and how they might change the assessment for future use.


As a form of reciprocity to control group participants, I offered the workshop again at the

conclusion of the study, from June 1 through June 28, 2020. Finally, I planned to present a

summary report of my findings from this investigation to all participants in late summer 2020.

Population and Sampling

My initial plan for conducting this study involved working with a medium-size school

district in Maryland that employed 50 music teachers across 17 elementary/K-8 schools and

seven middle/high schools. Only two music teachers, however, agreed to participate in the study.

Therefore, I abandoned the initial plan and decided to utilize the National Association for Music

Education’s Survey Research Assistance program to disseminate the study description and

informed consent documentation via email to a national population of approximately 20,000

music educators in the United States. I delimited the target population to active K-12 music

educators of any interest area (i.e., band, orchestra, choral, technology, etc.) with NAfME

membership. Because music educators had to open and read an email invitation and complete a prescreening questionnaire before they could participate in the study, participants constituted a volunteer sample.

Following IRB approval of my study on February 12, 2020, NAfME approved and

distributed my project description and prescreening questionnaire on March 10, 2020, with the

intent to resend the email one week later. The purposes of the prescreening questionnaire were to

(a) ensure that all participants met my aforementioned criteria for participation (n.b. NAfME

members include preservice teachers, retired teachers, and college-level teachers), (b) provide

prospective participants with a detailed description of the study, (c) obtain informed consent, (d)

collect demographic information from participants, and (e) randomly assign participants to the

control or intervention group. Due to the impact of the emerging COVID-19 pandemic, NAfME


halted all research after March 17, 2020. Thus, my sample was limited to the music educators

who completed the prescreening questionnaire during the first week. The Research Survey Assistance program provided information about how members engaged with the email. The email,

containing the project description and link to the prescreening measure (Appendix I), was sent to

20,474 members. Of that number, 604 emails were rejected by the members’ accounts, thereby

reducing my accessible population to 19,870. Of that number, 6,309 opened the email, and 247

clicked on the link to the prescreening questionnaire. Finally, of the 247 members who accessed

the prescreening measure, 108 completed the measure in full (i.e., informed consent and

demographic information). I used features in the Qualtrics survey design to randomly assign the

108 participants who completed the prescreening measure to the control or intervention group;

because participants and myself were aware of the group to which they were assigned, this was

not a blinded design. However, in simple random assignment, each participant has an equal

chance of experiencing either condition (Adams & Lawrence, 2019, p. 288).
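For illustration, simple random assignment can be sketched in a few lines of Python; this is a hypothetical sketch, not the proprietary Qualtrics mechanism used in the study:

```python
import random

# Hypothetical list of consented participant emails
participants = ["teacher01@example.com", "teacher02@example.com",
                "teacher03@example.com", "teacher04@example.com"]

# Simple random assignment: each participant has an equal chance of
# landing in either condition, independent of the others.
assignments = {p: random.choice(["control", "intervention"])
               for p in participants}
print(assignments)
```

Note that simple random assignment of this kind need not produce equal group sizes.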

Selection and Development of Research Measures

Assessment literacy is an adaptable and applicable understanding of the components and

uses of assessment in educational decision making (Stiggins, 1991). To be assessment literate,

teachers must have a firm understanding of sound assessment principles -- selecting, designing, implementing, scoring, and interpreting assessments -- and how to use subsequent data in

educational decision making. Quantitative education researchers have traditionally measured

assessment literacy using the seven Standards for Teacher Competence in Educational

Assessment of Students (STCEAS) as benchmarks for subscales (Impara et al., 1993; Mertler,

2000; Mertler & Campbell, 2005). To measure music teacher assessment literacy, I selected the

Classroom Assessment Literacy Inventory (CALI). The CALI and related measures (TALQ, ALI)


have been used extensively in studies of preservice and inservice teachers’ assessment literacy.

While there are continuing concerns about the internal consistency and construct validity of the

CALI when used with inservice teachers, it was the instrument that could best measure

assessment literacy in a contextualized manner (i.e., assessment principles grounded within

realistic vignettes) and be most readily adapted for use with music teachers working within a

range of school settings and sub-specialties (Mertler, 2002). The psychometric properties of the

adapted CALI are reported in Chapter 4.

Assessment practices are the specific methods teachers use to gather or elicit information

about student learning. Researchers in general education and music education have developed

numerous ways to describe and quantify assessment practices (Rowan & Correnti, 2009). In

general education, researchers have described assessments based upon whether they were

teacher- or publisher-constructed (McMillan, 2001), format (Oosterhof et al., 1995), their

purpose (Box et al., 2015), and even specific assignment types (Frey & Schmitt, 2010). In music education, McCoy (1988) categorized assessment practices based upon the ways that students may be engaged (e.g., cognitive, psychomotor, affective, and non-music), while Russell and

Austin (2010) categorized assessments based upon achievement or non-achievement criteria.

Regardless of the nomenclature, researchers do tend to use frequency ratings to quantify

assessment practice (Rowan & Correnti, 2009). I developed an instrument -- the Music Teacher

Assessment Implementation Inventory (MTAII) -- that participants used to report the frequency

with which they employed specific forms of assessment, and used assessments for specific

functions (e.g., summative, formative, diagnostic, placement, and extramusical). I believed it was

important to ask music teachers how often they employ assessment for various functions because


researchers have found that many music teachers use assessment informally and to evaluate

extramusical criteria (e.g., participation, attitude, attendance).

Assessment beliefs comprise the values, conceptions, and attitudes teachers hold about

and toward assessment; they “merge affect and concept” (William, 1979, as cited in Fulmer et al.,

2015, p. 478). Educational researchers, particularly Brown, have studied teacher assessment

beliefs using a priori designations with teaching populations worldwide. Generally, researchers

have found that some teachers hold contradictory beliefs about the purposes and value of

assessment; they may perceive assessment as useful and necessary for instructional feedback and

accountability, but also as potentially irrelevant, or even detrimental, to their educational

decision making (Austin & Russell, 2017, 2019; Opre, 2015). Based upon existing research,

music teachers may hold different conceptions of assessment from the general teaching

population as a result of the specialized nature of the discipline. Thus, I chose to utilize an

existing measure specifically designed for music teachers to examine the beliefs participants held

about the purposes of assessment. This measure, the Music Teacher Assessment Beliefs Inventory

(MTABI), was adapted and evaluated by Austin and Russell (2017) from Brown’s TCoA

questionnaire.

Data Collection Instruments

Prescreening Questionnaire and Informed Consent

I distributed the invitation to participate in the study via email through NAfME’s

Research Survey Assistance program. The prescreening questionnaire contained four sections

(Appendix E). First, respondents were asked if they were a current K-12 music teacher to ensure

they were part of my target population. Respondents who selected “no” were directed to a

message thanking them for their interest, and informing them they were not eligible to participate


in the study. I used the “avoid ballot box stuffing” feature in Qualtrics to ensure that these

respondents would not access the prescreening measure again from the link they were provided.

Respondents who selected “yes” were directed to the informed consent document. In addition to

providing consent, participants provided their school email, which was used to link their

responses throughout the study. Third, participants responded to a series of 11 items that

addressed demographic characteristics and educational experience.

Assessment Literacy

The CALI, as well as the MTAII and MTABI, constituted the pretest questionnaire. I

adapted a measure previously used for determining the assessment literacy of varied teacher

populations -- the Classroom Assessment Literacy Inventory or CALI (Appendix F). I selected

the CALI as opposed to other assessment literacy instruments (i.e., TALQ, ALI, ALICE, or

tKUDA) for reasons discussed in the above section. The fact that the CALI often exhibits

marginal reliability when used with inservice teachers may be due to the number and nature of

items (i.e., how the measure is constructed), characteristics of teachers completing the CALI (i.e.,

most music teachers have minimal formal education in assessment), and the type of

information it is designed to provide (i.e., literacy measured in terms of the number of items

answered correctly across distinct standards). According to classical test theory, research

instruments will produce optimally reliable outcomes when they include an extensive number of items of similar difficulty, when they are administered to a sample that is large and

diverse, and when the measure represents strongly related dimensions of an attribute or construct

(Thompson, 2010). The original instrument (Appendix C) was constructed to reflect the

assessment competencies articulated in the STCEAS. There were 35 items, and seven dimensions

or subscales (i.e., each subscale corresponding to a standard, with 5 items per subscale). Each


item consisted of a vignette or scenario. Respondents selected one of four options that they believed best answered the question. Table 3.2 includes sample items from each of the seven

dimensions (i.e., subscales) of assessment literacy measured by the CALI.

Table 3.2. Sample Items for Subscales of Assessment Literacy in the Original CALI

1. Teachers should be skilled in choosing assessment methods appropriate for instructional decisions.
Sample item: Mrs. Bruce wished to assess her students' understanding of the method of problem solving she had been teaching. Which assessment strategy below would be most valid?

2. Teachers should be skilled in developing assessment methods appropriate for instructional decisions.
Sample item: Ms. Gregory wants to assess her students' skills in organizing ideas rather than just repeating facts. Which words should she use in formulating essay exercises to achieve this goal?

3. The teacher should be skilled in administering, scoring and interpreting the results of both externally-produced and teacher-produced assessment methods.
Sample item: Many teachers score classroom tests using a 100-point percent correct scale. In general, what does a student's score of 90 on such a scale mean?

4. Teachers should be skilled in using assessment results when making decisions about individual students, planning teaching, developing curriculum, and school improvement.
Sample item: Ms. Camp is starting a new semester with a factoring unit in her Algebra I class. Before beginning the unit, she gives her students a test on the commutative, associative, and distributive properties of addition and multiplication. Which of the following is the most likely reason she gives this test to her students?

5. Teachers should be skilled in developing valid pupil grading procedures which use pupil assessments.
Sample item: A teacher gave three tests during a grading period and she wants to weight them all equally when assigning grades. The goal of the grading program is to rank order students on achievement. In order to achieve this goal, which of the following should be closest to equal?

6. Teachers should be skilled in communicating assessment results to students, parents, other lay audiences, and other educators.
Sample item: In a routine conference with Mary's parents, Mrs. Estes observed that Mary's scores on the state assessment program's quantitative reasoning tests indicate Mary is performing better in mathematics concepts than in mathematics computation. This probably means that:

7. Teachers should be skilled in recognizing unethical, illegal, and otherwise inappropriate assessment methods and uses of assessment information.
Sample item: Mrs. Brown wants to let her students know how they did on their test as quickly as possible. She tells her students that their scored tests will be on a chair outside of her room immediately after school. The students may come by and pick out their graded test from among the other tests for their class. What is wrong with Mrs. Brown's action?

For the purposes of this study, I adapted the CALI in two important ways. First, I reduced

the number of subscales from seven to four (and thus, the number of items from 35 to 20), and


altered the vignettes to reflect contexts familiar to music teachers while retaining the content

objective associated with each item (Table 3.2). Second, all items in this measure were randomized to reduce the likelihood of a testing effect (i.e., pretest and posttest versions of the questionnaire functioned as parallel forms of the same measure). The decision to delimit the

number of items and subscales in the CALI was based upon several factors: (a) researchers have

identified the first four standards as the most likely to be improved through professional

development with inservice teacher populations (Mertler, 2005, 2009), (b) the first four standards

address competencies that are grounded in classroom practices and educational decision making,

(c) music teachers, specifically, are reported to need training in classroom assessment principles

(Russell & Austin, 2010), and (d) I was interested, as the researcher, in designing a professional development intervention that would be viewed as both meaningful and manageable, thereby reducing participant attrition.

Assessment Practices

I used a researcher-developed measure, the Music Teacher Assessment Implementation

Inventory (MTAII), to evaluate the frequency and manner with which music teachers employ

specific forms of assessment, and the functions for which they use assessment in their

classrooms. This measure consisted of two parts (Appendix G). In the first section, respondents

self-reported how often they utilized specific assessment forms. In the second section,

respondents self-reported how often they used assessments for a given function.

The MTAII was developed after consulting prior research by both general education and

music education researchers. This measure consisted of two matrices whose items were rated on a 5-point frequency scale ranging from “Never” to “Almost Always.” To frame responses, I asked

participants to reflect upon a class that represented the most typical example of their teaching


area. Then, I asked them to self-report the frequency with which they had used specific forms of

assessment in the last four weeks. I determined four weeks would be the longest amount of time

teachers could accurately self-report their assessment usage through recall, or checking relevant

lesson plans and/or their gradebook. Eight forms of assessment were presented: written

tests/quizzes, written classwork/homework, group performance, individual performance,

projects, portfolios, attendance, and participation. In the second matrix, participants were asked,

using the same framing and scale, the frequency with which they had employed assessments for

a given function within the last four weeks. I presented five assessment functions (i.e.,

summative, formative, diagnostic, placement, and extramusical) common to music teachers’

practice. For the posttest, I slightly modified the MTAII; rather than asking participants to reflect

upon the previous four weeks, I asked them to project the frequency with which they would use

specific forms of assessment, and for what functions, in the four weeks to come.

Assessment Beliefs

Researchers have speculated that, due to the contextual realities of their jobs, music

teachers may hold conceptions of assessment that differ from the general teaching population

(Russell & Austin, 2010). While Brown’s TCoA questionnaire has been validated for a general

teaching population, the Music Teacher Assessment Beliefs Inventory (MTABI) is a measure

adapted for use with music teacher populations (Austin & Russell, 2017). This measure consists of 17 items (nine positively phrased, eight negatively phrased) characterizing assessment of music learning in terms of importance, value, relevance, and trustworthiness; participants were directed to indicate their level of agreement with each statement using a six-point Likert-type scale

(Appendix H). In their study, Austin and Russell (2017) reported high internal consistency (α =

.92) for their measure.


Intervention Group Posttest

The posttest was identical to the pretest with the exception that music teachers assigned

to the intervention group were asked to respond to five open-ended questions about their

experience completing an online professional development program. Responses to these

questions were ancillary to the main research questions, but provided me, as the researcher, with

some basis for contextualizing the findings and judging the efficacy of the intervention. As with

the pretest, items were randomized within each section (corresponding to the CALI, MTAII, and

MTABI) to control for a testing effect and response set.

Procedures

Pilot Testing

I pilot tested the pretest questionnaire (inclusive of the CALI, MTAII, and MTABI

measures) between January 6, 2020 and January 31, 2020 using a sample of 28 Colorado music

educators. I viewed pilot testing as an opportunity to establish the face validity of the research

measures, refine certain items, and gauge the user experience (e.g., the time and effort required

to complete the questionnaire). Face validity is an informal way of evaluating whether a measure

appears appropriate to assess a construct (Adams & Lawrence, 2019). In addition, I sought input

from several music teacher educator colleagues about the layout and accessibility of the online

professional development course. Based upon user feedback, I modified the wording of two

questions in the assessment literacy measure, and limited the number of items per page to five.

Further information about the psychometric properties of the CALI and MTABI measures can be

found in Chapter 4; because the MTAII was designed to capture purely descriptive data about

music educators' assessment practices, psychometric analysis was not warranted.


Prior to implementing the main study and gathering data, I completed the IRB procedures

for all human subject research under the auspices of the University of Colorado Boulder. I was

granted IRB approval to conduct the study on February 13, 2020. Then, I applied for NAfME’s

Research Survey Assistance, which was approved on March 6, 2020 (Appendix A).

Data Collection

The solicitation email, informing target population members about the study and

encouraging them to participate, was distributed on March 10, 2020. While my original intention

was for NAfME to send a follow-up email the very next week, NAfME ceased all research

activities on March 17, 2020 due to the COVID-19 pandemic. I did, however, use the emails of

people who had provided informed consent but not completed the pretest measure to send a

second wave of solicitation emails on March 18, 2020. Participants (N = 108) who completed the

prescreening questionnaire and provided informed consent were automatically randomly

assigned to the control or intervention group (through an automated feature within Qualtrics),

and then emailed the pretest measure by using an automated trigger email function (Appendix J).

For those in the intervention group who completed the pretest measure, another automatic email

was sent containing instructions for registering to take the MTAW. I closed the pretest measure

on March 23, 2020. A total of 74 participants completed the pretest (39 in the intervention group,

35 in the control group).

During the subsequent four weeks, from March 23 through April 19, 2020, I used Google

Classroom to provide music teachers assigned to the intervention group with a professional

development (PD) experience focused on assessment. Each week, I sent PD participants an email

describing the topic and tasks to be covered within the module. I monitored the class daily, in

order to respond quickly to participant comments and questions about the intervention. I also


reviewed participants’ submissions and provided informal, formative feedback in written form,

as would be customary for most multi-week PD offerings. At the conclusion of the intervention I

emailed all participants, including those in the control group, the link to the posttest (Appendix

J). The posttest was kept open until April 30th, and I sent two subsequent reminder emails to

participants to complete the posttest.

Data Analysis

On April 30th, I closed the posttest and downloaded the data files from Qualtrics. A total

of 18 intervention group participants completed the posttest, along with 25 participants who had

been assigned to the control group. All files were compiled into one database where participant

responses were linked through their school email. Data were analyzed using the Statistical

Package for Social Sciences (SPSS, Version 26.0.0.0 for Windows 10, 2019). The file contained

a total of 179 variables, including demographic information (13 items), as well as pretest and

posttest responses to the 20 assessment literacy items, the two assessment practice matrices, and

the 17 assessment belief items. I recoded the assessment literacy items to reflect their dichotomous nature, using “1” for correct responses and “0” for incorrect responses. This allowed me to

compute total participant scores on the assessment literacy items, as well as difference scores

between participants’ performance on the pretest and posttest. I also reverse coded participant

responses to the negatively phrased items within the assessment beliefs matrices, and then

summed responses across items to compute a total assessment beliefs score. Then, I computed a

difference score for participants’ assessment beliefs. Finally, I summed participants’ responses to

individual items on the MTAII, and computed difference scores for each item.
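To illustrate this scoring pipeline, here is a minimal pandas sketch; the file name and column labels are hypothetical stand-ins for the actual Qualtrics export:

```python
import pandas as pd

# Hypothetical merged data file; column names are illustrative stand-ins.
df = pd.read_csv("compiled_responses.csv")

# Score each assessment literacy item dichotomously: 1 = correct, 0 = incorrect.
answer_key = {"pre_cali_01": "B", "post_cali_01": "B"}  # 40 entries in full
for item, correct in answer_key.items():
    df[item] = (df[item] == correct).astype(int)

# Reverse-code negatively phrased belief items on the six-point scale.
for item in ["pre_mtabi_03", "post_mtabi_03"]:  # hypothetical item labels
    df[item] = 7 - df[item]

# Composite totals for each wave, then difference scores (posttest minus pretest).
for wave in ("pre", "post"):
    cali_cols = [c for c in df.columns if c.startswith(f"{wave}_cali")]
    df[f"{wave}_literacy_total"] = df[cali_cols].sum(axis=1)
df["literacy_diff"] = df["post_literacy_total"] - df["pre_literacy_total"]
```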

After coding responses and creating composite measures and difference scores, I used

descriptive statistics to summarize participant responses and assess the normality of responses to


the assessment beliefs measure. Then, I estimated the internal consistency of the CALI

instrument (using a Kuder-Richardson 20 procedure) and the MTABI instrument (using

Cronbach’s alpha). I also used Pearson correlations to analyze the relationships among CALI

items, their respective subscales, and participants’ total scores on the pretest.
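Both reliability coefficients share the same general form and are straightforward to compute by hand; the following is a minimal NumPy sketch (the study itself used SPSS):

```python
import numpy as np

def cronbach_alpha(X: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of scale scores."""
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1).sum()
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def kr20(X: np.ndarray) -> float:
    """Kuder-Richardson 20 for a respondents-by-items matrix of 0/1 scores."""
    k = X.shape[1]
    p = X.mean(axis=0)            # proportion of correct responses per item
    pq_sum = (p * (1 - p)).sum()  # item variances for dichotomous items
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - pq_sum / total_variance)
```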

To determine whether the professional development intervention had a significant effect

on music teacher assessment literacy and beliefs, I used a multivariate analysis of variance

(MANOVA) with assigned group serving as the independent variable and assessment literacy

and assessment belief difference scores (posttest score minus pretest score) serving as the

dependent variable set. Historically, the use of difference scores to determine whether an

experimental manipulation or educational intervention has had an intended effect has been

discouraged by some statisticians and psychometricians. In recent years, however, scholars have

qualified such criticisms depending on the kind of design employed and the nature of the

research question (Henson, 2001; Thomas & Zumbo, 2012; Thompson, 2010). For example,

Thompson (2010) objected to the use of repeated measures ANOVA in designs where there is

only one posttest, and argued that difference scores were more accurate in determining the

efficacy of an intervention in such designs. Thomas and Zumbo (2012) demonstrated

mathematically why there is little difference in the statistical outcome between using difference

scores in a MANOVA and using a repeated measures analysis. They stipulated, however, that

there might be slight differences in the significance level and effect size.
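A minimal sketch of this difference-score MANOVA using statsmodels, with hypothetical column names (the study itself used SPSS):

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("compiled_responses.csv")  # hypothetical merged data file

# Difference scores: posttest minus pretest for each dependent measure.
df["literacy_diff"] = df["post_literacy_total"] - df["pre_literacy_total"]
df["beliefs_diff"] = df["post_beliefs_total"] - df["pre_beliefs_total"]

# One-way MANOVA: assigned group as the independent variable, the two
# difference scores as the dependent variable set.
fit = MANOVA.from_formula("literacy_diff + beliefs_diff ~ group", data=df)
print(fit.mv_test())
```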

In addition to the MANOVA for intervention effects on music teacher assessment literacy

and beliefs, I used a Mann-Whitney U procedure to compare assessment practices of music

teachers representing the intervention and control groups. This was necessary because the data

for participants’ assessment practices did not meet the assumption of normality required for the


parametric equivalent. Upon examination of the descriptive statistics from participants’

responses to the assessment practice items, it was apparent that participants interpreted the items

as ordinal rather than interval in nature (i.e., that the differences between points on the scale were not equal). This further reinforced my decision to use a Mann-Whitney U test to compare

participants’ pretest and posttest frequency reports for assessment practices.
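A minimal SciPy sketch of one such item-level comparison, with hypothetical ratings (the study itself used SPSS):

```python
from scipy.stats import mannwhitneyu

# Hypothetical 1-5 frequency ratings ("Never" to "Almost Always") for one
# MTAII item, by group.
control = [2, 3, 1, 4, 3, 2, 5, 3, 2, 4]
intervention = [3, 4, 2, 5, 4, 3, 4, 5]

u_stat, p_value = mannwhitneyu(control, intervention, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```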

Finally, I used Pearson correlations and Spearman’s Rho analyses to answer the fourth

research question and explore possible relationships between music teachers’ assessment

literacy, beliefs, and practices. A Spearman’s Rho operates similarly to the Pearson correlation,

but can be used when one or more variables being correlated reflect ordinal measurement.

Additionally, while “in parametric correlation the relationship between the two variables should

be linear, the relationship between any two variables being examined through Spearman

correlation should be monotonic” (Russell, 2018, p. 292). That is, the relationship between the

variables should demonstrate mutual growth or decrease, or an inverse relationship. Because the

assessment practices data were ordinal, this test was the most appropriate analysis to use. I

utilized both participants’ pretest and posttest scores for the measures; specifically, the summed

pretest and posttest literacy scores, the summed pretest and posttest assessment belief scores, and

the self-reported pretest and posttest mean scores of assessment practices.
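A minimal SciPy sketch of these correlational analyses, with hypothetical scores: Pearson for the interval-level literacy and beliefs totals, Spearman for the ordinal practice ratings.

```python
from scipy.stats import pearsonr, spearmanr

literacy = [12, 15, 9, 14, 11, 16, 13, 10]  # hypothetical summed CALI scores
beliefs = [78, 85, 70, 88, 74, 90, 82, 69]  # hypothetical summed MTABI scores
practices = [3, 4, 2, 4, 3, 5, 4, 2]        # hypothetical ordinal ratings

r, p = pearsonr(literacy, beliefs)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")

rho, p = spearmanr(literacy, practices)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```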


Chapter 4

Results

Nearly 20,000 inservice K-12 music educators were solicited for study participation via

the National Association for Music Education’s (NAfME) Research Survey Assistance program

in March 2020. A follow-up email was not distributed after NAfME halted all research assistance

on March 18th due to the COVID-19 pandemic. Of the NAfME members who received the

invitation email, 247 educators accessed the link describing the study, 108 provided informed

consent and were randomly assigned (through a feature within Qualtrics) to either the control or

the intervention group, 74 completed the pretest in full, and 43 completed the study in full. Random assignment produced approximately equal groups.

I analyzed data using IBM SPSS Statistics (Version 26.0). First, I sought to establish the

extent to which the control and intervention groups were both representative and equivalent in

terms of participant characteristics. Second, I estimated the reliability of the Classroom

Assessment Literacy Inventory (CALI) and the Music Teacher Assessment Beliefs Inventory

(MTABI), and also conducted item-level difficulty and discrimination analyses for the CALI.

Third, I produced descriptive statistics for participants’ demographic features and responses to

each of the three measures: the CALI, the Music Teacher Assessment Implementation Inventory

(MTAII), and the MTABI. Finally, I conducted difference testing (i.e., MANOVA, Mann-

Whitney U) to answer the research questions. This chapter is organized into the four

sections described above. The research questions addressed group differences over time and

relations among assessment literacy, beliefs, and practices; thus, all analyses associated with the

major research questions were based upon the total number of participants who completed the

study.


Participant Demographics

Participant characteristics can be found in Table 4.1. A volunteer sample of forty-three

music educators from a target population of roughly 20,000 NAfME members currently teaching

music in K-12 schools completed the requirements of the study. Of those, 25 (58.1%)

participants represented the control group, and 18 (41.9%) the intervention group. Twenty-seven

of the participants were female (62.8%), and 16 were male (37.2%); within the respective

experimental groups, there were 15 females and 10 males in the control group, and 12 females

and six males in the intervention group. Using a chi-square test of independence, I found no significant association between assigned group and gender [χ²(1) = 0.19, p = .655]. A non-significant association suggests the groups were equivalent on this characteristic.
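As a point of reference, this test can be reproduced with SciPy from the gender counts reported above; a minimal sketch, not the SPSS procedure used in the study (correction=False matches the uncorrected statistic):

```python
from scipy.stats import chi2_contingency

# Gender counts reported above: control 15 F / 10 M, intervention 12 F / 6 M.
table = [[15, 10],
         [12, 6]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")  # chi2(1) = 0.20, p = .655
```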

Within this sample, 37 of the participants identified as Caucasian or non-Hispanic (86%),

while the remaining six identified as one of the four other ethnicities. In the control group, 22

participants identified as Caucasian, one as Hispanic or Latinx, one as American Indian, and one

as bi- or multi-racial; within the intervention group, 15 identified as Caucasian, one as Black or

African American, and two as bi- or multi-racial. Using a 2 x 2 chi-square test of independence

with races and ethnicities collapsed to Caucasian and non-Caucasian, I found no significant

association between assigned group and race/ethnicity [χ²(1) = 0.19, p = .663].

Of the 43 participants, 19 taught elementary grades (44.2%), one taught grades K-8

(2.3%), eight taught exclusively middle school grades (18.6%), ten taught a mixture of grades 6-

12 (23.3%), four taught exclusively high school grades (9.3%), and one taught all K-12 grade

levels (2.3%). Within the control group, 13 participants taught some combination of elementary

grades, four taught middle school grades, and eight taught a mixture of middle and high school

grades. In the intervention group, eight taught mostly elementary grades, four taught middle


school grades, and six taught mostly high school grades. After collapsing these descriptions into

“elementary”, “middle”, and “high” levels to ensure cells had a minimum count of five, I found

no significant association between the assigned group and grade levels taught [χ²(2) = 0.35, p =

.840].

Participants taught a mixture of musical subjects. Fifteen participants taught chorus or

vocal courses (34.9%), 16 taught band (37.2%), six taught orchestra (14%), nine taught an

instrumental ensemble other than band or orchestra (e.g., guitar, jazz band, marching band, etc.;

20.9%), 27 taught general music (62.8%), two taught music appreciation (4.7%), two taught

music theory (4.7%), and three taught in a visual and performing arts program (7%). Due to the

number of courses present, I could not collapse descriptors into a reasonable number to conduct a

chi-square analysis; however, groups had comparable counts of the three most prominent

ensemble courses (i.e., band, chorus, and orchestra), alternative ensembles, and other courses.

There was a large range of teaching experience within this sample, from 2 to 39 years (M = 14.37, SD = 9.14). Within the control group, participants averaged 15.20 years of experience

(SD = 8.99), and those in the intervention group averaged 13.22 years of experience (SD = 9.48).

Using an independent samples t-test, I found no significant difference between experimental

groups based upon participants’ average years of teaching experience [t(41) = .002, p = .491].
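This comparison can be reproduced from the summary statistics reported above using SciPy; a sketch, not the SPSS output:

```python
from scipy.stats import ttest_ind_from_stats

# Group means, SDs, and sizes as reported above (equal variances assumed).
t_stat, p_value = ttest_ind_from_stats(mean1=15.20, std1=8.99, nobs1=25,
                                       mean2=13.22, std2=9.48, nobs2=18)
print(f"t(41) = {t_stat:.2f}, p = {p_value:.3f}")  # t(41) = 0.70, p = .490
```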

With regard to educational credentials, 13 participants reported holding a Bachelor’s degree (30.2%), 18 held a Master’s degree (41.9%), 11 had acquired an additional 30 credits

after a Master’s degree (25.6%), and one had completed a Doctoral degree (2.3%). I collapsed

the post-baccalaureate degree category to include the Master’s, “+30”, and Doctoral degrees, and found no significant association between assigned group and level of education [χ²(1) = 0.09, p =

.766].


Finally, I asked participants five questions about their prior experiences with assessment,

and general preparedness to teach music and assess student learning. Only five participants (11.6%) reported having taken a specific course devoted to assessment in their undergraduate coursework. On a six-point Likert-type scale ranging from “Very Unprepared” to “Very Prepared”, most participants reported feeling prepared to be a music teacher after their undergraduate coursework (72.1%), while only 18 (41.9%) reported feeling prepared to assess student learning. A majority of participants also reported that they had never taken a workshop on assessment (67.7%) or any kind of course on assessment following their undergraduate coursework (76.7%).

Table 4.1. Descriptive Statistics for Participant Demographics (N = 43)

Experimental Group: Control Group, n = 25 (58.1%); Intervention Group, n = 18 (41.9%)

Gender: Female, 27 (62.8%); Male, 16 (37.2%)

Race & Ethnicity: Caucasian or Non-Hispanic, 37 (86.0%); Black or African American, 1 (2.3%); Hispanic or Latinx, 1 (2.3%); American Indian or Alaska Native, 1 (2.3%); Biracial or multi-racial, 3 (7.0%)

Grade Level: Elementary, 19 (44.2%); Elementary + Middle, 1 (2.3%); Middle, 8 (18.6%); Middle + High, 10 (23.3%); High, 4 (9.3%); K12, 1 (2.3%)

Courses Taught: Vocal/Choral, 15 (34.9%); Instrumental Band, 16 (37.2%); Instrumental Orchestra, 6 (14.0%); Instrumental Other, 9 (20.9%); General Music, 27 (62.8%); Music Appreciation, 2 (4.7%); Music Theory, 2 (4.7%); Visual & Performing Arts, 3 (7.0%)

Educational Background: Bachelor's degree, 13 (30.2%); Master's degree, 18 (41.9%); Master's +30, 11 (25.6%); Doctoral degree, 1 (2.3%)

Years Teaching Experience: 1-10, 16 (37.2%); 11-20, 19 (44.2%); 21-30, 6 (13.9%); 31-40, 2 (4.7%)


Reliability and Item Analysis

Given the importance of research measure fidelity in providing a trustworthy test of an

educational intervention, I estimated the reliability of the adapted Classroom Assessment

Literacy Inventory (CALI) for a music education context (Mertler, 2004), and confirmed the

reliability of the Music Teacher Assessment Beliefs Inventory (Austin & Russell, 2017). I did not

evaluate the reliability of the Music Teacher Assessment Implementation Inventory (MTAII),

because the measure served a primarily descriptive purpose (i.e., collected scaled frequency data,

rather than participants’ level of agreement with statements, or attitudes). For the purposes of

reliability and item analysis, I utilized the pretest responses of the 43 music educators who

completed the requirements of the study.

CALI Reliability and Item Analysis

In prior studies, researchers have used the Kuder-Richardson 20 (KR20) index to estimate

the reliability of assessment literacy scores measured by the CALI and its related forms (i.e.,

TALQ, ALI). The KR20 index is used to calculate the internal consistency of dichotomously

scored items -- items that may be scored correctly or incorrectly (Thompson, 2010). Normally,

KR20 values range from 0 to 1, with higher values representing a more internally consistent

instrument, and values of .70 or higher considered adequate for research purposes. It is important

to note, however, that this standard is commonly understood to apply to instruments with 50 or

more items of homogeneous difficulty (Thompson, 2020, p. 668). Finally, the KR20 index can be interpreted directly as an estimate of the proportion of score variance that did not result from error.
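For reference, KR20 can be computed as (k / (k − 1)) × (1 − Σpq / σ²), where k is the number of items, p and q are the proportions answering each item correctly and incorrectly, and σ² is the variance of total scores. The sketch below implements this formula on simulated 0/1 data; because the data are random, the estimate will hover near zero.

```python
# Minimal sketch of the KR20 computation for dichotomously scored items.
# `scores` is a hypothetical respondents-by-items matrix of 0/1 values,
# not the study's CALI responses.
import numpy as np

def kr20(scores: np.ndarray) -> float:
    k = scores.shape[1]                         # number of items
    p = scores.mean(axis=0)                     # proportion correct per item
    q = 1 - p                                   # proportion incorrect per item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(43, 20))      # 43 respondents x 20 items
print(f"KR20 = {kr20(scores):.2f}")
```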

Impara et al. (1993) reported the TALQ reliability estimate as .54 for a national sample of

555 inservice teachers, and Campbell (2002) used the ALI with a convenience sample of 220

preservice teachers, reporting a reliability estimate of .74. In a 2004 study of 67 preservice


teachers and 101 inservice teachers, Mertler reported KR20 estimates of .74 for preservice

teachers and .44 for inservice teachers on the CALI, which he interpreted as “comparable” to

prior researchers’ reliability estimates.

The original CALI measured seven facets of assessment literacy, corresponding to the

seven Standards for Teacher Competence in the Educational Assessment of Students (STCEAS),

using a total of 35 items. In my adapted measure, I reduced the number of facets to four, and a

total of 20 items. I used the KR20 index to evaluate the internal consistency of my adapted CALI

measure. For the sample of 43 music educators who completed all requirements of the study the

KR20 reliability estimate was .29. Such a result could indicate poor internal consistency of the

measure. To investigate possible reasons for poor internal consistency I conducted three further

analyses: correlations among scores for individual CALI items, 5-item standards, and the 20-item

CALI measure; difficulty indices for individual items; and discrimination indices for individual

items.

Correlations

I computed scores for each standard by summing the number of items correct out of five

items for each standard. I also summed the items correct across the four standards to arrive at a

total CALI score. I calculated bivariate correlations between scores on each standard and total

CALI scores (Table 4.2). All correlations between total CALI scores and scores for each standard

were of moderate magnitude (r = .42 to .67) and significant at p < .01. Thus, it appeared that

participants’ performance within each standard moderately correlated with their overall test

performance. Next, I correlated participants’ scores on individual items against the total score for

the corresponding standard (Table 4.2). Across all four standards, scores for at least three items

were moderately and significantly correlated to total scores for the standard. Item-standard


correlations were weakest for Item 3.2 and Items 4.2 and 4.5. Based upon these correlations, it is

possible to conclude that there is reasonable alignment from item level responses to scores for

CALI standards.

Table 4.2. Correlations of Item Scores with CALI Standards (N = 43)

              Pretest Total   Item 1   Item 2   Item 3   Item 4   Item 5
Standard 1        .42**        .45**    .66**    .38*     .50**    .51**
Standard 2        .67**        .36*     .46**    .46**    .36*     .70**
Standard 3        .64**        .58**    .22      .50**    .59**    .42**
Standard 4        .57**        .42**    .16      .42**    .70**    .21

**Significant at p < .01 (2-tailed).
*Significant at p < .05 (2-tailed).

Difficulty and Discrimination Indices

To provide additional perspective on the psychometric quality of CALI items, I conducted

difficulty and discrimination analyses (Table 4.3). A difficulty index is essentially the percentage

of respondents that answered an item correctly, expressed as a decimal. A discrimination index is

a measure comparing the percentage of correct responses from the highest and lowest scoring

groups to evaluate if the item discriminates between the respondents scoring highest and lowest

overall (Salkind, 2018, p. 166). The formula for calculating a discrimination index is d = (Nh − Nl) / (pT), where d is the discrimination index, Nh is the number of respondents in the high scoring group who answered the item correctly, Nl is the number of respondents in the low scoring group who answered the item correctly, p is the percentage threshold the researcher selects to define the high and low scoring groups, and T is the total number of responses for the item. Like a Pearson correlation, the index can range from −1 to +1, with values near 0 representing no discrimination and positive values representing instances where more respondents in the high scoring group than in the low scoring group answered correctly (p. 167). Thus, the discrimination of an item is constrained by its difficulty; an item can only have perfect discrimination if all of those in the highest scoring group answered correctly and all of those in the lowest scoring group answered incorrectly (p. 169). I selected a threshold of the highest and lowest scoring 20% of the sample (n = 16).
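The sketch below computes both indices on simulated 0/1 item data under one common formulation: the difficulty index D as the proportion answering correctly, and the discrimination index d as the difference in correct counts between the top- and bottom-scoring groups divided by the group size (approximately pT). The data and threshold applied to them are illustrative only.

```python
# Minimal sketch of item difficulty (D) and discrimination (d) indices
# using top/bottom scoring groups. `scores` is hypothetical 0/1 data.
import numpy as np

def item_indices(scores: np.ndarray, threshold: float = 0.20):
    totals = scores.sum(axis=1)                   # total score per respondent
    n_group = max(1, int(round(threshold * len(scores))))
    order = np.argsort(totals)
    low, high = order[:n_group], order[-n_group:]
    difficulty = scores.mean(axis=0)              # D: proportion correct
    # d: (correct in high group - correct in low group) / group size
    discrimination = (scores[high].sum(axis=0) - scores[low].sum(axis=0)) / n_group
    return difficulty, discrimination

rng = np.random.default_rng(1)
scores = rng.integers(0, 2, size=(43, 20))
D, d = item_indices(scores)
print(np.round(D[:5], 2), np.round(d[:5], 2))     # first five items
```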

Table 4.3. Difficulty (D) & Discrimination (d) Indices

Standard 1      D      d     Standard 3      D      d
  1.1         0.98   0.00      3.1         0.81   0.10
  1.2         0.58   0.00      3.2         0.93   0.00
  1.3         0.93   0.00      3.3         0.53   0.20
  1.4         0.84   0.10      3.4         0.51   0.20
  1.5         0.26   0.05      3.5         0.95   0.00
Standard 2                   Standard 4
  2.1         0.88   0.10      4.1         0.42   0.10
  2.2         0.19   0.15      4.2         0.86   0.00
  2.3         0.63   0.15      4.3         0.42   0.20
  2.4         0.98   0.00      4.4         0.40   0.20
  2.5         0.44   0.20      4.5         0.09   0.00

Overall, it appears that there were a number of easier items (i.e., 9 of the 20 items had a

difficulty index >.80) that likely constrained the discriminating power of the CALI, and possibly

contributed to the lack of internal consistency as well. It is important to note that item difficulty and

discrimination indices are not always accurate representations of item quality. A certain number

of difficult and easy items may be needed to adequately sample the full range of assessment

literacy among a group of music teachers. Item difficulty and discrimination indices also are

sensitive to the type and number of individuals being tested, as well as random error (e.g., item

ambiguity, clues, or other technical defects).


The two most difficult items for participants (Item 2.2 and 4.5) had limited to no ability

to discriminate between the highest and lowest performing respondents. I performed a distractor

analysis to determine which of the foils participants selected the most frequently. Item 2.2

(Appendix F) asked respondents to determine which strategy would increase the reliability of a

test. Participants most frequently (and incorrectly) selected the first foil, which more accurately described a strategy for increasing the validity of a test. Thus, I determined that this item

was difficult because participants might not have known the difference between reliability and

validity. Item 4.5 asked respondents to determine which factor might invalidate comparisons

between scores on standardized tests. Participants almost exclusively selected the first foil, which

speculated that scores might differ on tests in districts more aligned to the standards upon which

the test was based. I determined that participants likely answered this incorrectly because they

did not understand the difference between an invalid comparison of scores and a valid

comparison of scores. A detailed distractor analysis can be found in Appendix K.

MTABI Reliability Analysis

I utilized Austin and Russell’s (2017) instrument for this study; however, rather than

create subscale scores based on factor analysis results, I reverse coded negatively phrased items

and then summed responses across all items to create total MTABI scores reflecting the extent to

which participants adopted a positive orientation toward assessment as an important, valuable,

and trustworthy aspect of music teaching and learning. Upon request, Austin provided

Cronbach's α for their data (N = 406), after reverse coding to match the specifications of this

study; he reported strong reliability for the measure (α = .92). I used the same procedure to

establish the internal consistency of the 17 items for the 43 participants who completed the

pretest, and found their scores to be highly consistent (α = .89).
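A minimal sketch of this procedure, reverse coding hypothetical negatively phrased items on a 6-point scale and then computing Cronbach's α, follows; the response matrix and item indices are placeholders rather than the MTABI data.

```python
# Minimal sketch: reverse code negatively phrased items (1..6 -> 6..1),
# then compute Cronbach's alpha. All values here are simulated.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of sum scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
responses = rng.integers(1, 7, size=(43, 17)).astype(float)  # 43 x 17 items
negative_items = [3, 4, 6, 7, 8, 10, 14, 16]                 # hypothetical indices
responses[:, negative_items] = 7 - responses[:, negative_items]
print(f"alpha = {cronbach_alpha(responses):.2f}")
```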


Participant Attrition

I would be remiss not to note the possible impact of participant attrition on study results;

especially in light of the study occurring during the beginning of the first wave of the COVID-19

pandemic. I initially recruited 108 participants. Of that number, only 74 completed the pretest

measure (31 in the control group, and 43 in the intervention group). From March 18th, when the

pretest was first available, to March 23rd, when the intervention began, 18 participants enrolled

in and later completed the professional development workshop. The posttest was distributed on

April 2nd and closed on April 27th; a total of 43 participants completed the measure and the

study (25 in the control group, and 18 in the intervention group). Thus, over the course of the

study, from March 18th until April 27th, approximately 42% of participants dropped out of the

study, with, as might be expected, a greater attrition rate for the intervention group (58%) than

for the control group (19%). To determine whether attrition differentially impacted intervention

group teachers as compared to control group teachers, I conducted a 2 x 2 factorial ANOVA,

with assigned group (intervention, control) and study completion status (completed, not

completed) as the independent variables, and pretest CALI scores as the dependent measure.

There was no significant interaction (F = .27, p = .602) or main effects (F = .13, p = .718, for

assigned group, F = .36, p = .553 for study completion status), which suggests that attrition did

not differentially affect the groups in terms of the assessment literacy they exhibited at the

beginning of the study. I then repeated this analysis with pretest MTABI sum scores as the

dependent measure. Again, I found no significant interaction (F = .07, p = .797) or main effects

(F = .86, p = .358, for assigned group; F = .00, p = .949, for study completion status). Thus, I concluded that attrition was not differentially related to participants' pretest assessment beliefs.
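A sketch of this attrition screen, using statsmodels to fit a 2 x 2 factorial ANOVA on simulated pretest scores, appears below; the column names and generated values are assumptions, not the study data.

```python
# Minimal sketch of the 2 x 2 factorial ANOVA screening for differential
# attrition. The DataFrame is simulated; `cali_pre` stands in for pretest
# CALI scores.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
n = 74
df = pd.DataFrame({
    "group": rng.choice(["control", "intervention"], size=n),
    "completed": rng.choice(["yes", "no"], size=n),
    "cali_pre": rng.integers(6, 18, size=n),
})

model = ols("cali_pre ~ C(group) * C(completed)", data=df).fit()
print(anova_lm(model, typ=2))   # main effects and interaction
```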


Descriptive Statistics for the CALI, MTAII, and MTABI

Using SPSS 26, I conducted descriptive analyses of the 43 participants’ responses on the

pretest and posttest CALI, MTAII, and MTABI measures. Results are reported in Tables 4.4, 4.5, and 4.6. These descriptive analyses were needed to evaluate whether the data met the assumptions required for further inferential statistical analyses.

CALI Descriptives

Participants responded to 20 multiple-choice items. After data collection, items were

recoded to reflect the dichotomous (i.e., correct or incorrect) nature of the questions, using "1" for

correct responses and “0” for incorrect responses. Subsequently, data were summed for the total

measure, and each of the four STCEAS standards to which they corresponded. I ran descriptive

statistics on these summed scores; the means and standard deviations are reported in Table 4.4.
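For illustration, the recoding-and-summing step can be carried out as in the following sketch; the answer key, item labels, and responses are hypothetical.

```python
# Minimal sketch: recode multiple-choice responses to 0/1 against an
# answer key, then sum into four 5-item standard scores and a total.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
items = [f"item_{i + 1}" for i in range(20)]
responses = pd.DataFrame(rng.choice(list("ABCD"), size=(43, 20)), columns=items)
key = pd.Series(rng.choice(list("ABCD"), size=20), index=items)  # hypothetical key

correct = responses.eq(key).astype(int)            # 1 = correct, 0 = incorrect
for s in range(4):                                 # four 5-item standards
    correct[f"standard_{s + 1}"] = correct[items[s * 5:(s + 1) * 5]].sum(axis=1)
correct["total"] = correct[items].sum(axis=1)
print(correct[["standard_1", "standard_2", "standard_3", "standard_4", "total"]].mean())
```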

Table 4.4. CALI Pretest and Posttest Descriptive Statistics (N = 43)

                         Pretest                          Posttest
               Mean    SD    Skew  Kurtosis      Mean    SD    Skew  Kurtosis
Standard 1     3.58   0.91  -0.46    0.43        3.95   0.69   0.06   -0.19
Standard 2     3.12   0.93   0.13   -0.30        3.26   0.82   0.03   -0.80
Standard 3     3.74   0.93  -0.77    0.75        3.93   0.83  -0.13   -0.60
Standard 4     2.19   0.88   0.72    0.10        2.14   1.10  -0.18   -0.93
Total Score   12.63   2.09   0.36    6.19       13.28   1.92  -0.29   -0.19

The general pattern is one of means increasing and standard deviations decreasing over time,

except for Standard 4 (using assessment results to make educational decisions), where the

posttest mean was smaller, and the standard deviation was larger. Further, music teachers

generally appeared to have lower levels of literacy knowledge surrounding Standards 2

(developing and implementing appropriate assessments) and 4.


MTAII Descriptives

I designed this measure after consulting prior researchers' examination of music

teachers’ assessment practices. Within the literature, researchers conceptualized assessment

practices in two ways: by the specific forms of assessment used (e.g., tests, classwork,

performances, etc.) and the purposes for which these assessments were used (e.g., summative,

formative, diagnostic, etc.). I attempted to account for both form and function by asking music

educators to report the frequency with which they used specific forms of assessment and how

often they used assessments to serve varied functions. The scale ranged from 1-6, with options

“Never”, “Less than Once Per Week”, “Once Per Week”, “Several Times Per Week”, “Nearly

Every Day”, and “Always.” Descriptive statistics for the MTAII are reported in Table 4.5.

Table 4.5. MTAII Pretest and Posttest Descriptive Statistics (N = 43)

                                  Pretest          Posttest
                                 x̄      SD        x̄      SD
Forms
  Participation                 4.18   1.45      3.68   1.48
  Group Performances            3.78   1.22      2.94   1.63
  Individual Performances       2.94   1.13      2.72   1.11
  Attendance                    3.14   1.90      2.52   1.76
  Written Classwork             1.92   0.85      2.50   0.97
  Projects                      2.08   1.18      2.16   0.93
  Written Tests & Quizzes       1.74   0.78      1.82   0.69
  Portfolios                    1.22   0.47      1.40   0.93
Purposes
  Formative                     3.70   1.28      3.72   1.34
  Diagnostic                    3.14   1.49      3.00   1.40
  Extramusical                  2.44   1.50      2.10   1.18
  Summative                     2.16   0.84      1.98   0.74
  Placement                     1.50   0.91      1.50   0.74

*Frequency scale from 1-6 (Never to Always).
**Bolded posttest means highlight increases from pretest means.


Within the eight specific assessment forms comprising music teacher practices,

participants generally reported more frequent use of written tests and quizzes, homework,

projects, and portfolios on the posttest. Participants also reported using individual performances,

group performances, attendance, and participation less frequently in their assessment of student

learning. With regard to the functions of assessment, participants self-reported using fewer

summative, diagnostic, and extramusical assessments, and approximately the same number of

assessments for formative and placement purposes.

MTABI Descriptives

As previously discussed, this measure was adapted from Austin and Russell’s (2017)

study. Participants used a 6-point Likert-type scale with response options ranging from "Strongly Disagree" to "Strongly Agree" to rate their level of agreement with 17 statements comprising

possible beliefs about the value and use of assessments. After data collection, I reverse coded

negatively-phrased items. The means and standard deviations (pretest and posttest) for the 17

items completed by study participants are reported in Table 4.6.

Table 4.6. MTABI Pretest and Posttest Descriptives (N = 43)

                                                                    Pretest        Posttest
                                                                   x̄     SD       x̄     SD
Assessment is an important music teacher responsibility           5.05  0.98     5.16  0.75
Assessment and instruction can be seamlessly integrated           4.86  0.99     5.12  0.91
Assessment helps music teachers to be effective                   5.00  0.87     5.00  0.85
Assessment has little impact on music teaching⍑                   4.86  1.21     5.00  0.87
Assessment forces music teachers to contradict their beliefs⍑     4.30  1.32     4.81  0.98
Assessment consistently provides useful information               4.72  0.93     4.70  0.74
Assessment results are rightfully ignored by most music teachers⍑ 4.56  1.03     4.67  1.06
Assessment reduces music teacher creativity⍑                      4.47  1.33     4.65  1.02
Assessment causes music teachers to be conformists⍑               4.19  1.26     4.56  1.05
Assessment results are of great use to music teachers             4.30  1.41     4.51  1.06
Assessment interferes with teaching⍑                              3.98  1.32     4.47  1.10
Assessment results are dependable                                 4.09  0.81     4.37  0.79
Assessment results are trustworthy                                4.12  0.79     4.35  0.78
Assessment helps music teachers treat their students fairly       4.02  1.32     4.26  1.22
Assessment results are often inaccurate⍑                          4.09  1.13     4.05  0.87
Assessment typically provides precise information                 3.74  0.98     3.95  0.95
Assessment results are prone to error⍑                            3.70  1.06     3.86  0.94
Total                                                            74.05 11.59    77.49  9.30

*Agreement scale from 1-6.
**Bolded posttest means indicate an increase from pretest means.
⍑ Negatively phrased items that were reverse coded after data collection.

After reverse coding items and comparing pretest and posttest descriptives, I found

participants began and ended the experiment with an overall positive orientation toward

assessment. While music teachers reported higher levels of agreement with 15 of the 17

statements, these changes were nominal.

Research Questions

The research questions for this investigation were:

1. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment literacy?

2. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment beliefs?

3. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment practices?

4. Are there significant relationships between music teachers’ assessment literacy, beliefs,

and practices?


In the following section, I describe how I met the assumptions necessary for the

parametric and nonparametric statistical analyses I employed, and the results of those analyses.

Multivariate Analysis of Assessment Literacy and Beliefs

To determine whether a four-week online professional development intervention had a

significant effect on music teachers’ assessment literacy and their beliefs about the value of

assessment, I employed a multivariate analysis of variance (MANOVA), with assessment

literacy and belief gain scores (i.e., the change in participants’ scores from pretest to posttest)

serving as the dependent variables. As discussed in Chapter 3, I considered this analysis the

most appropriate procedure for answering the research question -- as opposed to repeated

measures MANOVA or a MANCOVA.

The assumptions for a MANOVA are (Russell, 2018, p.131):

1. The data collected for the dependent variables are continuous, rather than categorical.

2. The data collected for the independent variable are categorical rather than continuous.

3. Each observation is independent of any other observation.

4. All of the dependent variables are normally distributed themselves, and any combination

of dependent variables is normally distributed (multivariate normality).

5. Each of the dependent variables has equal variance when compared to each independent

variable.

6. There is a linear relationship between independent and dependent variables.

All assumptions for this analysis were met. The first three assumptions were met through

study design. The fourth assumption was met through analyzing descriptive statistics, a Shapiro-

Wilk normality test, and visual inspection of Q-Q plots in SPSS 26. I determined that two cases

in my data set were outliers in the CALI difference score variable and used the “Select Cases”


function to omit them from subsequent analyses. As a result, my sample size for this analysis

was 41, with 24 participants in the control group and 17 in the intervention group. After retesting

my dataset, I determined that I met the assumptions for normality. The fifth assumption was met

by conducting a Box’s M test (Box M = 1.56, p = .669). Because the test was not significant, I

determined that each of the dependent variables had equivalent variance in relation to the

independent variable. The sixth assumption was met by generating a scatter plot matrix with

Loess lines for each level within the independent variable, and visually inspecting the matrices

for similar features.
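The univariate portion of this screening can be sketched as follows, with Shapiro-Wilk tests and a simple z-score outlier check on simulated gain scores; the column names are stand-ins for the CALI and MTABI difference scores.

```python
# Minimal sketch of normality screening and outlier removal on gain
# scores. The data are simulated, not the study's difference scores.
import numpy as np
import pandas as pd
from scipy.stats import shapiro, zscore

rng = np.random.default_rng(5)
df = pd.DataFrame({"cali_gain": rng.normal(0.7, 2.0, 43),
                   "mtabi_gain": rng.normal(2.0, 8.0, 43)})

for col in df:
    W, p = shapiro(df[col])
    print(f"{col}: W = {W:.3f}, p = {p:.3f}")

# Flag cases more than 3 SD from the mean as candidate outliers
# (one simple analogue of SPSS's case-selection step).
outliers = (df.apply(zscore).abs() > 3).any(axis=1)
print(f"retained {len(df[~outliers])} of {len(df)} cases")
```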

MANOVA Results

Using the gain scores on the two dependent measures (i.e., assessment literacy and

beliefs) as the dependent variable set, I conducted a MANOVA. I found the overall model to be

significant (Λ = .785, F = 5.21, p = .01, ηp² = .22). To determine which mean differences

contributed to the overall significant multivariate outcome, I subsequently conducted univariate

ANOVA tests. I met the equality of variance assumption for both assessment literacy change

scores [F(1, 39) = 1.255, p = .270], and assessment belief change scores [F(1, 39) = .252, p =

.618]. Based on the univariate ANOVA tests, I determined that there was significant group

difference for assessment literacy change scores [F(1) = 7.731, p = .008, ηp² = .17], but not for

assessment belief change scores [F(1) = 1.580, p = .216]. Participants in the intervention group

(n = 17, x̄ = 1.41, SD = 2.15) exhibited significantly greater growth in assessment literacy over

time than their peers in the control group (n= 24, x̄ = -.25, SD = 1.67). There was not a

significant change in assessment beliefs between the intervention (n = 17, x̄ = 3.94, SD = 8.26)

and control (n = 24, x̄ = .83, SD = 7.46) groups over time. Figures 4.1 and 4.2 depict the pretest

and posttest scores by assigned group for the literacy and beliefs measures, respectively.
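A sketch of this analysis with statsmodels follows. The data are simulated using the group means and standard deviations reported above, so the column names and distributions are illustrative rather than the study's raw scores.

```python
# Minimal sketch of a MANOVA on gain scores, followed by univariate
# follow-up ANOVAs. Data are simulated from the reported group statistics.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "group": np.repeat(["control", "intervention"], [24, 17]),
    "cali_gain": np.concatenate([rng.normal(-0.25, 1.67, 24),
                                 rng.normal(1.41, 2.15, 17)]),
    "mtabi_gain": np.concatenate([rng.normal(0.83, 7.46, 24),
                                  rng.normal(3.94, 8.26, 17)]),
})

print(MANOVA.from_formula("cali_gain + mtabi_gain ~ group", data=df).mv_test())
for dv in ("cali_gain", "mtabi_gain"):            # univariate follow-ups
    print(anova_lm(ols(f"{dv} ~ group", data=df).fit(), typ=2))
```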


Figure 4.1. Pretest to Posttest Mean Literacy Scores

Figure 4.2. Pretest to Posttest Mean Belief Scores


Nonparametric Analyses of Assessment Practices

To answer the third and fourth research questions, it was necessary to utilize

nonparametric analyses. This is because participant change scores on the MTAII did not meet the

assumptions required to perform parametric analyses. It was apparent that participants

conceived of the frequency scale as ordinal in nature (i.e., that the distances between points on

the frequency scale were not equal).

Thus, I employed a Mann-Whitney U test to determine whether differences existed

between groups for assessment form and assessment function/purposes (i.e., the third research

question). For this analysis, I utilized all 43 participants because the Mann-Whitney U test does not require normally distributed data.

The assumptions for a Mann-Whitney U test are (Russell, 2018, p. 270):

1. The independent variable (or grouping variable) is dichotomous. That is, the independent variable reflects membership in one of two groups or categories.

2. The data collected for the dependent variable are ordinal or continuous rather than

categorical.

3. Each observation is independent of any other observation (most often accomplished

through random sampling).

All assumptions were met by virtue of the study design and type of data collected.
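A minimal sketch of one such test with SciPy appears below; the change-score arrays are hypothetical placeholders for a single MTAII variable.

```python
# Minimal sketch of a Mann-Whitney U test on one assessment-practice
# change score (hypothetical values, not participant data).
from scipy.stats import mannwhitneyu

control_change = [0, -1, 1, 0, 2, -2, 0, 1]
intervention_change = [1, 0, 2, -1, 0, 1, 3, 0]

U, p = mannwhitneyu(control_change, intervention_change, alternative="two-sided")
print(f"U = {U:.1f}, p = {p:.3f}")
```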

In order to answer the fourth research question, I used a Pearson correlation to analyze

the relationship between music teachers’ (N = 43) assessment literacy and beliefs, and a

Spearman’s Rho analysis to explore relationships between music teachers’ assessment literacy

and practices, and assessment beliefs and practices. I used pretest and posttest data for all

analyses. The assumptions for a Pearson correlation are like those of a Spearman’s Rho analysis


and differ only in that all data must use scale-level measurement. The assumptions for a

Spearman’s Rho analysis are (Russell, 2018, p. 291):

1. The variables need to be ordinal or continuous (ratio or interval).

2. Whereas in parametric correlation the relationship between the two variables should be

linear, the relationship between any two variables being examined through Spearman

correlation should be monotonic.

The study design and data collected met these assumptions. It should be noted that monotonic relationships require only that as one variable increases, the other consistently increases or consistently decreases, whereas linear relationships additionally require that this change occur at a constant rate (i.e., that the relationship can be described by a straight line).
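The contrast between the two coefficients can be sketched as follows, on simulated literacy, belief, and practice scores; the variable names and generating assumptions are illustrative only.

```python
# Minimal sketch: Pearson r for two continuous sum scores, Spearman's rho
# for a continuous score against an ordinal frequency rating.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(7)
cali = rng.normal(13, 2, 43)                  # continuous literacy scores
mtabi = 0.3 * cali + rng.normal(0, 2, 43)     # related belief scores
practice = rng.integers(1, 7, 43)             # ordinal 1-6 frequency ratings

r, p_r = pearsonr(cali, mtabi)                # assumes linearity
rho, p_rho = spearmanr(cali, practice)        # assumes only monotonicity
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```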

Mann-Whitney U Test Results

The results of this test can be found in Table 4.7. I found no significant differences between groups in change scores for participants' self-reported use of specific assessment forms or for the purposes those assessments served. Change scores were nominal and eclipsed by the variance in participants' scores.

Table 4.7. Mann-Whitney U Test Results on Assessment Practice Mean Change Scores (N = 43)

                                     U        Z       p
Forms
  Written Tests & Quizzes         182.00    -1.13    .261
  Written Classwork & Homework    219.50    -0.14    .886
  Group Performances              162.00    -1.58    .114
  Individual Performances         207.50    -0.45    .651
  Projects                        176.00    -1.28    .199
  Portfolios                      185.00    -1.19    .236
  Attendance                      214.50    -0.28    .778
  Participation                   196.50    -0.75    .454
Purposes
  Summative                       196.00    -0.86    .391
  Formative                       190.00    -0.91    .365
  Diagnostic                      189.00    -0.91    .362
  Placement                       221.50    -0.09    .926
  Extramusical                    214.50    -0.27    .791

*Frequency scale from 1-6 (Never to Always).

Pearson Correlation and Spearman’s Rho Results

The results of these analyses can be found in Table 4.8. Using Pearson correlations, I

found a significant – albeit modest to moderate – relationship between the CALI and MTABI

pretest and posttest scores. Participants who scored highly on the CALI pretest and posttest were

also likely to score highly on the MTABI pretest and posttest. Participants who scored poorly on

the CALI pretest and posttest were somewhat likely to score poorly on the MTABI pretest and

posttest.

Using Spearman’s Rho analyses, I found several significant inverse relationships between

CALI posttest scores and participants’ self-reported use of written tests and quizzes, written

classwork, and participation. Participants who scored higher on the CALI posttest were less

likely to report employing written tests and quizzes, written classwork, and participation as

appraisals of students’ learning than peers who scored poorly on the CALI posttest.

I also found several significant relationships between participants’ assessment beliefs on

the MTABI pretest and posttest. Participants who scored highly on the MTABI pretest were

somewhat more likely to self-report using written tests and quizzes. Interestingly, participants

who scored highly on the MTABI posttest did not self-report using written tests and quizzes with

any greater frequency than participants who scored poorly. While MTABI pretest scores were not related to self-reported use of group performances, there was a significant inverse relationship between MTABI posttest scores and self-reported use of group performances. That is, music teachers who indicated a higher

regard for assessment tended to self-report using group performances less frequently than those

who held a lower regard for assessment. This finding was the same for participants’ participation

scores; that is, there was not a relationship between participants’ MTABI pretest scores and self-

reported use of participation, but there was a significant inverse relationship between

participants’ posttest MTABI scores and self-reported use of participation. While there was a

significant inverse relationship between participants’ MTABI pretest scores and self-reported use

of attendance, on the posttest there was virtually no relationship to participants’ self-reported use

of attendance. At the outset of the study, participants with a higher regard for assessment were

less likely to report using attendance as an appraisal of student learning. Given that participants’

belief scores on the MTABI posttest held stable, it appears that attendance was no longer utilized

by high or low scorers. This finding likely had more to do with participants’ transition to digital,

distance learning platforms. Finally, participants who scored highly on the MTABI pretest and

posttest were moderately likely to self-report using assessment for formative purposes.

Table 4.8. Relationships between Assessment Literacy, Beliefs, and Practices (N = 43)

                               CALI Sum Score        MTABI Sum Score
                              Pretest  Posttest     Pretest  Posttest
CALI Sum Score                   –        –          0.33*    0.36*
MTABI Sum Score                0.33*    0.36*          –        –
MTAII Sum Scores
  Written Tests & Quizzes      0.07    -0.38*        0.37*    0.11
  Written Classwork            0.04    -0.33*        0.09     0.14
  Group Performances           0.12    -0.04         0.10    -0.35*
  Individual Performances     -0.01     0.14         0.23    -0.10
  Projects                     0.06     0.08         0.04     0.02
  Portfolios                  -0.27     0.17         0.13     0.13
  Attendance                  -0.07     0.04        -0.31*    0.05
  Participation                0.07    -0.33*       -0.29    -0.36*
  Summative                   -0.05    -0.05        -0.02    -0.07
  Formative                   -0.01     0.09         0.52**   0.38*
  Diagnostic                   0.05     0.23         0.21    -0.20
  Placement                    0.02     0.07         0.20    -0.02
  Extramusical                 0.07    -0.24         0.11     0.21

**Significant at p < .01 (2-tailed).
*Significant at p < .05 (2-tailed).

Feedback from Intervention Participants

I solicited participant feedback about the intervention at the end of the posttest in the

form of five questions:

● Was the online professional development relevant to you as a music teacher?

● Was the online professional development course appropriately challenging or too

difficult?

● What did you like about the online professional development course?

● What would you have changed about the online professional development course to

make it more enjoyable or useful?

● Would you recommend this online professional development course to other music

educators?

I viewed these ancillary questions and intervention group members’ responses as an

important opportunity to contextualize findings, offer insights into future online course designs,

and provide information about what music teachers need and desire from professional

development experiences. This feedback was not subjected to formal analysis, but was used only to aid in

interpretation and forming recommendations that could benefit music teachers directly. Eight

participants, of the 18 enrolled in the intervention, answered these questions; because slightly


fewer than half of the intervention participants provided comments, these should be interpreted

with this limitation in mind. The full set of participant responses can be found in Appendix L.

Question One

All eight participants indicated that the course was relevant to their practice as a music

teacher. Responses ranged in detail from one-word affirmations, to explanations about what they

found particularly useful. One participant noted, “it was very relevant for me as a [sic] jh and hs

choir teacher, and I will be referring back to the chapters as I need.”

Question Two

With regard to the challenge of the course, there was considerable variability in

responses. Four of the respondents felt that the course was appropriately challenging. One

participant did note that specific sections were more challenging than others (“there was only one

section that I got sort of lost, but it wasn’t overly difficult”), but did not provide information

about which section was more challenging for them. Two felt that the course was challenging,

but only because of their current circumstances, in light of the COVID-19 pandemic; one opined,

“I did not have enough time to participate as fully as I would have liked”, and another lamented,

“appropriately challenging for ordinary circumstances, difficulty to manage in what became my

current situation.” One participant thought that the course was “to [sic] easy.” Such variability

was to be expected, both given the circumstances and as a feature of any typical learning environment.

Question Three

With regard to what participants liked about the professional development, there was a

wide range of responses addressing course design, materials, and activities. Most of the feedback

centered on the relevance and accessibility of the readings, especially the primary text by Brian

Shaw, “Music Assessment for Better Ensembles” (2018). One participant said, “I thought there


was a lot of great content in the primary book we were reading and discussing from. I always

love to learn of a new-to-me author or researcher”, and another stated, “I plan on buying the

book that some chapters of the readings were taken from so I can read it in full and mark it up.”

Others felt the online format was appealing and accessible. Several commented that the activities

were helpful, “focused on real world teaching problems”, that discussions were “engaging” and

“interactive”, and that the course allowed them the “chance to practice something like designing

an assessment.” As noted previously, designing a course that music teachers valued was a

personal goal of mine during this study. This feedback was helpful and aligned to other

researchers’ findings about online professional development (Boling et al., 2011; vanOostveen et

al., 2019; Wasserman & Migdal, 2019) and professional development targeted to teachers’

unique contextual factors (Guskey, 2003, 2009).

Question Four

Participants in the intervention also offered constructive criticism about what course

elements could be altered to enhance their experience. Two participants felt that the course was

fine in its current state, and two others only wished that it had not coincided with the COVID-19

pandemic, but acknowledged that “you [the researcher] had zero control of that” because the

course was tied to a dissertation project. One participant wished that the “course was offered

over a longer period. Eight or more weeks would have allowed me the time to participate more

fully.” Two participants offered helpful feedback about the accessibility of course materials; one

wished that readings were also offered in audio or video formats, and another wished there were

slightly fewer readings. Taking participants’ comments into account, I believe future iterations of

this course would benefit from greater use of alternative formats and from emphasizing some

materials over others, perhaps by making some optional.


Question Five

When asked if they would recommend the course to other music teachers, all eight

participants who responded answered in the affirmative, and some were insightful about how

their peers may perceive the course. One noted, “[the course] approaches some negative thinking

and stereotypes that i hear and see from other music teachers in a way that is very clear and

shows alternative approaches, but it may be too [sic] engrained in some people to have a positive

outtake from this pd.” Another stated that “well designed PD in assessment is badly needed”, and

“I think a lot of teachers in my district would benefit from this development.” These comments

aligned with prior findings from other researchers, as well. Teachers do appear to desire

professional development, but they need it to be targeted to their context (Guskey, 2003, 2009).


Chapter 5

Summary and Conclusions

Shifts in educational policy over the last thirty years have increased the demand for

teachers to be proficient in classroom assessment. Yet, teacher educators have been slow to

respond in developing teachers’ assessment competency through alterations to their curriculum

(Darling-Hammond et al., 2002; DeLuca & Klinger, 2011; Gareis & Grant, 2015). However,

researchers have found that preservice music teachers who do receive training in assessment

have more favorable beliefs about assessment and their ability to use it effectively (Austin &

Russell, 2019), and that sustained and relevant professional development can effectively change

inservice teachers’ beliefs surrounding assessment (Huai et al., 2006; Koh, 2011; Mertler, 2009).

To date, no one has examined the effectiveness of using an online professional development

program to enhance music teachers’ assessment literacy, beliefs, and practices.

Therefore, the purpose of this pretest-posttest control group study was to examine the

effects of an online professional development intervention on music teachers’ assessment

literacy, beliefs, and practices. In the spring of 2020, I solicited participation from music

educators with NAfME membership. After two weeks, I obtained informed consent from 108

respondents. A total of 43 participants completed all requirements of the study: 18 in the

intervention group and 25 in the control group. Participants in the intervention group enrolled in

a four-week professional development (PD) focused on increasing music teacher assessment

literacy based upon the first four Standards for Teacher Competence in Educational Assessment

of Students (STCEAS). All participants completed a pretest and posttest consisting of three

measures: the Classroom Assessment Literacy Inventory (CALI), the Music Teacher Assessment


Implementation Inventory (MTAII), and the Music Teacher Assessment Beliefs Inventory

(MTABI).

In this chapter, I will provide a summary of the major findings, situate findings within the

extant literature, offer implications for music teacher PD, discuss limitations of the study, and

provide recommendations for future research.

Summary of Findings

Findings are organized by the four research questions of this study:

1. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment literacy?

2. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment beliefs?

3. Does a four-week online professional development intervention have a significant effect

on music teachers’ assessment practices?

4. Are there significant relationships between music teachers’ assessment literacy, beliefs,

and practices?

Assessment Literacy

Using multivariate analysis, I compared assessment literacy and assessment beliefs

change scores of intervention and control group participants. Given a significant multivariate

outcome, I conducted follow-up univariate tests to determine whether group differences applied

to one dependent variable or both. I found a significant difference between groups’ assessment

literacy scores, with a large effect size; thus, the difference between assigned groups was unlikely due to chance. Intervention group participants, on average, answered one to two more questions correctly (out of twenty, a gain of roughly five to ten percentage points) on the posttest than


they did on the pretest. Control group participants, on average, answered the same number of

questions correctly from pretest to posttest. On average, participants demonstrated the most

growth in responding to items from Standard 1 (selecting appropriate assessments), modest

growth for Standard 2 (designing and implementing assessments), while showing nominal to no

growth on items from Standard 3 (scoring and interpreting assessments) and Standard 4 (making

educational decisions based upon assessment data). Participants were most literate in selecting

assessments as well as scoring and interpreting assessment results, less literate in designing and

implementing assessments, and least literate in making decisions based upon assessment data.

Assessment Beliefs

I did not find significant group (intervention, control) differences in assessment belief

gain scores, although there was a modest descriptive change in music teachers’ assessment beliefs from

the pretest to the posttest. Overall, intervention group participants demonstrated modest growth

in assessment beliefs, while control group participants demonstrated relatively stable assessment

beliefs across time (Appendix M). Participants’ assessment belief scores averaged in the upper

third of possible values, suggesting an overall positive regard for assessment.

Assessment Practices

I found no significant differences between assigned groups for music teachers’ self-

reports of how frequently they used specific forms of assessment or how frequently they used

assessment to serve specific functions.

Relationships between Music Teachers’ Assessment Literacy, Beliefs, and Practices

I found a positive relationship between music teachers’ literacy and belief scores. This

suggests that participants who scored well on the literacy items tended to hold positive beliefs

about the usefulness, value, and trustworthiness of assessment as a basis for educational


reporting and decision making. There were several significant inverse relationships between

music teachers’ literacy scores and their self-reported use of written tests and quizzes, written

classwork, and participation to appraise student performance. There were also significant

relationships between music teachers’ belief scores and their self-reported use of specific

assessment forms (e.g., written tests and quizzes, group performances, attendance, and

participation) and the purpose for which they used assessment (e.g., formative assessments).

Discussion

In this section, I connect major findings to prior literature. Next, I contextualize the major

findings of this study by revisiting and re-imagining McMillan’s (2003) educational decision-

making conceptual map and illustrate how additional factors may shape music teachers’

assessment decision-making. Finally, I explore the role that measurement played in obtaining

these results.

Major Findings

Music Teachers Lack Prior Assessment Training

Only four of the music teachers in this study reported feeling prepared to assess students

after graduation. This finding aligned to Mertler’s (2001) survey of over six hundred inservice

teachers, where roughly the same proportion of respondents reported feeling prepared to assess

student learning after graduation. Consequently, the degree to which preservice training in

assessment moderates inservice music teachers’ assessment beliefs remains to be seen; yet, there

is some evidence that training and coursework for inservice music teachers is associated with

positive assessment beliefs (Austin & Russell, 2017).

While education experts and theorists argue that assessment is part of the instructional

process, it seems that teachers – and music teachers, specifically – do not necessarily conceive of


assessment as an integrated component of instruction. This perception – that assessment is a

significant skillset uniquely different from instruction – is likely partially informed by the

lack of prior assessment training that music teachers receive. When asked if they felt prepared to

be a music teacher following undergraduate study, nearly three quarters of my participants

answered in the affirmative; yet, when asked if they felt prepared to assess student learning, only

one out of ten participants answered in the affirmative. Ludwig (2013) found that inservice

teachers who felt confident about their assessment knowledge were more likely to have prior

training in assessment, and to hold positive beliefs surrounding the accountability purposes of

assessment.

To date, music education researchers have not explicitly investigated assessment literacy

amongst inservice or preservice music teacher populations. I did collect demographic

information about participants’ prior assessment training and terminal degree; my findings were

comparable to Austin and Russell (2016), when they surveyed graduate programs offering music

education degrees. They found that about three out of five institutions offered a stand-alone

assessment course, and of those, three-quarters required master’s students to take the course.

Thus, it was not surprising that of the 30 participants in my study who held master’s degrees or higher,

only four had taken a prior stand-alone course in assessment, and five felt prepared to assess

students when they entered the teaching profession. Further, of the 30 participants in my study

who held master’s degrees or higher, only eleven had attended a prior workshop focused on

assessment. Austin and Russell (2016) did not collect demographic and assessment training

information from inservice music teachers. Yet, it is not surprising to see that when institutions

do not require assessment courses as part of their curriculum for students, few inservice music

teachers report prior coursework focused exclusively on assessment. While this finding does not


account for institutions that embed assessment training within other methods or curriculum

courses, it is nonetheless apparent that current curricular practices did not lead to assessment

literate participants in this study. Given the lack of dedicated coursework and evidence that

professional development is more effective in changing practice than preservice coursework

(Gutierrez, 2014), I believe professional development may hold greater potential for developing

assessment literacy in the music teacher population than preservice coursework.

Online Professional Development Formatting

Participants’ assessment literacy increased following an online intervention; this echoes

findings from researchers who have studied the efficacy of online professional development for

inservice teaching populations. Based upon feedback from intervention participants (n = 8),

music teachers in the present study also appeared to appreciate the use of non-traditional formats

(i.e., delivery mechanisms, materials, and activities not used in face-to-face learning

experiences). Boling et al. (2011) and vanOostveen et al. (2019) reported similar findings while

investigating teachers’ beliefs about theories of learning via online PD.

To date, researchers have not studied the efficacy of specific course elements (beyond comparisons of online and face-to-face formats). However, researchers have found that use of novel

online information and communication technology (ICT) in digital formats is more effective than

adaptation of traditional presentation strategies (e.g., taped lectures) in online PD (Boling et al., 2011; DeLuca et al., 2004; Huai et al., 2006; vanOostveen et al., 2019; Wang et al., 2008; Wasserman & Migdal, 2019). Such findings are partially due to the collaborative opportunities such

formats provide, as well as the potential for creating relevant PD that addresses teachers’ unique

contextual factors (Guskey, 2003, 2009). In developing the intervention, I purposively selected

activities that would balance opportunities for collaboration (e.g., the weekly discussions using


the Perusall application) with authentic application of content in music teachers’ unique context

(e.g., the Teacher-Constructed tasks). However, it may be worthwhile to examine the

effectiveness of different kinds of tasks, such as journals, in enhancing teachers’ assessment

literacy, beliefs, and practices.

Assessment Literacy Can Be Impacted through Intervention

I found that intervention group participants demonstrated significantly greater assessment

literacy than their control group peers at the conclusion of the study. Fan et al. (2011) also found

inservice teachers’ (N = 47) assessment literacy increased on the Assessment Knowledge Test

(AKT) – a researcher-designed measure – following a six-week online professional development

course. The AKT bears little resemblance to the CALI; thus, growth comparisons are difficult to

make. The AKT was not designed to align to the STCEAS, used a different number of items, and

was scored differently than the CALI. However, Fan et al.’s finding does lead me to wonder

whether a longer intervention (e.g., six or eight weeks) would have led to significant changes in

music teachers’ assessment beliefs. Koh’s (2011) finding that sustained professional

development (albeit, in a face-to-face format) was more effective than traditional “one shot”

workshops in changing teachers’ assessment literacy in the long term suggests that such a change

should be further investigated. In fact, in feedback from intervention participants, one teacher

specifically commented that they would have preferred a longer workshop. In setting the length of my intervention, I weighed the rigor of the course against the possibility of participant attrition if the course ran too long. In future uses of this professional development, I

may elect to extend the duration to eight weeks, allotting two weeks per module, which may

allow participants more time to synthesize and reflect upon content as they perform the teacher-

constructed task. Or, in extending the length of the intervention, I could allot more time for


standards that participants perceived as more challenging to understand, or that were resistant to

improvement on the posttest. In future studies, I may compare the impact of varying intervention

durations upon assessment literacy, beliefs, and practices.

Mertler (2009) used the Assessment Literacy Inventory (ALI) to measure teachers’

assessment literacy before and after a two-week professional development intensive (N = 7). The

ALI is comparable to the CALI; there are an equivalent number of items aligned to the STCEAS,

and items are scored identically; only the grouping and organization of items differ between

measures. Mertler (2009) found that participants demonstrated, on average, lower competency

amongst items from Standards 1, 2, and 3 on both the pretest and posttest in comparison to those

enrolled in the intervention in my study. However, Mertler’s participants did show greater

change in their scores on items from Standards 2, 3, and 4 from pretest to posttest, as well as

greater change overall on all four standards from pretest to posttest. In contrast to Mertler’s

(2009) findings, I found that participants’ scores increased the most for items comprising

Standards 1 (selecting appropriate assessments) and 3 (scoring and interpreting assessment

results). While Mertler’s intervention was a two-week in-person intensive, mine was an online

four-week asynchronous experience; differences in participants’ assessment literacy following an

intervention may be, in part, attributable to the difference in duration and/or format. Further,

Mertler studied general education inservice teachers (i.e., English, math), while I studied inservice music teachers exclusively. These differences could also be due to changes in

preservice teacher curriculum since Mertler conducted his study.

Assessment Beliefs are Related to Assessment Literacy

I found a modest significant relationship between assessment beliefs and assessment

literacy, indicating that those who were more assessment literate tended to hold higher regard for


assessment, and vice versa. While the intervention did not directly focus upon or target teachers’

assessment beliefs – though perceptions about assessment and its value were certainly alluded to

in the weekly discussion board – it is possible that the experience had some influence on

participants’ beliefs surrounding assessment. The focus of the intervention was primarily on

assessment literacy, i.e., the adaptable knowledge of processes and methods used to evaluate

student learning. Specifically, the intervention was created to enhance teachers’ knowledge about

selecting, designing, implementing and scoring, and interpreting assessments in their teaching

practice. That intervention participants demonstrated a significant increase in assessment literacy compared to their peers confirmed one of my original hypotheses.

Participant assessment literacy scores and assessment belief scores were moderately

related, suggesting that assessment literacy may inform assessment beliefs. That is, the more

participants know about assessment and how to implement it effectively, the higher regard they

may hold for assessment. Indeed, participants with higher assessment literacy scores tended to

hold higher regard for assessment. This was in keeping with Fan et al.’s (2011) findings after

investigating the effectiveness of an online program to enhance secondary inservice teachers’ (N

= 47) assessment knowledge and perspectives. This further suggests that assessment literacy

may, in some way, be influenced by the beliefs that music teachers hold about assessment

(Austin & Russell, 2019). Or, conversely, greater assessment literacy may inform music

teachers’ beliefs about assessment. Quilter and Gallini (2000) found that teachers’ past

experiences with classroom assessment correlated highly with their current beliefs.

Assessment Beliefs Appear Stable Across Time

Music teachers in this study held an overall positive view about the value and purposes of

assessment at the outset and conclusion of the study. While some beliefs nominally changed,


such changes were not significantly different between groups over time. While the four-week

period between the administration of the pretest and posttest may not have been enough time to

impact participants’ beliefs, I believe that assessment beliefs and other internal factors described

by McMillan may be resistant to change due to their connection to music teachers’ overall self-

identification with what Bartel (2004) and others have termed the “teacher-conductor” model. In

this model, music teachers have traditionally been characterized as “teacher-directed,

authoritarian” leaders compelling students to engage in “re-creative rather than creative

experience[s]” (Countryman, 2008, and Reimer, 1989, as cited in Berg, 2014, p. 263). Isbell

(2008) suggested that music teachers’ self-identities may be formed during preservice training

through conflicting narratives about their role as performers and educators. Such identity

formation may be reinscribed through secondary and tertiary socialization relationships with

previous secondary ensemble teacher-conductors, and preservice collegiate conductors and

instructors (Berg, 2014).

Music Teachers’ Assessment Practices Vary and Are Largely Informal

While I did not find music teachers’ assessment practices were significantly changed

following the intervention, I did find that they comprised a variety of forms (e.g., performance

tasks, tests, and attendance) and purposes (e.g., formative, summative, extramusical), directly

recalling McMillan’s (2001, 2003) findings. Music teachers largely reported using individual

performances (e.g., “down-the-line” music performance checks), but not other forms of

assessment, such as written tests and quizzes, classwork, portfolios, or projects (Hill, 1999;

Kancianic, 2006; Kotora, 2005; LaCognata, 2011; McClung, 1996; McCoy, 1988, 1991;

McQuarrie & Sherwin, 2013; Russell & Austin, 2010; Simanton, 2000). Half of my sample

reported using attendance, and over three quarters reported using participation to assess students


on a weekly basis. This corroborates Russell and Austin’s (2010) finding that many student appraisals are accounted for by “non-academic criteria.” Participants’ self-reported use of attendance and participation appeared to wane on the posttest, and this decline did not differ between assigned groups. Likely, the COVID-19 pandemic rendered such assessments moot as

PK-12 music educators shifted to online instruction.

Assessment Beliefs and Literacy Are Related to Assessment Practices

My findings did align with Austin and Russell’s (2017) observation that music teachers

who valued assessment were more likely to assess student learning for formative purposes, and

to eschew extramusical purposes of assessment (i.e., participation, attendance, and other

compliance-based behavioral targets). I also found that participants who purported to use assessment for formative functions tended to hold assessment in higher regard. Perhaps teachers who utilize formative assessments with greater frequency find that students demonstrate greater understanding of taught material, or, due to increased assessment literacy, no longer perceive assessment as a skillset separate from their teaching practices.

Music Teachers’ Assessment Practices May Be Impacted by Other Factors

In this study, music teachers’ assessment beliefs and practices did not significantly

change. While direct remediation of assessment beliefs and practices, extending the intervention,

and attempting to further mitigate the impact of COVID-19 on participants’ experiences may

have changed the outcome of this study, music teachers’ decision-making process may also be

impacted by additional factors. Measuring the influence of external factors (e.g., state, district, or

building policies, and parents) or classroom realities (e.g., building schedules, teaching load,

resources, and student characteristics) was not an aim in this study. However, these factors


undoubtedly influence teachers’ decision making regarding specific assessment practices, as

McMillan (2003) found.

Music teachers’ decision making may be uniquely impacted in comparison to their

general education peers by their positions and identities as musicians. Prior researchers have

found that music educators’ identities are shaped by factors such as personal philosophy about

the purpose of music education (LaCognata, 2010; Richerme, 2016), whether music teachers

identify more strongly as an educator or director (Isbell, 2008), and if they believe assessment

resides outside their creative roles (Denis, 2018). It may be important to examine music teachers’

decision-making processes independently of their general education peers. Otherwise, music

teachers will continue to (a) feel inadequate about their ability as assessors of student learning;

(b) disregard their role as assessors; and (c) execute the role ineffectively.

Music Teachers’ Classroom Assessment Decision-Making

McMillan’s (2003) conceptual map of teachers’ assessment decision making (Figure 1.3,

p. 23) depicts the factors teachers take into consideration when selecting assessment practices.

McMillan (2001, 2003) developed this model using survey and semi-structured interview data

collected from 27 English and mathematics teachers. McMillan described six major themes: (a)

internal factors (e.g., teachers’ knowledge, values, and beliefs), (b) external factors (e.g., state

accountability policies, district policies, and parents), (c) tension between internal and external

factors (e.g., grades, discipline, student success), (d) classroom realities (e.g., absenteeism,

heterogeneity, and limited resources), (e) decision-making rationales (e.g., student engagement, student success, and difficulty), and (f) assessment practices. In McMillan’s conceptual model, assessment practices are the manifested, tangible outcome of the preceding five factors.


Based upon my findings, I have altered McMillan’s model (Figure 5.1). My model differs from McMillan’s (2003) in two ways. First, while McMillan conceptualized internal factors as inclusive of teachers’ knowledge, beliefs, expectations, and values, in my model I have separated assessment literacy (i.e., adaptable assessment knowledge) and assessment beliefs (i.e., conceptions and values associated with assessment) from internal factors. Second, teachers’ knowledge and beliefs outside of assessment, as well as their expectations and values (e.g., personal philosophy of music education, goals for the music program), may serve as additional constructs informing “internal factors” unique to music educators. External factors and classroom realities may also require identifying factors that are unique to music educators.

Figure 5.1

Music Teachers’ Classroom Assessment Decision-Making

In both McMillan’s and my figure, assessment practices are the output of competing,

inter-related factors that form a teacher’s decision-making rationale for selecting a specific

assessment (or a specific purpose for assessing students). Internal and external factors contribute

to dissonance within the teacher; for example, a teacher’s philosophy about the purpose of


education may conflict with external demands from administrators (e.g., enrollment, classroom

management, increasing test scores, etc.). As teachers respond to these tensions, their own

internal narratives may shift, or the narratives of the external actors may shift. The teacher whose

philosophy was in conflict with the administrator’s demand for a different classroom

management style may react to the conflict by changing their philosophy, or by convincing the

administrator that the approach they are using is in the best interest of students. Internal and

external factors also directly impact the teacher’s rationale for subsequent assessment practices.

Additionally, classroom realities such as student heterogeneity and access to resources directly

impact teachers’ assessment decision-making. Teachers with large sections of students with varied needs may feel pressured to use fewer assessments, less formal assessments, or a narrower range of assessment types, and to assess less frequently. I believe that assessment beliefs and literacy also directly

impact teachers’ assessment decision-making, as well as moderate one another. For example,

teachers with lower levels of assessment literacy likely hold a lower regard for the value of

assessment (and vice versa), which subsequently impacts their assessment decision-making.

Teachers’ assessment beliefs and literacy may also impact other internal factors, such as

philosophical narratives about the purpose of education. Those with low regard for assessment or

a lack of knowledge about assessment processes may feel that education is not a measurable or

observable endeavor that can be captured in data, for instance. Such a stance also may impact a

teacher’s willingness to learn more about assessment (i.e., increase their assessment literacy), or

change their assessment beliefs.

Teachers’ classroom assessment decision-making is likely not a series of static, discrete

events that result in individual decisions about individual assessments. This process is possibly

continuous, and each of the factors may account for variance in the outcome depending upon the


most pressing needs of the students and teacher. That final decision – the assessment practice

selected by the teacher – may subsequently inform the next series of assessment decisions based

upon the teacher’s ability to reflect on assessment results, and any subsequent changes to their

assessment beliefs and/or literacy.

The Role of Other Factors

Other factors may influence music teachers’ assessment decision-making and allow

music teachers to overlook their roles as assessors. Measuring the influence of these factors (e.g.,

internal, external, tensions between internal and external, and classroom realities) was not an aim

of this study but should be further defined and explored by researchers.

Internal Factors. Internal factors, such as music teachers’ prior training in assessment,

individual program aspirations, or other personal values attached to music education (e.g.,

philosophy, cultural values, or even pedagogical beliefs), likely inform educational decision-

making. For the purposes of future studies, I believe that internal factors should be expanded

to include the elements depicted in Figure 5.2. Collectively, these elements may help explain

why some music educators are resistant to utilizing a variety of assessment forms and purposes.

For example, Denis (2018) argued that music educators often did not believe assessment was

appropriate for appraising subjective experiences like music, and that assessment is outside the

purview of their role as directors. In fact, it is this pervasive perspective held by music teachers –

that they are conductors rather than educators – that may contribute to these perceptions

(Mantie, 2012). Music educators who view themselves as conductors rather than educators may

have program aspirations that are more likely to include public recognition (i.e., in the form of

trophies, prestige, and community support), which affirms the conductor identity. Additionally, a

lack of prior assessment training may reinforce the perception that assessment is not an integral


professional duty, especially if prior training bears little relationship to the music teachers’

philosophy, program aspirations, and/or pedagogical beliefs.

Figure 5.2

Internal Factors

External Factors. Factors such as parental expectations for students’ grades, affect, and

achievement, as well as school and district expectations for program size and achievement, may

also play a role in educational decision-making related to assessment (Russell & Austin, 2010).

Music teachers often assign higher grades to students than teachers in other content areas do (LaCognata, 2010). This may create an expectation amongst some parents that high grades are a given in music coursework, regardless of students’ demonstrated musical knowledge or technical proficiency. Allsup and Benedict (2008) suggested that such expectations held by parents (as well as students, administrators, and community stakeholders) about the relationship between learning and grading in music courses are rooted in larger discourses about the legitimacy of music as an academic subject, and are therefore difficult to challenge.

Figure 5.3


External Factors

Tension. Researchers in general education and music education have not examined how tension between internal and external factors impacts educational decision-making tied to assessment.

However, the parallel tensions between the roles of performer and educator may provide context

for music teachers’ general fatigue and ambivalence toward assessment. Just as music teachers’

occupational role identity is shaped by conflicting – and often unreconciled – narratives (Isbell,

2008), I believe music teachers’ assessment decision-making may be, as well. When music

teachers’ personal expectations, values, beliefs, and knowledge about assessment outweigh the

relative importance of external factors (or, when music teachers have the autonomy and agency

to implement program and pedagogical change), music teachers may elect to use a greater

variety of assessments for myriad purposes, and conceive of assessment as an integral

component of instruction. When external demands of parents and other stakeholders about the

success, impact, and size of a music program outweigh the relative importance of music

teachers’ internal narratives (i.e., knowledge, values, beliefs, prior training, confidence, etc.),

music teachers may subordinate their personal desires and select fewer assessments, or fail to see

assessment as an integral component of instruction.

Figure 5.4


Tension

Classroom Realities. Classroom realities, such as the size of classes, teachers’ teaching

schedules, student discipline, and resources (i.e., monetary, technological, material, and human

resources), likely also impact assessment decision-making (Figure 5.5). Music teachers are often

assigned the largest class sections within their schools, the most numerous and varied teaching preps, and

the most challenging teaching and extracurricular schedules, particularly at the elementary

general and secondary levels (Hanzlik, 2001; Hill, 1999; Kancianic, 2006; LaCognata, 2010;

McClung, 1996; Sherman, 2006; Simanton, 2001). Music teachers may feel overwhelmed by the

prospect of designing, grading, and interpreting dozens – if not hundreds – of assessments,

especially if they lack resources such as technology, materials, or adequate planning time. These

factors no doubt contribute to music teachers’ reliance upon extramusical purposes of

assessment, such as attendance, participation, and other compliance-based appraisals of student

performance, especially if music teachers have not undergone assessment training.

Figure 5.5


Classroom Realities

The Role of Socialization

As described in Berg’s (2014) review of literature about music teacher preparation and

role-identity, music teachers’ socialization into the profession may play a role in forming a

“teacher-conductor” identity (p. 261). The teacher-conductor identity has traditionally entailed eschewing more student-centered practices, leading to a reliance on instruction where “a teacher/conductor

[stands] in front of a group of music makers controlling the starts and stops, correctly diagnosing

problems, and effectively prescribing remedies to reach the goal of a flawless performance”

(Bartel, 2004, as cited in Berg, 2014, p. 261). Contemporary music education researchers have

advocated for student-centered educative practices such as focused discussion to promote

musical awareness and critical thinking, and the use of peer-assisted learning activities

embedded within a traditional rehearsal-based context (Berg, 2014). Isbell (2008) suggested that

conflict between the performer (e.g., conductor, director, musician) and educator identities music

teachers hold over the course of preservice training may contribute to practices associated with

either student-centered or teacher-centered pedagogies. Denis (2018), after a review of music

education assessment literature, suggested that music teacher identity formation may even

contribute to perceptions amongst music teachers that assessment is inappropriate, and outside of

their responsibilities as educators. Johnson (2014) suggested that this false dichotomy between

music teachers’ self-identification either as performers or educators could be reconciled through


conducting coursework. Music education researchers’ examination of identity and role formation

through professional socialization could serve as an important framework for understanding the

reasons why music teachers hold beliefs about assessment that are resistant to change, and how

such beliefs could be influenced during preservice training.

Formation of music teacher identity occurs during primary, secondary, and tertiary

socialization. Primary socialization experiences occur prior to preservice teacher training, often

through formative familial experiences. Berg (2014) suggested that beliefs formed during this

period are “often not questioned and can be emotionally charged (Berger & Luckman, 1966),

thus functioning as one’s habitus (Bourdieu, 1993) or [contributing to] ideas about appropriate

actions, values, and one’s function in society (DeMarrais & LeCompte, 1999)” (p. 266).

Secondary socialization occurs in the years immediately preceding preservice training;

researchers have suggested that this period is often critical to music teachers’ decision to pursue

collegiate music training (Berg, 2014, p. 267). Inservice music teachers tend to identify most strongly with the role of music teacher or performer prior to preservice training, based upon performance experiences in

secondary ensembles (Berg, 2014; Isbell, 2008). Austin and Reinhardt (1999) suggested that

preservice music teachers’ philosophical beliefs tend to remain stable over time, further

reinforcing later researchers’ findings that preservice training may not necessarily alter

perspectives about the value and purpose of music education. Isbell (2008) found that

undergraduates’ occupational identity was best predicted by secondary socialization experiences.

Thus, it is not surprising that preservice and inservice teachers who recall secondary performance experiences under secondary ensemble conductors – in combination with the perception of directors’ roles as performers – decide to pursue music education.


Tertiary socialization occurs during occupational role construction, often during

preservice teacher coursework. As preservice music teachers acquire knowledge and experience

teaching music, they reconcile or integrate their prior secondary socialization experiences to

form an overall occupational identity. This identity (and associated values), while resistant to

change, is not immutable, and may be reshaped based upon the setting (Berg, 2014). Preservice

music teachers’ occupational identity consists of three constructs, according to Isbell (2008):

musician, self-perceived teacher, and teacher identity as inferred from others. Thus, providing

preservice and early career music teachers with experiences that challenge their prior

conceptions, engage them in reflection, and address the emotionally-charged components of their

philosophies could prove vital to shaping music teachers’ beliefs surrounding their roles as

assessors, and to integrating the roles of assessor and educator.

Measuring Assessment Literacy, Beliefs, and Practices

Evaluating the effect of the intervention on participants’ assessment literacy, beliefs, and

practices was the primary purpose of this study. Yet, to do so, it was essential to have effective instruments; reliably measuring assessment literacy, beliefs, and practices in valid ways was critical to determining the efficacy of the intervention. Selecting or designing appropriate

instruments for this study impacted the quality and kind of data I gathered. My results lead me to

wonder about the calibration and dimensionality of the instruments employed, and what

researchers must consider in future investigations surrounding assessment literacy, beliefs, and

practices.

Calibration

With regards to assessment literacy, I selected the CALI for the following reasons: (a) it

was the most widely utilized measure by researchers who have examined assessment literacy in


inservice and preservice teacher populations, (b) it was designed to align with the Standards for

Teacher Competence in Educational Assessment of Students (STCEAS), which I was also using

to plan instruction for the Music Teachers’ Assessment Workshop (MTAW), and (c) the questions

posed were based upon realistic vignettes portraying application of assessment knowledge (i.e.,

the knowledge measured was procedural, not inert, which is a key attribute of assessment

literacy). Yet, the internal consistency of the CALI remains compromised. On the CALI there are

five items associated with each standard. I found, within each standard, that some items were

neither difficult nor able to discriminate between the highest and lowest performers. Based upon

this finding, I believe that both more items are needed per standard (to increase the internal

consistency), and that the items should be more carefully calibrated to challenge respondents and to discriminate among performance levels.

For example, the five items corresponding to the first STCEAS standard (“selecting

appropriate assessments”) addressed issues surrounding reliability and validity, selecting the

most appropriate assessment strategy from a list for a specific scenario, and rationales underlying

the selection of an assessment. While these are all pertinent to the adaptable knowledge required

to select appropriate assessments, it may be necessary for there to be more items spanning these

concepts, and/or for more elements to be considered and represented under this standard.
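To make this calibration critique concrete, the following is a minimal sketch of the two classical item statistics at issue – difficulty (the proportion of respondents answering an item correctly) and discrimination (how well an item separates high and low performers, here operationalized as the corrected item-total correlation). The Python function and the response matrix are hypothetical illustrations, not the CALI items or data from this study.

```python
import numpy as np

def item_analysis(responses: np.ndarray):
    """Classical item statistics for dichotomously scored (0/1) items.

    responses: persons x items matrix of 0/1 scores. Returns per-item
    difficulty (proportion correct) and discrimination (correlation of
    each item with the total score excluding that item).
    """
    n_items = responses.shape[1]
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for i in range(n_items):
        rest = total - responses[:, i]  # total score without item i
        # An item everyone answers identically has zero variance and an
        # undefined discrimination -- itself a sign of poor calibration.
        discrimination[i] = np.corrcoef(responses[:, i], rest)[0, 1]
    return difficulty, discrimination

# Hypothetical responses: 6 teachers x 5 items (not study data)
X = np.array([[1, 1, 0, 1, 1],
              [1, 0, 0, 1, 1],
              [0, 1, 1, 1, 0],
              [1, 1, 1, 1, 1],
              [0, 0, 0, 1, 0],
              [1, 1, 0, 0, 1]])
difficulty, discrimination = item_analysis(X)
print("difficulty:", difficulty)          # values near 1.0 flag items that are too easy
print("discrimination:", discrimination)  # values near 0 flag items that sort performers poorly
```

Under this lens, an item that nearly every respondent answers correctly contributes little information about who is more or less assessment literate, which is the calibration concern raised above.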

Dimensionality

Prior researchers have criticized the CALI for its lack of internal consistency, as well as

the lack of fit between specific items and the subscale (i.e., the STCEAS) they comprise

(Alkharusi, 2015; Hailaya et al., 2014; Ryan, 2018). Alkharusi (2015) and Hailaya et al. (2014) suggested that assessment literacy – as measured by the CALI – was a unitary construct; that is,

that it was not composed of the seven subscales corresponding to the STCEAS. Ryan (2018) “did


not draw any definitive conclusion in support of one internal structure, but the results from [her]

study at least demonstrate that the ‘clean’ and ‘tidy’ Standards-based conceptualization of

assessment knowledge is questionable, and perfect alignment with the seven standards is highly

improbable regardless of the sample used” (p. 244). Future examination of assessment literacy

must begin with measurement studies to determine the dimensionality of assessment literacy, and

whether it is a unitary or a multi-dimensional construct.

Assessment beliefs have also been measured by researchers using instruments predicated

on either unitary or multi-dimensional definitions. For example, the work of Brown (2004,

2006) and his colleagues (2009a, 2009b, 2011a, 2011b, 2011c, 2012, 2015) was based upon his

instrument, the Teachers’ Conceptions of Assessment (TCoA). Brown and his colleagues

conceived of teachers’ assessment beliefs as multi-dimensional, comprising three primary

themes: (a) accountability purposes, (b) formative feedback purposes, and (c) irrelevance to

teachers’ practices. There have been numerous measurement studies both confirming and

disconfirming statistical fit to these dimensions (Azis, 2015; Remesal, 2011; Segers & Tillema,

2011); however, Brown has also cautioned that the fit of this multi-dimensional conception of

assessment belief appears dependent upon the population used. Harris and Brown (2009) used a

phenomenographic approach to explore the purposes that a sample of 26 New Zealand teachers

ascribed to assessment and arrived at four dimensions. Allal (2013) asked ten Swiss

teachers to select student work and discuss the rationale for their appraisals, and found that

assessment techniques were socially situated, and often informed by conflicting internal and

external factors. Using Brown’s TCoA, Azis (2015) conducted an explanatory mixed-method

study to investigate assessment beliefs and their influence on assessment practices, and found agreement with Brown’s three-dimension conception of assessment beliefs.


While the MTABI was adapted from the TCoA by Austin and Russell (2017), it differs from Brown’s instrument in that it is a global measure of assessment beliefs. It does, however, demonstrate a sustained high degree of internal consistency. The statements used in the MTABI

encompass evaluative (e.g., “assessment results are trustworthy”) and affective (e.g., “assessment

results are rightfully ignored by most music teachers”) aspects of belief related to assessment,

stated in a positive or negative manner. As with the CALI, researchers should continue to

examine the dimensionality and malleability of assessment beliefs, and the often-conflicting

narratives that inform such beliefs.

I designed the MTAII, in part, because researchers have not reached definitional

consensus about what constitutes assessment practices. Some researchers defined practices as the

specific forms of assessment used (Aschbacher, 1999; Frey & Schmitt, 2010; Hanzlik, 2001;

Hill, 1999; Kotora, 2005; McMillan et al., 2002; Russell & Austin, 2010; Sears, 2002; Sherman,

2006), while others defined assessment practices as the purposes for which teachers employ

assessment (LaCognata, 2010; Mertler, 2000; McClung, 1996; McMillan, 2001; Oosterhof,

1995; Zhang, 1996). It is possible that assessment practices may encompass both the forms and

purposes for which teachers use assessment. However, it is also important to consider the way

that data are collected; some researchers, such as Kancianic (2006), measured the number of

assessments teachers report using, while others measured the frequency using a scale (Mertler,

2000), or the degree to which teachers felt skilled while employing specific assessment practices

(Zhang, 1996). The common feature of these data collection techniques is that all relied upon

accurate self-reporting from teachers. There is no evidence suggesting that teachers are accurate

in their self-reporting of assessment practice data, and researchers should continue to investigate

more objective (e.g., observation-based) ways of collecting this kind of data.


Implications

...for Music Teacher Development

Given the findings of this study, further contextualized by intervention participant

feedback (Appendix L), I believe there are important implications for inservice music teachers,

school districts, music teacher educators, and national arts education organizations.

Inservice music teachers must advocate for and seek targeted, district-supported

professional development that addresses their content area. Professional development should be

designed in a flexible format (e.g., face-to-face, online, or a hybrid of the two), and facilitate

application of the desired competencies for music teachers. Facilitated practice has been shown

to be more effective in changing teachers’ knowledge and practices than traditional lecture-based

professional development formats (Chen, 2007). Professional development is the most

appropriate avenue to rectify assessment illiteracy within the music teaching profession, as

current research suggests that preservice assessment coursework (whether delivered via a stand-alone or an integrated course) may have little impact on the subsequent assessment practices

of inservice teachers (Gutierrez, 2014). Based upon the results of this study, feedback from

participants, and prior research, it is evident that music teachers may desire additional

assessment knowledge development opportunities that their preservice and/or master’s degree programs, state conferences, and prior inservice professional development have not provided.

Music teacher educators also stand to benefit from these findings. While some suggest

that assessment literacy development occurs most rapidly with inservice teachers, perhaps

because they are able to enact novel strategies in their own classrooms, preservice music teachers

may later become more open to professional development if assessment concepts are introduced


during their training in a fresh way. Music teacher educators should find ways to incorporate

more assessment training into their curriculum, whether through formal changes to program

course requirements, or embedded activities where students create, implement, and reflect upon

assessments they give in their practicum placements. For example, the assessment construction

project that music teachers enrolled in my intervention completed could be readily adapted for

preservice teachers to use in their field-based practica. Participants in this

study found those experiences – where activities were aligned to course readings and objectives

– valuable. Preservice teachers would no doubt also benefit from such activities, even if they did

not lead to sustained assessment literacy in their inservice teaching. Learning from both their

cooperating teacher and education professors could likely be more impactful than traditional

preservice assessment training, which is typically delivered in the form of single-class lectures,

or out-of-context assignments.

While NAfME does have a visible position statement about assessment in music

education on its website, as well as guidelines for music teachers, school boards, legislators,

and other decision-makers, it is not clear that there is consequential organizational support for

developing music teachers’ competency as assessors. NAfME, as well as other teacher advocacy

organizations, may apply the findings of this study by providing the teachers they

represent with effective professional development. In particular, NAfME is better situated to

access and appeal to music teachers (both inservice and preservice) than any other organization.

Whether through its online presence as a national organization, or its satellite presence via state-level music education associations and conferences, NAfME can and should endorse, finance, develop, and implement professional development to achieve the aims of its position

statement. Further, NAfME, and other national arts education organizations, should work to


develop music teacher-specific assessment literacy standards that consider the unique contextual

features and attributes of music educators.

...for Future Implementations of this Intervention

In future iterations of this intervention, I will alter several elements. Specifically, I will incorporate reflection-based activities to directly address music teachers’

assessment beliefs, reapportion emphasis to provide participants with more practice interpreting

assessment data and using data to make future educational decisions, ensure that participants

were able to conduct the Teacher-Constructed tasks with their students, provide greater

differentiation in resources, and increase the length of the intervention from four to eight weeks.

I did not directly address music teachers’ assessment beliefs in this intervention, although teachers did engage in discussions about the nature of their beliefs, and

assumptions they held about the purposes and value of assessment. In a future implementation of

the professional development, I will explicitly address music teachers’ beliefs through journal

prompts and reflections based upon the readings and the discussions surrounding them. I believe this alteration will help music teachers examine the origins of their beliefs

(which may be tied to professional socialization experiences), and dispassionately disentangle what could be emotionally charged assessment beliefs from their identity as educators. I also believe

this may help integrate music teachers’ often conflicting narratives about their roles as

conductors and educators.

I will also place greater emphasis on materials and activities connected to the third and fourth STCEAS standards. In this study, music teachers showed the most growth in Standards 1 and 2 (selecting

and designing assessments), but minimal growth in Standards 3 and 4 (implementing, scoring,

and interpreting assessment data for educational decision making). Participants’ comments about


the intervention also led me to believe that activities associated with Standard 3 were the most

intellectually challenging for many music teachers. Arguably, Standard 4 is essential to shifting

music teachers’ actual assessment practices. Thus, I might alter the teacher-constructed tasks

associated with Standards 3 and 4 to include greater scaffolding of evaluation and measurement

concepts, ask teachers to respond to various scenarios about scoring and interpreting assessment

data, and reflect upon how effective teachers’ subsequent educational decision making (i.e., the

instruction following the assessment) was for their students. Or, I might alter all of the teacher-constructed tasks to embody a participatory action research project by asking teachers to use the

assessment they designed during the course with one cohort of students, and a previously-utilized

assessment (or no assessment) with another cohort.

While the COVID-19 pandemic prevented music teachers from implementing their assessment with their students due to a lack of classroom access, I will design future versions of this professional development to allow music teachers to involve their students in more varied ways. For example,

within this study I asked participants to design a rubric-based assessment for students that would

conceivably be implemented in ordinary circumstances. In the future, I will spend more time

emphasizing the integrated nature of assessment and instruction and provide participants with

exemplars of assessments representing a variety of forms and purposes. This way, regardless of

circumstances, teachers will be more likely to utilize their assessment with their students and

maximize the value that authentic practice may hold in shaping assessment beliefs.

The resources and materials participants accessed during this study were largely text-

based articles. In feedback comments, several participants voiced their desire for greater

differentiation in the materials, such as audio files of articles being read (e.g., Audible or other

software that offers oral versions of text) or video-based lectures that summarize the major points


from each week’s readings. I will take the time to locate or create audio versions of articles and

provide other audio- and video-based options for participants who require such accommodations

to complete the coursework. I will also emphasize that participants are only required to access

one or two articles each week so they can discuss them with other participants; this activity was

well-received by participants and may help challenge previously held assessment beliefs.

Additional materials will be offered on a supplemental basis.

Finally, I will change the length of the professional development from four to eight

weeks. Two participants suggested they would have benefitted from more time to complete the

modules, or to engage in the resources more fully. By doubling the time of the intervention, I can

devote one week for participants to read and discuss materials, and one week to synthesize and

apply their knowledge through the Teacher-Constructed tasks. Changing the duration of the

intervention will provide participants more time to engage with materials and to reflect upon the

processes associated with each of the four STCEAS.

Study Limitations and Implementation Challenges

The findings of this study must be contextualized within its limitations,

including the sampling procedure, participant attrition between stages of the study, possible

history effects, and reliability of the CALI instrument.

Sampling Procedure

In March of 2020, I utilized the NAfME Research Assistance Program (RSA) to solicit

participants from a nationwide sample of 19,870 music teachers. Of that number, only 6,309

opened the email, and 247 clicked on the link to read more about the study. It is impossible to know for certain, but I suspect that the opportunity to send one or two follow-up reminders would have garnered greater attention from music teachers at the outset of the study. Of those 247, 108 completed informed


consent, 74 completed the pretest, and 43 completed the study. Due to NAfME’s data collection

procedures for members and the RSA, they were unable to provide demographic information

about the 19,870 members they pulled from their lists (i.e., whether participants were full-time K-12

music educators). This is, in part, why I included a pre-screening question. However, it also

means that I am unable to calculate an accurate response rate or fully assess the

representativeness of my participants, because I do not know how many of the music teachers in

the target population met my parameters. Regardless, it is evident that participation of 43 members from a target population of approximately 20,000 educators would result in a quite low response rate (roughly 0.2%). Participation may have been improved had I utilized a more compelling form of

reciprocity (e.g., monetary compensation) to incentivize both participation and completion of the

study.

Although generalizability is not a goal of intervention designs, it was still important to

ensure that participant characteristics were not significantly different between groups. A low

response rate could conceivably result in a nonresponse bias, where participants in the study

differ from those who chose not to participate. The use of a volunteer sample based upon music

teachers with NAfME membership could also conceivably impact the representativeness of

participants. In the future, researchers might draw a sample of music teachers from a single

district or state to contextualize findings for a specific population; their findings could be

sponsored and distributed by state-level music education associations, which would benefit from

empirical research.

Other music education researchers who have utilized a similar sampling procedure also reported low response rates. LaCognata (2013) reported a 10% response rate for a sample of 4,500 music teachers. Koerner (2017) was unable to report an official response rate for his sample of 154 music teachers from ten states, but estimated an approximate response rate of 3% based upon National Center for Education Statistics data about teacher populations. Several other researchers have reported low response rates using this sampling method (Baccala, 2020; Hahn, 2010; Hourigan, 2008). Researchers should take this trend into account when

utilizing the NAfME Research Assistance Service. In the meantime, researchers may lobby for changes to the service (e.g., permitting follow-up reminders) through channels such as the Society for Research in Music Education (SRME).

Framing changes to NAfME’s Research Assistance Service as a collaborative effort between

national leadership and researchers for the benefit of the profession may be persuasive.

Given that the study coincided with the onset of the COVID-19 pandemic, it is possible that participants in this study are only representative of NAfME-member music teachers invested in professional development and assessment. Because

NAfME would not resend the solicitation email, I was strictly limited to the 108 participants who

completed the informed consent document from the initial wave, which limited statistical power.

However, I did meet the statistical assumptions for the analyses employed in this study, and as

discussed in Chapter 4, I did not find any preexisting differences between intervention and control

group participants at the pretesting stage, which implies that the randomization process may have

worked as intended.

In a future replication of this study, I might consider co-designing the professional

development experience with input from teachers. This could potentially increase buy-in and

participation by music teachers and help deliver the most relevant and targeted professional

development. I also suspect strong endorsement (i.e., compulsory participation and completion)

by a school district, state accrediting or licensure body (e.g., a state department of education), or

national organization (e.g., NAfME) may have also helped attract more participants. These


considerations bolster my resolve to continue developing and implementing this assessment-

focused professional development in the future.

Participant Attrition Between Stages

In a similar vein to the sampling procedure, participant attrition between stages of the

study may have served as a major threat to internal validity (i.e., attrition cases in either group

may have altered the posttest results). However, as noted in Chapter 4, those who chose to leave

the study were not significantly different from those who completed the study regarding their

pretest measures of assessment literacy, beliefs, or practices. The design of the study, including

the length of the measures utilized and the time required to complete the MTAW modules, may

have also contributed to this attrition.

History Effects Due to the COVID-19 Pandemic

Unfortunately, there is no way to tell what impact – and to what degree – the concurrence

of this study with the beginning of the first wave of the COVID-19 pandemic may have had on

findings. While I cannot definitively prove COVID-19 was the cause of the 42% (n = 31) participant attrition, I know through anecdotal evidence (i.e., emails from participants) that the pandemic placed an immense burden on music teachers, who suddenly shifted all instruction online during late March and early April 2020, and that it contributed to at least seven participants’ decisions to leave the study. I could also deduce that the sudden shift to online instruction may

have impacted findings with regard to music teachers’ assessment practices, as there was a shift

from emphasis on traditionally utilized forms and functions (e.g., participation, attendance, and

extramusical functions) to a greater emphasis on formative assessment, and written classwork

and individual performances. A future replication of this study – in a post-COVID-19 world –


may reveal greater attainment of the objectives linked to the intervention group, and lower

attrition.

Reliability of the CALI Instrument

Reliability estimates – as measured through a KR20 analysis – of inservice music teacher responses to the CALI instrument continue to be less than satisfactory. Just as Mertler (2004)

reported, I was unable to reach an acceptable threshold for reliability with my participants (N =

43). This also corresponds to what Ryan (2018) reported for the CALI, and what other

researchers have reported for similar measures, like the TALQ, tKUDA, and ALI (Alkharusi,

2015; Donovan, 2015; Hailaya et al. 2014). However, it is important to consider what internal

consistency measures of a competency instrument conceptually convey; that is, such an

instrument can only be considered “reliable” to the degree that participants hold a similar level of

knowledge (or lack thereof). Psychometricians disagree about the use of traditional internal

consistency estimates for instruments that are not scale-based (Thompson, 2010). Regardless, it

may be prudent, based upon item difficulty and discrimination indices, and item-level distractor

analysis, to consider amending or creating alternative measures for assessment literacy,

especially for music educator populations.
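For context, KR-20 is computed as KR20 = (k / (k − 1)) × (1 − Σ p_i q_i / σ²_X), where k is the number of items, p_i is the proportion of respondents answering item i correctly, q_i = 1 − p_i, and σ²_X is the variance of respondents’ total scores. The sketch below illustrates the computation in Python on a hypothetical response matrix; it is not this study’s data or analysis code.

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 for a persons x items matrix of 0/1 scores."""
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion correct per item
    q = 1.0 - p                                     # proportion incorrect per item
    var_total = responses.sum(axis=1).var(ddof=1)   # sample variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / var_total)

# Hypothetical responses: 6 teachers x 5 items (not study data)
X = np.array([[1, 1, 0, 1, 1],
              [1, 0, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [1, 1, 1, 1, 1],
              [0, 0, 0, 1, 0],
              [1, 1, 0, 0, 1]])
print(f"KR-20 = {kr20(X):.3f}")
```

Note that the estimate depends on total-score variance: when participants hold similarly high (or low) levels of knowledge, σ²_X shrinks and KR-20 is depressed even if the items function well – the conceptual caveat raised above.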

Recommendations for Future Research

This study contributed to an under-investigated area of music education research; music

educator assessment literacy, beliefs, and practices are almost completely represented by

doctoral dissertations, volumes produced after each of the seven International Symposia on Assessment in Music Education, and the published research of Austin and Russell (Austin & Russell, 2016,

2017, 2019; Russell & Austin, 2010). Austin and Russell have primarily examined music

teachers’ assessment beliefs, extending the work of Brown and his colleagues in general


education. Music education researchers have not converged on a conceptualization of assessment

practices; past research has included both forms and functions of assessment in classroom

instruction. Thus, it would be helpful for music education researchers to create an instrument or

utilize a strategy that does not solely rely upon self-reported data. Aside from the possibilities for

future research that may directly address or overcome some of this study’s limitations and

challenges, there are other avenues for studying music teacher assessment as illuminated by my

findings. Chiefly, my recommendations include: (a) development of reliable measures of music teachers’ assessment literacy, beliefs, and practices; (b) examination of the relationships between music teachers’ assessment literacy, beliefs, and practices; (c) deliberate partnerships between researchers and school districts to facilitate intervention-based studies; and (d) further

examination of the role other factors may play in music teachers’ educational decision-making

surrounding assessment.

Developing a reliable measure of music teachers’ assessment literacy is an important step

toward accurately capturing inservice and preservice music teachers’ competency. Perhaps music

researchers should align a future instrument to standards other than the STCEAS, such as music

teacher certification competencies. Or, perhaps NAfME, or another music teacher advocacy

group, could put together a task force to identify the factors that characterize assessment-literate music teachers. Following the most recent International Symposium on Assessment in Music Education in 2019, a task force was assembled to identify such factors, but it has not publicly released documentation. As researchers have noted in recent years, the STCEAS needs

an update (Brookhart, 2011; DeLuca et al., 2016; Gotch & French, 2014; Popham, 2019).

Additionally, developing a valid and reliable instrument for evaluating music teachers’

assessment practices is required to perform parametric analyses of those practices, and to


compare them to assessment literacy and beliefs. As previously discussed, the calibration and

perceived dimensions of assessment literacy, beliefs, and practices require further examination.

Future researchers should utilize exploratory and descriptive designs (e.g., qualitative case

studies, explanatory mixed method designs) to operationalize these constructs.

I also believe that researchers should continue examining the relationships between the

three major constructs in this study (i.e., assessment literacy, beliefs, and practices). For

example, researchers could examine whether the effect of assessment literacy on assessment practices is moderated by assessment beliefs (or vice versa). They could also examine the direction and

magnitude of the relationship between assessment literacy and beliefs. Teasing out these

relationships has important implications for rectifying music teachers’ assessment illiteracy. For

example, if assessment beliefs were found to be influenced by music teachers’ assessment

literacy, that could impact the curricular offerings for preservice music teachers and adjust the

professional development priorities of administrators overseeing inservice music teachers. These

relationships could be examined using path analysis, confirmatory factor analysis, or structural

equation modelling.
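As a concrete starting point for the moderation question, the interaction between literacy and beliefs can be probed with an ordinary regression before committing to a full path or structural model. The sketch below is a hypothetical Python illustration using pandas and statsmodels; the file name and variable names are placeholders rather than this study’s measures.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file: one row per teacher with composite scores
# for assessment literacy, beliefs, and self-reported practices.
df = pd.read_csv("teacher_scores.csv")

# Mean-center the predictors so each main effect is interpretable
# at average levels of the other predictor.
for col in ("literacy", "beliefs"):
    df[col] = df[col] - df[col].mean()

# "literacy * beliefs" expands to literacy + beliefs + literacy:beliefs;
# a significant interaction term would suggest that beliefs moderate
# the literacy -> practices relationship (or vice versa).
model = smf.ols("practices ~ literacy * beliefs", data=df).fit()
print(model.summary())
```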

Music education researchers have suggested that accessing teaching populations can be

challenging (Austin, 2018; Koerner, 2017; LaCognata, 2013). Austin noted that the issue of

access is compounded when “researcher [projects do] not support the districts’ policy positions

or curricular priorities” (2018, p. 8). It is little surprise, therefore, that intervention designs

utilizing inservice music teachers are not well represented in our research journals. These

circumstances (i.e., a need for targeted professional development for music teachers, and a desire

from music education researchers to access teaching populations more easily) should be

approached as an opportunity by all stakeholders to form partnerships. Music teachers and


districts would benefit from researchers’ expertise in assessment content, and researchers would

benefit from more intimate access to the population they most often study. Additionally,

assessment is both an understudied area in music education research, and a highly desired area

for professional development by school districts, especially those in the process of adapting

curricula to new state standards aligned to the National Core Arts Standards and/or Model

Cornerstone Assessments (Payne et al., 2019). Finally, use of an online or hybrid delivery system

for professional development could prove useful in alleviating some of the access barriers posed

by face-to-face formats. Music education researchers could even frame such interventions as a

form of participatory action research, and increase teacher buy-in and support for research-based

practices.

Finally, researchers should extend the findings of this study and McMillan’s (2003)

conceptualization of teachers’ educational decision-making surrounding assessment by exploring

the role other factors may play. For example, researchers might investigate how impactful music

teachers’ internalized beliefs, values, knowledge, confidence, self-efficacy, prior experiences

with assessment training, and/or prior socialization experiences are in shaping educational

decision-making. This could take the form of various mixed method designs, or advanced

experimental designs. For example, researchers could use a survey instrument to collect

assessment literacy, belief, and/or practice data, and demographic information. In a follow-up

case-study or grounded-theory approach, researchers could conduct extensive semi-structured

interviews with teachers, administrators, parents, and other stakeholders, as well as classroom

observations to verify self-reported data, and build a cohesive theory of music teachers’

assessment decision-making. Then researchers could utilize assessment literacy, belief, and/or

practice data in a confirmatory factor analysis of the grounded model. Or researchers could


conduct a study where they collect data from district teachers in the same way described within

this study, but also delve into the experiences of teachers in both groups using a multiple-case

study approach. This would allow for corroboration of findings and cross-case analysis between

assigned groups. It would be helpful for researchers, policymakers, and inservice teachers to

know approximately how much variance in assessment decision-making is accounted for by

other factors.

Conclusion

In this study, I found that music teachers’ assessment literacy can be significantly

increased through online professional development, that music teachers’ assessment literacy and

beliefs are moderately related, and that music teachers’ beliefs and specific assessment practices

are moderately related. Further, I found that music teachers in the intervention valued receiving

content-area focused professional development. While there are still issues surrounding the

reliability of instrumentation for measuring assessment literacy, these findings point to possible

solutions for alleviating overall music teacher assessment illiteracy, including but not limited to

national support for developing music teacher-specific literacy standards, development and

dissemination of professional development by national organizations (e.g., NAfME), and

intentional collaboration between music education researchers and school districts. School

districts, inservice music teachers, and music teacher preparation programs may benefit from use

of targeted training in assessment for their respective teacher populations. In the future,

researchers should continue to examine the dimensionality, direction, and magnitude of the

relationships between music teachers’ assessment literacy, beliefs, and practices. The COVID-19

pandemic may have played a role in the impact of this specific intervention; yet, a silver lining of

this timing is that there has never been a better time to reimagine teacher development, and work


toward improving music teachers’ assessment literacy, beliefs, and practices. Ultimately, these

concepts affect the quality of students’ learning experiences. In the months and years to follow

the COVID-19 pandemic, music teachers, researchers, and other stakeholders will have the

opportunity to make substantive changes – whether curricular or policy-based – to the

educational endeavor, and truly unlock the potential of effective assessment.


References

Adams, K. A., & Lawrence, E. K. (2019). Research methods, statistics, and applications (2nd

ed.). SAGE Publications.

Airasian, P. W. (2004). Classroom assessment: Concepts and applications (5th ed.). McGraw-

Hill.

Allal, L. (2013). Teachers’ professional judgement in assessment: A cognitive act and a socially

situated practice. Assessment in Education: Principles, Policy & Practice, 20(1), 20-34.

https://doi.org/10.1080/0969594X.2012.736364

Allsup, R. E., & Benedict, C. (2008). The problems of band: An inquiry into the future of

instrumental music education. Philosophy of Music Education Review, 16(2), 156–173.

www.jstor.org/stable/40327299

American Federation of Teachers, National Council on Measurement in Education, and the

National Education Association. (1990). The standards for teacher competence in

educational assessment of students. Retrieved from

http://www.unl.edu/buros/article3.html.

Austin, J. R. (2018). In defense of researcher access. Journal of Music Teacher Education, 27, 7-10. https://doi.org/10.1177/1057083717748707

Austin, J. R., & Reinhardt, D. (1999). Philosophy and advocacy: An examination of preservice

music teachers’ beliefs. Journal of Research in Music Education, 47(1), 18–30.

https://doi.org/10.2307/3345825

Austin, J. R., & Russell, J. (2016). The status of assessment instruction in U.S. graduate music

education programs: Access, curriculum, and outcomes. Paper presented at the 32nd

World Conference of the International Society of Music Education in Glasgow, Scotland.


Austin, J. R., & Russell, J. (2017). Secondary music teachers’ assessment practices: The role of

occupational identity and assessment conceptions. Paper presented at the Sixth

International Symposium on Assessment in Music Education in Birmingham, England.

Austin, J. R., & Russell, J. (2019). Preservice music teachers’ assessment education: Relations

with assessment conceptions, assessment confidence, projected assessment practices, and

occupational identity. Paper presented at the Seventh International Symposium on

Assessment in Music Education in Gainesville, Florida, USA.

Aschbacher, P. R. (1999, December). Developing indicators of classroom practice to monitor

and support school reform. University of California, Los Angeles: CRESST Technical

Report 513.

Assessment in Music Education. (2009). Retrieved from https://nafme.org/about/position-

statements/assessment-in-music-education-position-statement/assessment-in-music-

education/.

Azis, A. (2015). Conceptions and practices of assessment: A case of teachers representing

improvement conception. TEFLIN Journal, 26(2), 129-154. https://doi.org/10.15639/teflinjournal.v26i2/129-154

Baccala, A. C. (2020). Elements of comprehensive musicianship: A survey addressing the attitudes and approaches of middle and high school choral directors [Doctoral dissertation, Auburn University]. Auburn University Archive. https://etd.auburn.edu/bitstream/handle/10415/7172/Elements%20of%20Comprehensive%20Musicianship-%20A%20Survey%20Addressing%20the%20Attitudes%20and%20Approaches%20of%20Middle%20School%20and%20High%20School%20Choral%20Directors%20by%20Allison%20Baccala.pdf?sequence=2

Bailey, S., Henricks, S., & Applewhite, S. (2015). Student perspectives of assessment strategies in online courses. Journal of Interactive Online Learning, 13(3), 112-125.

Baker, W. (2013). Questioning assumptions. Vivienne: a case study of e-learning in music

education. Australian Journal of Music Education, (1), 13-22.

Baldwin, S. J., Ching, Y., & Friesen, N. (2018). Online course design and development among

college and university instructors: An analysis using grounded theory. Online Learning

Journal, 22(2), 157-171. https://doi.org/10.24059/olj.v22i2.1212

Barnes, N., Fives, H., & Dacey, C. M. (2017). U.S. teachers' conceptions of the purposes of

assessment. Teaching and Teacher Education, 65, 107-116.

https://doi.org/10.1016/j.tate.2017.02.017

Berg, M. H. (2014). Preservice music teacher preparation for the conductor-educator role. In J. R. Barrett & P. R. Webster (Eds.), The musical experience: Rethinking music teaching and learning (pp. 261-283). Oxford Scholarship Online. https://doi.org/10.1093/acprof:oso/9780199363032.003.0015

Biasutti, M., Frate, S., & Concina, E. (2019). Music teachers’ professional development:

Assessing a three-year collaborative online course. Music Education Research, 21(1),

116-133. https://doi.org/10.1080/14613808.2018.1534818

Boling, E. C., Hough, M., Krinsky, H., Saleem, H., & Stevens, M. (2012). Cutting the distance in

distance education: Perspectives on what promotes positive, online learning experiences.

Internet and Higher Education, 15, 118-126.


Box, C., Skoog, G., & Dabbs, J. M. (2015). A case study of teacher personal practice assessment theories and complexities of implementing formative assessment. American Educational Research Journal, 52(5), 956-983. https://doi.org/10.3102/0002831215587754

Brookhart, S. (2001). The “Standards” and classroom assessment research. Paper presented at

the 53rd Annual Meeting of the American Association of Colleges for Teacher Education

in Dallas, Texas, USA.

Brookhart, S. (2011). Educational assessment knowledge and skills for teachers. Educational

Measurement: Issues & Practice, 30(1), 3-12.

Brown, G. (2004). Teachers’ conceptions of assessment: implications for policy and professional

development. Assessment in Education: Principles, Policy & Practice, 11(3), 301-318.

https://doi.org/10.1080/0969594042000304609

Brown, G. (2006). Teachers’ conceptions of assessment: Validation of an abridged version.

Psychological Reports, 99(1), 166-170. https://doi.org/10.2466/pr0.99.1.166-170

Brown, G., Chaudhry, H., & Dhamiji, R. (2015). The impact of an assessment policy upon

teachers' self-reported assessment beliefs and practices: A quasi-experimental study of

Indian teachers in private schools. International Journal of Educational Research, 71, 50-

64. https://doi.org/10.1016/j.ijer.2015.03.001

Brown, G., Harris, L. R., & Harnett, J. (2012). Teacher beliefs about feedback within an assessment for learning environment: Endorsement of improved learning over student well-being. Teaching and Teacher Education, 28(7), 968-978. https://doi.org/10.1016/j.tate.2012.05.003

Brown, G., Hui, S., Yu, F., & Kennedy, K. (2011). Teachers’ conceptions of assessment in

Chinese contexts: A tripartite model of accountability, improvement, and irrelevance.


International Journal of Educational Research, 50, 307-320.

https://doi.org/10.1016/j.ijer.2011.10.003

Brown, G., Irving, S., Peterson, E., & Hirschfeld, G. (2009). Use of interactive-informal

assessment practices: New Zealand secondary students’ conceptions of assessment.

Learning and Instruction, 19, 97-111. https://doi.org/10.1016/j.learninstruc.2008.02.003

Brown, G., Kennedy, K., Fok, P., Chan, J., & Yu, M. (2009). Assessment for student

improvement: understanding Hong Kong teachers’ conceptions and practices of

assessment. Assessment in Education: Principles, Policy & Practice, 16(3), 347-363.

https://doi.org/10.1080/09695940903319739

Brown, G., Lake, R., & Matters, G. (2011). Queensland teachers’ conceptions of assessment:

The impact of policy priorities on teacher attitudes. Teaching and Teacher Education, 27,

210-220. https://doi.org/10.1016/j.tate.2010.08.003

Brown, G., & Michaelides, M. (2011). Ecological rationality in teachers’ conceptions of assessment across samples from Cyprus and New Zealand. European Journal of Psychology of Education, 26, 319-337. https://doi.org/10.1007/s10212-010-0052-3

Burnaford, G. (1999). Teacher action research as professional development in schools: Four paths to change. School-wide inquiry: A self-study of an “outside” teacher researcher. Opinion paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Campbell, D.T., & Stanley, J.C. (1963). Experimental and quasi-experimental designs for

research. Rand McNally.

Chen, S. (2007). Instructional design strategies for intensive online courses: an objectivist-

constructivist blended approach. Journal of Interactive Online Learning, 6(1), 1-15.


Cherasaro, T. L., Reale, M. L., Haystead, M., & Marzano, R. J. (2015). Instructional improvement

cycle: A teacher's toolkit for collecting and analyzing data on instructional strategies.

ERIC Clearinghouse.

Colwell, R. (2008). Assessment in music education: Integrating curriculum, theory, and practice. Proceedings of the 2007 Florida Symposium on Assessment in Music Education, March 29-31, 2007, University of Florida, Gainesville, Florida (K. Albert & T. S. Brophy, Eds.). GIA.

Conway, C. (2002). Perceptions of beginning teachers, their mentors, and administrators

regarding preservice music teacher preparation. Journal of Research in Music Education,

50(1), 20-36.

Conway, C. (2012). Ten years later: Teachers reflect on “perceptions of beginning teachers, their

mentors, and administrators regarding preservice music teacher preparation.” Journal of

Research in Music Education, 60(3), 324-338.

https://doi.org/10.1177/0022429412453601

Crusan, D., Plakans, L., & Gebril, A. (2016). Writing assessment literacy: Surveying second

language teachers' knowledge, beliefs, and practices. Assessing Writing, 28, 43-56.

https://doi.org/10.1016/j.asw.2016.03.001

Darling-Hammond, L., Chung, R., & Frelow, F. (2002). Variation in teacher preparation:

How well do different pathways prepare teachers to teach? Journal of Teacher

Education, 53(4), 286-302.

Darling-Hammond, L. (2006). Powerful teacher education: Lessons from exemplary programs.

Jossey-Bass.

DeLuca, C., & Klinger, D. A. (2010). Assessment literacy development: Identifying gaps


in teacher candidates’ learning. Assessment in Education: Principles, Policy, &

Practice, 17(4), 419-438.

DeLuca, C., Klinger, D., Searle, M., & Shulha, L. (2010). Developing a curriculum for assessment education. Assessment Matters, 1, 133-156.

Deneen, C. & Brown, G. (2016). The impact of conceptions of assessment on assessment literacy

in a teacher education program. Cogent Education, 3, 1-14.

https://doi.org/10.1080/2331186X.2016.1225380

Denis, J. M. (2018). Assessment in music: A practitioner introduction to assessing students. Update: Applications of Research in Music Education, 36(3), 20-28. https://doi.org/10.1177/8755123317741489

Desimone, L. (2009). Improving impact studies of teachers’ professional development: Toward

better conceptualizations and measures. Educational Researcher, 38(3), 181-199.

https://doi.org/10.3102/0013189X08331140

Dietz-Uhler, B., & Hurn, J. (2013). Using learning analytics to predict (and improve) student

success: a faculty perspective. Journal of Interactive Online Learning, 12(1), 17-26.

Donovan, C. (2015). Measuring teachers' knowledge and use of data assessments: Creating a measure as a first step toward effective professional development (Publication No. 10017935) [Doctoral dissertation, University of Denver].

Donovan, C. (2018). Rasch analysis of the Teachers' Knowledge and Use of Data and Assessment (tKUDA) measure. Journal of Applied Measurement, 19(1), 76-92.

Earl, L., & Katz, S. (2006). Rethinking classroom assessment with purpose in mind: Assessment for learning, assessment as learning, assessment of learning. Manitoba Education, Citizenship and Youth.


Fan, Y., Wang, T., & Wang, K. (2011). A web-based model for developing assessment literacy of

secondary in-service teachers. Computers & Education, 57, 1727-1740.

https://doi.org/10.1016/j.compedu.2011.03.006

Fulmer, G., Lee, I., & Tan, K. (2015). Multi-level model of contextual factors and teachers’

assessment practices: an integrative review of research. Assessment in Education:

Principles, Policy & Practice, 22(4), 475-494.

https://doi.org/10.1080/0969594X.2015.1017445

Gareis, C. R., & Grant, L. W. (2015). Assessment literacy for teacher candidates: A

focused approach. Teacher Educators' Journal, 4-21.

Goldberg, G. L., & Roswell, B. S. (1998). Perception and practice: The impact of teachers'

scoring experience on performance-based instruction and classroom assessment. Paper

presented at the annual meeting of the American Educational Research Association, San

Diego. (ERIC Document Number ED 420 670)

Great Schools Partnership. (2013). Professional development. The Glossary of Education Reform. https://www.edglossary.org/professional-development/

Gutierrez, S. L. (2014). From National Standards to classrooms: A case study of middle level teachers' assessment knowledge and practices [Doctoral dissertation, Western Michigan University]. https://scholarworks.wmich.edu/dissertations/245

Hahn, K. R. (2010). Inclusion of students with disabilities: Preparation and practices of music educators (Publication No. 3420149) [Doctoral dissertation]. ProQuest Dissertations & Theses A&I.


Hanzlik, T. (2001). An examination of Iowa high school instrumental band directors' assessment practices and attitudes toward assessment (Publication No. 3009721) [Doctoral dissertation]. University of Nebraska, Lincoln.

Heritage, M. (2007). Formative assessment: What do teachers need to know and do? Phi Delta Kappan, 89(2), 140-145.

Herman, J. L., & Baker, E. L. (2009). Assessment policy: Making sense of the babel. In G. Sykes, B. Schneider, & D. N. Plank (Eds.), Handbook of education policy research (pp. 176-190). Routledge.

Hidri, S. (2015). Conceptions of assessment: Investigating what assessment means to secondary

and university teachers. Arab Journal of Applied Linguistics, 1(1), 19-43.

Hill, K. W. (1999). A descriptive study of assessment procedures, assessment attitudes, and grading policies in selected public high school band performance classrooms in Mississippi (Publication No. 9935693) [Doctoral dissertation]. The University of Southern Mississippi. ProQuest Dissertations and Theses A&I.

Hoover, N. R., & Abrams, L. M. (2013). Teachers’ instructional use of summative

student assessment data. Applied Measurement in Education, 26(3), 219–231.

Hourigan, R. M. (2008). Teaching strategies for performers with special needs. Teaching

Music, 15, 26-29.

Isbell, D. (2008). Musicians and teachers: The socialization and occupational identity of

preservice music teachers. Journal of Research in Music Education, 56(2), 162-178.

https://doi.org/10.1177/0022429408322853

Jamil, F. & Hamre, B. (2018). Teacher reflection in the context of an online professional

development course: applying principles of cognitive science to promote teacher


learning. Action in Teacher Education, 40(2), 220-236.

https://doi.org/10.1080/01626620.2018.1424051

Johnson, E. (2014). Preservice music teachers’ occupational identity in a beginning conducting

course. Journal of Education and Training Studies, 2(3).

https://doi.org/10.11114/jets.v2i3.422

Kancianic, P. M. (2006). Classroom assessment in U.S. high school band programs: Methods, purposes, and influences (Publication No. 3222315) [Doctoral dissertation]. University of Maryland, College Park. ProQuest Dissertations and Theses A&I.

Koerner, B. D. (2017). Beginning music teacher mentoring: Impact on reflective practice,

teaching efficacy, and professional commitment (Publication No. 10642603) [Doctoral

dissertation, University of Colorado Boulder]. ProQuest Dissertations and Theses Global.

Kotora, E. J. (2005). Assessment practices in the choral music classroom: A survey of Ohio high

school choral music teachers and college choral methods professors. Contributions to

Music Education, 32, 65–80.

Koutsoupidou, T. (2014). Online distance learning and music training: Benefits, drawbacks, and

challenges. Open Learning: The Journal of Open, Distance and e-Learning, 29(3), 243-

255. https://doi.org/10.1080/02680513.2015.1011112

LaCognata, J. P. (2010). Current student assessment practices of high school band directors (Publication No. 3436343) [Doctoral dissertation]. University of Florida, Gainesville.

LaCognata, J. P. (2013). Current student assessment practices of high school band directors in the United States. In T. Brophy & A. Lehmann-Wermser (Eds.), Music assessment across cultures and continents: The culture of shared practice (pp. 109-128). GIA.

Leong, W. (2014). Understanding classroom assessment in dilemmatic spaces: Case studies of Singaporean music teachers’ conceptions of classroom assessment. Music Education Research, 16(4), 454-470. https://doi.org/10.1080/14613808.2013.878325

Ludwig, N. (2013). Exploring the relationship between K-12 public school teachers’ conceptions of assessment and their classroom assessment confidence levels (Publication No. 3579798) [Doctoral dissertation]. Regent University. ProQuest Dissertations and Theses A&I.

Mantie, R. (2012). Band and/as music education: Antinomies and the struggle for legitimacy. Philosophy of Music Education Review, 20(1), 63-81. https://doi.org/10.2979/philmusieducrevi.20.1.63

May, B. N., Willie, K., Worthen, C., & Pehrson, A. (2017). An analysis of state music education

certification and licensure practices in the United States. Journal of Music Teacher

Education, 27(1), 65-88. https://doi.org/10.1177/1057083717699650

McClung, A. C. (1996). A descriptive study of learning assessment and grading practices in the high school choral music performance classroom (Publication No. 9700217) [Doctoral dissertation]. The Florida State University, Tallahassee. ProQuest Dissertations and Theses A&I.

McConnell, T., Parker, J., Eberhardt, J., Koehler, M., & Lundeberg, M. (2012). Virtual

professional learning communities: Teachers’ perceptions of virtual versus face-to-face

professional development. Journal of Science Education and Technology, 22, 267-277.

https://doi.org/10.1007/s10956-012-9391-y

McMillan, J. H. (2001). Secondary teachers’ classroom assessment and grading practices.

Educational Measurement: Issues and Practice, 20, 20–32.

McMillan, J. H., & Nash, S. (2000, April). Teacher classroom assessment and grading practice

decision making. Paper presented at the annual meeting of the National Council on


Measurement in Education, New Orleans.

McQuarrie, S. H., & Sherwin, R. G. (2013). Assessment in music education: Relationships between classroom practice and professional publication topics. Research & Issues in Music Education, 11(1).

Mertler, C. (2000). Teacher-centered fallacies of classroom assessment validity and reliability.

Mid-Western Educational Researcher, 13(4), 29-35.

Mertler, C. (2004). Secondary teachers' assessment literacy: Does classroom

experience make a difference? American Secondary Education, 33(1), 49-64.

Mertler, C. (2009). Teachers’ assessment knowledge and their perceptions of the impact of

classroom assessment professional development. Improving Schools, 12(1), 101–113.

Mertler, C. A., & Campbell, C. S. (2005). Measuring teachers' knowledge and application of classroom assessment concepts: Development of the Assessment Literacy Inventory. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Montgomery, A., Mousavi, A., Carbonaro, M., Hayward, D., & Dunn, W. (2019). Using learning

analytics to explore self-regulated learning in flipped blended learning music teacher

education. British Journal of Educational Technology, 50(1), 114-127.

https://doi.org/10.1111/bjet.12590

National Education Association. (2019). Professional development. http://www.nea.org/home/30998.htm

Nierman, G. E., & Colwell, R. (2019). Perspectives from North America. In T. S. Brophy (Ed.), The Oxford handbook of assessment policy and practice in music education (pp. 173-196). Oxford University Press.


Nyberg, J. (2016). You are seldom born with a drum kit in your hands: Music teachers'

conceptualizations of knowledge and learning within music education as an assessment

practice. Systemic Practice and Action Research, 29, 235-259.

https://doi.org/10.1007/s11213-015-9352-3

Oosterhof, A. (1995). An extended observation of assessment procedures used by selected public school teachers. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. (ERIC Document Number ED 390 937)

Opre, D. (2015). Teachers’ conceptions of assessment. Procedia: Social and Behavioral Sciences, 209, 229-233.

Pastore, S., & Pentassuglia, M. (2016). Teachers’ and students’ conceptions of assessment within

the Italian higher education system. Practitioner Research in Higher Education, 10(1),

109-120.

Payne, P., Burrack, F., Parkes, K., & Wesolowski, B. (2019). An emerging process of assessment

in music education. Music Educators Journal, 105(3), 36-44.

https://doi.org/10.1177/0027432118818880

Pellegrino, K., Conway, C., & Russell, J. (2015). Assessment in performance-based secondary

music classes. Music Educators Journal, 102(1), 48-55.

https://doi.org/10.1177/0027432115590183

Perry, M. L. (2013). Teacher and principal assessment literacy (Publication No. 3568118) [Doctoral dissertation]. University of Montana. ProQuest Dissertations and Theses A&I.

Pike, P. (2017). Improving music teaching and learning through online service: a case study of a


synchronous online teaching internship. International Journal of Music Education, 35(1),

107-117. https://doi.org/10.1177/0255761415613534

Pishghadam, R., Adamson, B., Sadafian, S., & Kan, F. (2014). Conceptions of assessment and

teacher burnout. Assessment in Education: Principles, Policy & Practice, 21(1), 34-51.

https://doi.org/10.1080/0969594X.2013.817382

Popham, W. J. (2009). Assessment literacy for teachers: Faddish or fundamental? Theory

into Practice, 48(1), 4-11.

Popham, W. J. (2011). Assessment literacy overlooked: A teacher educator’s confession. The Teacher Educator, 46(4), 265-273. https://doi.org/10.1080/08878730.2011.605048

Prichard, S. (2018). A profile of high-stakes assessment practices in music teacher education.

Journal of Music Teacher Education, 27(3), 94-105.

https://doi.org/10.1177/1057083717750079

Remesal, A. (2010). Primary and secondary teachers’ conceptions of assessment: a qualitative

study. Teaching and Teacher Education, 27, 472-482.

https://doi.org/10.1016/j.tate.2010.09.017

Richerme, L.K. (2016). Measuring music education: A philosophical investigation of the model

cornerstone assessments. Journal of Research in Music Education, 64(3), 274-293.

https://doi.org/10.1177/0022429416659250

Rowan, B., & Correnti, R. (2009). Studying reading instruction with teacher logs: Lessons from

the study of instructional improvement. Educational Researcher, 38(2), 120-131.

Russell, J. (2011). Assessment and case law: Implications for the grading practices of music

educators. Music Educators Journal, 97(3), 35-39.

https://doi.org/10.1177/0027432110392051


Russell, J. A. (2018). Statistics in music education research. Oxford University Press.

Russell, J. A. & Austin, J. R. (2010). Assessment practices of secondary music teachers. Journal

of Research in Music Education, 58(1), 37-54.

https://doi.org/10.1177/0022429409360062

Ryan, K. A. (2018). An investigation of pre-service teacher assessment literacy and assessment confidence: Measure development and edTPA performance (Publication No. 10871606) [Doctoral dissertation]. Kent State University. ProQuest Dissertations and Theses A&I.

Sears, M. (2002). Assessment in the instrumental music classroom: Middle school methods & materials (Publication No. 1409387) [Master’s thesis]. University of Massachusetts. ProQuest Dissertations and Theses A&I.

Sherman, C. (2006). A study of current strategies and practices in the assessment of individuals in high school bands (Publication No. 3237089) [Doctoral dissertation]. Teachers College, Columbia University. ProQuest Dissertations and Theses A&I.

Siegel, M. A., & Wissehr, C. (2011). Preparing for the plunge: Preservice teachers’

assessment literacy. Journal of Science Teacher Education, 22, 371–391.

Simanton, E. (2000). Assessment and grading practices among high school band teachers in the United States: A descriptive study (Publication No. 9986536) [Doctoral dissertation]. University of North Dakota. ProQuest Dissertations and Theses A&I.

Stake, R. E. (2010). Qualitative research: Studying how things work. Guilford Press.

Stiggins, R. (1991). Assessment literacy. The Phi Delta Kappan, 72(7), 534–539.

Stiggins, R. (2002). Assessment crisis: the absence of assessment for learning. The Phi Delta

Kappan, 83(10), 758-765. https://doi.org/10.1177/003172170208301010


Stiggins, R. (2004). New assessment beliefs for a new school mission. The Phi Delta Kappan,

86(1), 22-27. https://doi.org/10.1177/003172170408600106

Stiggins, R. (2005). From formative assessment to assessment for learning: a path to success in

standards-based schools. The Phi Delta Kappan, 87(4), 324-328.

https://doi.org/10.1177/003172170508700414

Stiggins, R. (2014). Improve assessment literacy outside of schools, too. The Phi Delta Kappan,

96(2), 67-72. https://doi.org/10.1177/0031721714553413

St. Pierre, N., & Wuttke, B. (2017). Standards-based grading practices among practicing music educators: Prevalence and rationale. Update: Applications of Research in Music Education, 35(2), 30-37. https://doi.org/10.1177/8755123315604468

Thompson, N. (2010). KR-20. In N. J. Salkind (Ed.), Encyclopedia of research design (p. 668). SAGE Publications, Inc. https://doi.org/10.4135/9781412961288.n205

UNESCO. (2020). COVID-19 impact on education.

https://en.unesco.org/covid19/educationresponse

vanOostveen, R., Desjardins, F., & Bullock, S. (2019). Professional development learning

environments (PDLEs) embedded in a collaborative online learning environment

(COLE): Moving towards a new conception of online professional learning. Education and Information Technologies, 24, 1863-1900. https://doi.org/10.1007/s10639-018-9686-6

Walls, K. (2008). Distance learning in graduate music teacher education: Promoting professional

development and satisfaction of music teachers. Journal of Music Teacher Education,

18(1), 55-66. https://doi.org/10.1177/1057083708323137

Wasserman, E., & Migdal, R. (2019). Professional development: Teachers’ attitudes in online and

traditional training courses. Online Learning Journal, 23(1), 132-143.


https://doi.org/10.24059/olj.v23i1.1299

Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Association for Supervision and Curriculum Development.

Willis, J., Adie, L., & Klenowski, V. (2013). Conceptualising teachers' assessment literacies in

an era of curriculum and assessment reform. Australian Educational Researcher, 40(2),

241-256.

Yurkofsky, M., Blum-Smith, S., & Brennan, K. (2019). Expanding outcomes: Exploring varied

conceptions of teacher learning in an online professional development experience.

Teaching and Teacher Education, 82, 1-13. https://doi.org/10.1016/j.tate.2019.03.002

Zepeda, S. J. (2019). Professional development: What works (3rd ed.). http://ebookcentral.proquest.com

Zhang, Z. (1996). Teacher assessment competency: A Rasch model analysis. Paper presented at

the annual meeting of the American Educational Research Association, New York.

(ERIC Document Number ED 400 322)


Appendix A

IRB Approval Documentation

[The IRB approval letter appeared here in the original document.]

Appendix B

Standards for Teacher Competence in Educational Assessment of Students (STCEAS)

Standards & Corresponding CALI Items

1. Teachers should be skilled in choosing assessment methods appropriate for instructional

decisions.

Items: 1, 2, 3, 4, 5

2. Teachers should be skilled in developing assessment methods appropriate for

instructional decisions.

Items: 6, 7, 8, 9, 10

3. The teacher should be skilled in administering, scoring and interpreting the results of

both externally-produced and teacher-produced assessment methods.

Items: 11, 12, 13, 14, 15

4. Teachers should be skilled in using assessment results when making decisions about

individual students, planning teaching, developing curriculum, and school

improvement.

Items: 16, 17, 18, 19, 20

5. Teachers should be skilled in developing valid pupil grading procedures which use

pupil assessments.

6. Teachers should be skilled in communicating assessment results to students, parents,

other lay audiences, and other educators.

7. Teachers should be skilled in recognizing unethical, illegal, and otherwise

inappropriate assessment methods and uses of assessment information.

*Standards 1-4 and their corresponding items (listed above) were used in the measure for this study.


Appendix C

Original CALI Instrument

by Dr. Craig Mertler,

Bowling Green State University

[Adapted from the Teacher Assessment Literacy Questionnaire (1993), by Barbara S. Plake &

James C. Impara, University of Nebraska-Lincoln, in cooperation with The National Council on

Measurement in Education & the W.K. Kellogg Foundation]

Directions: Please read each item carefully and select the response you think is the best one by

shading the corresponding circle. Even if you are not sure of your choice, but you think you

know which is best, mark that response.

PART I

1. What is the most important consideration in choosing a method for assessing student

achievement?

❏ The ease of scoring the assessment.

❏ The ease of preparing the assessment.

❏ The accuracy of assessing whether or not instructional objectives were attained.

❏ The acceptance by the school administration.

2. When scores from a standardized test are said to be “reliable,” what does it imply?

❏ Student scores from the test can be used for a large number of educational decisions.

❏ If a student retook the same test, he or she would get a similar score on each retake.

❏ The test score is a more valid measure than teacher judgments.

❏ The test score accurately reflects the content of what was taught.

3. Mrs. Bruce wished to assess her students' understanding of the method of problem solving she

had been teaching. Which assessment strategy below would be most valid?

❏ Select a textbook that has a "teacher's guide" with a test developed by the authors.

❏ Develop an assessment consistent with an outline of what she has actually taught in the

class.

❏ Select a standardized test that provides a score on problem solving skills.

❏ Select an instrument that measures students' attitudes about problem solving strategies.

4. What is the most effective use a teacher can make of an assessment that requires students to

show their work (e.g., the way they arrived at a solution to a problem or the logic used to arrive

at a conclusion)?

❏ Assigning grades for a unit of instruction on problem solving.

❏ Providing instructional feedback to individual students.

❏ Motivating students to attempt innovative ways to solve problems.

❏ None of the above.


5. Ms. Green, the principal, was evaluating the teaching performance of Mr. Williams, the fourth

grade teacher. One of the things Ms. Green wanted to learn was if the students were being

encouraged to use higher order thinking skills in the class. What documentation would be the

most valid to help Ms. Green to make this decision?

❏ Mr. Williams’ lesson plans.

❏ The state curriculum guides for fourth grade.

❏ Copies of Mr. Williams’ unit tests or assessment strategies used to assign grades.

❏ Worksheets completed by Mr. Williams’ students, but not used for grading.

6. A teacher wants to document the validity of the scores from a classroom assessment strategy

she plans to use for assigning grades on a class unit. What kind of information would provide the

best evidence for this purpose?

❏ Have other teachers judge whether the assessment strategy covers what was taught.

❏ Match an outline of the instructional content to the content of the actual assessment.

❏ Let students in the class indicate if they thought the assessment was valid.

❏ Ask parents if the assessment reflects important learning outcomes.

7. Which of the following would most likely increase the reliability of Mrs. Lockwood's multiple

choice end-of-unit examination in physical science?

❏ Use a blueprint to develop the test questions.

❏ Change the test format to true-false questions.

❏ Add more items like those already on the test.

❏ Add an essay component.

8. Ms. Gregory wants to assess her students' skills in organizing ideas rather than just repeating

facts. Which words should she use in formulating essay exercises to achieve this goal?

❏ compare, contrast, criticize

❏ identify, specify, list

❏ order, match, select

❏ define, recall, restate

9. Mr. Woodruff wanted his students to appreciate the literary works of Edgar Allan Poe. Which

of his test items shown below will best measure his instructional goal?

❏ "Spoke the raven, nevermore." comes from which of Poe's works?

❏ True or False: Poe was an orphan and never knew his biological parents.

❏ Edgar Allan Poe wrote: 1. Novels 2. Short stories 3. Poems 4. All of the above.

❏ Discuss briefly your view of Poe's contribution to American literature.

10. Several students in Ms. Atwell's class received low scores on her end-of-unit test covering

multi-step story problems in mathematics. She wanted to know which students were having

similar problems so she could group them for instruction. Which assessment strategy would be

best for her to use for grouping students?

❏ Use the test provided in the "teacher's guide."


❏ Have the students take a test that has separate items for each step of the process.

❏ Look at the students' records and standardized test scores to see which topics the students

had not performed well on previously.

❏ Give students story problems to complete and have them show their work.

11. Many teachers score classroom tests using a 100-point percent correct scale. In general, what

does a student's score of 90 on such a scale mean?

❏ The student answered 90% of the items on this test correctly.

❏ The student knows 90% of the instructional content of the unit covered by this test.

❏ The student scored higher than 90% of all the students who took the test.

❏ The student scored 90% higher than the average student in the class.

12. Students in Mr. Jakman's science class are required to develop a model of the solar system as

part of their end-of-unit grade. Which scoring procedure below will maximize the objectivity of

assessing these student projects?

❏ When the models are turned in, Mr. Jakman identifies the most attractive models and

gives them the highest grades, the next most attractive get a lower grade and so on.

❏ Mr. Jakman asks other teachers in the building to rate each project on a 5-point scale

based on their quality.

❏ Before the projects are turned in, Mr. Jakman constructs a scoring key based on the

critical features of the projects as identified by the highest performing students in the

class.

❏ Before the projects are turned in, Mr. Jakman prepares a model or blueprint of the

critical features of the product and assigns scoring weights to these features. The models

with the highest scores receive the highest grade.

13. At the close of the first month of school, Mrs. Friend gives her fifth grade students a test she

developed in social studies. Her test is modeled after a standardized social studies test. It presents

passages and then asks questions related to understanding and problem definition. When the test

was scored, she noticed that two of her students—who had been performing well in their class

assignments—scored much lower than other students. Which of the following types of additional

information would be most helpful in interpreting the results of this test?

❏ The gender of the students.

❏ The age of the students.

❏ Reliability data for the standardized social studies test she used as the model.

❏ Reading comprehension scores for the students.

14. Frank, a beginning fifth grader, received a G. E. (grade equivalent score) of 8.0 on the

Reading Comprehension subtest of a standardized test. This score should be interpreted to mean

that Frank

❏ can read and understand 8th grade reading level material.

❏ scored as well as a typical beginning 8th grader scored on this test.

❏ is performing in Reading Comprehension at the 8th grade level.


❏ will probably reach maximum performance in Reading Comprehension at the beginning

of the 8th grade.

15. When the directions indicate each section of a standardized test is timed separately, which of

the following is acceptable test-taking behavior?

❏ John finishes the vocabulary section early; he then rechecks many of his answers in that

section.

❏ Mary finishes the vocabulary section early; she checks her answers on the previous test

section.

❏ Jane finishes the vocabulary section early; she looks ahead at the next test section but

does not mark her answer sheet for any of those items.

❏ Bob did not finish the vocabulary section; he continues to work on that section when the

testing time is up.

16. Ms. Camp is starting a new semester with a factoring unit in her Algebra I class. Before

beginning the unit, she gives her students a test on the commutative, associative, and distributive

properties of addition and multiplication. Which of the following is the most likely reason she

gives this test to her students?

❏ The principal needs to report the results of this assessment to the state testing director.

❏ Ms. Camp wants to give the students practice in taking tests early in the semester.

❏ Ms. Camp wants to check for prerequisite knowledge in her students before she begins

the unit on factoring.

❏ Ms. Camp wants to measure growth in student achievement of these concepts, and scores

on this test will serve as the students' knowledge baseline.

17. To evaluate the effectiveness of the mathematics program for her gifted first graders, Ms.

Allen gave them a standardized mathematics test normed for third graders. To decide how well

her students performed, Ms. Allen compared her students' scores to those of the third-grade norm

group. Why is this an incorrect application of standardized test norms?

❏ The norms are not reliable for first graders.

❏ The norms are not valid for first graders.

❏ Third grade mathematics items are too difficult for first graders.

❏ The time limits are too short for first graders.

18. When planning classroom instruction for a unit on arithmetic operations with fractions,

which of these types of information have more potential to be helpful?

norm-referenced information: describes each student's performance relative to other students in

a group (e.g., percentile ranks, stanines), or

criterion-referenced information: describes each student's performance in terms of status on

specific learning outcomes (e.g., number of items correctly answered for each specific objective)

❏ Norm-referenced information.

❏ Criterion-referenced information.

❏ Both types of information are equally useful in helping to plan for instruction.

❏ Neither, test information is not useful in helping to plan instruction.


19. Students' scores on standardized tests are sometimes inconsistent with their performances on

classroom assessments (e.g., teacher tests or other in-class activities). Which of the following is

not a reasonable explanation for such discrepancies?

❏ Some students freeze up on standardized tests, but they do fine on classroom

assessments.

❏ Students often take standardized tests less seriously than they take classroom

assessments.

❏ Standardized tests measure only recall of information while classroom assessments

measure more complex thinking.

❏ Standardized tests may have less curriculum validity than classroom assessment.

20. Elementary school teachers in the Baker School system collectively designed and developed

new curricula in Reading, Mathematics, and Science that is based on locally developed

objectives and objectives in state curriculum guides. The new curricula were not matched

directly to the content of the fourth grade standardized test. A newspaper reports that fourth grade students in Baker Public Schools scored among the lowest in the State Assessment

Program. Which of the following would invalidate the comparison between Baker Public

Schools and other schools in the state?

❏ The curriculum objectives of the other districts may more closely match those of the

State Assessment.

❏ Other school systems did not design their curriculum to be consistent with the State

Assessment test.

❏ Instruction in Baker schools is poor.

❏ Other school systems have different promotion policies than Baker.

21. Which of the following choices typically provides the most reliable student-performance

information that a teacher might consider when assigning a unit grade?

❏ Scores from a teacher-made test containing two or three essay questions related directly

to instructional objectives of the unit.

❏ Scores from a teacher-made 20 item multiple-choice test designed to measure the

specific instructional objectives of the unit.

❏ Oral responses to questions asked in class of each student over the course of the unit.

❏ Daily grades designed to indicate the quality of in-class participation during regular

instruction.

22. A teacher gave three tests during a grading period and she wants to weight them all equally

when assigning grades. The goal of the grading program is to rank order students on

achievement. In order to achieve this goal, which of the following should be closest to equal?

❏ Number of items.

❏ Number of students taking each test.

❏ Average scores.

❏ Variation (range) of scores.


23. When a parent asks a teacher to explain the basis for his or her child's grade, the teacher

should

❏ explain that the grades are assigned fairly, based on the student's performance and other

related factors.

❏ ask the parents what they think should be the basis for the child's grade.

❏ explain exactly how the grade was determined and show the parent samples of the

student's work.

❏ indicate that the grading scale is imposed by the school board and the teachers have no

control over grades.

24. Which of the following grading practices results in a grade that least reflects students'

achievement?

❏ Mr. Jones requires students to turn in homework; however, he only grades the odd

numbered items.

❏ Mrs. Brown uses weekly quizzes and three major examinations to assign final grades in

her class.

❏ Ms. Smith permits students to redo their assignments several times if they need more

opportunities to meet her standards for grades.

❏ Miss Engle deducts 5 points from a student's test grade for disruptive behavior.

25. During the most recent grading period, Ms. Johnson graded no homework and gave only one

end-of-unit test. Grades were assigned only on the basis of the test. Which of the following is the

major criticism regarding how she assigned the grades?

❏ The grades probably reflect a bias against minority students that exists in most tests.

❏ Decisions like grade assignment should be based on more than one piece of information.

❏ The test was too narrow in curriculum focus.

❏ There is no significant criticism of this method providing the test covered the unit's

content.

26. In a routine conference with Mary's parents, Mrs. Estes observed that Mary's scores on the

state assessment program's quantitative reasoning tests indicate Mary is performing better in

mathematics concepts than in mathematics computation. This probably means that

❏ Mary's score on the computation test was below average.

❏ Mary is an excellent student in mathematics concepts.

❏ the percentile bands for the mathematics concepts and computation tests do not overlap.

❏ the mathematics concepts test is a more valid measure of Mary's quantitative reasoning

ability.

27. Many states are revising their school accountability programs to help explain differences in

test scores across school systems. Which of the following is not something that needs to be

considered in such a program?

❏ The number of students in each school system.

❏ The average socio-economic status of the school systems.

❏ The race/ethnic distribution of students in each school system.


❏ The drop-out rate in each school system.

28. The following standardized test data are reported for John.

Subject -- Stanine Score

Vocabulary -- 7

Mathematics Computation -- 7

Social Studies -- 7

Which of the following is a valid interpretation of this score report?

❏ John answered correctly the same number of items on each of the three tests.

❏ John's test scores are equivalent to a typical seventh grader's test performance.

❏ John had the same percentile rank on the three tests.

❏ John scored above average on each of the three tests.

29. Mr. Klein bases his students' grades mostly on graded homework and tests. Mr. Kaplan bases

his students' grades mostly on his observation of the students during class. A major difference in

these two assessment strategies for assigning grades can best be summarized as a difference in

❏ formal and informal assessment.

❏ performance and applied assessment.

❏ customized and tailored assessment.

❏ formative and summative assessment.

30. John scored at the 60th percentile on a mathematics concepts test and scored at the 57th

percentile on a test of reading comprehension. If the percentile bands for each test are five

percentile ranks wide, what should John's teacher do in light of these test results?

❏ Ignore this difference.

❏ Provide John with individual help in reading.

❏ Motivate John to read more extensively outside of school.

❏ Provide enrichment experiences for John in mathematics, his better performance area.

31. In some states testing companies are required to release items from prior versions of a test to

anyone who requests them. Such requirements are known as

❏ open-testing mandates.

❏ gag rules.

❏ freedom-of-information acts.

❏ truth-in-testing laws.

32. Mrs. Brown wants to let her students know how they did on their test as quickly as possible.

She tells her students that their scored tests will be on a chair outside of her room immediately

after school. The students may come by and pick out their graded test from among the other tests

for their class. What is wrong with Mrs. Brown's action?

❏ The students can see the other students' graded tests, making it a violation of the

students' right of privacy.

❏ The students have to wait until after school, so the action is unfair to students who have

to leave immediately after school.


❏ Mrs. Brown will have to rush to get the tests graded by the end of the school day, hence,

the action prevents her from using the test to identify students who need special help.

❏ The students who were absent will have an unfair advantage, because her action allows

the possibility for these students to cheat.

33. A state uses its statewide testing program as a basis for distributing resources to school

systems. To establish an equitable distribution plan, the criterion set by the State Board of

Education provides additional resources to every school system with student achievement test

scores above the state average. Which cliché best describes the likely outcome of this regulation?

❏ Every cloud has its silver lining.

❏ Into each life some rain must fall.

❏ The rich get richer and the poor get poorer.

❏ A bird in the hand is worth two in the bush.

34. In a school where teacher evaluations are based in part on their students' scores on a

standardized test, several teachers noted that one of their students did not reach some vocabulary

items on a standardized test. Which teacher's action is considered ethical?

❏ Mr. Jackson darkened circles on the answer sheet at random. He assumed Fred, who was

not a good student, would just guess at the answers, so this would be a fair way to obtain

Fred's score on the test.

❏ Mr. Hoover filled in the answer sheet the way he thought Joan, who was not feeling well,

would have answered based on Joan's typical in-class performance.

❏ Mr. Stover turned in the answer sheet as it was, even though he thought George, an

average student, might have gotten a higher score had he finished the test.

❏ Mr. Lund read each question and darkened in the bubbles on the answer sheet that

represented what he believed Felicia, a slightly below average student, would select as

the correct answers.

35. Mrs. Overton was concerned that her students would not do well on the State Assessment

Program to be administered in the Spring. She got a copy of the standardized test form that was

going to be used. She did each of the following activities to help increase scores. Which activity

was unethical?

❏ Instructed students in strategies on taking multiple choice tests, including how to use

answer sheets.

❏ Gave students the items from an alternate form of the test.

❏ Planned instruction to focus on the concepts covered in the test.

❏ None of these actions are unethical.

PART II

36. What is your gender?

❏ female

❏ male


37. Which of the following is the most appropriate description of the level at which you teach?

❏ elementary – primary (K – grade 3)

❏ elementary – intermediate (grades 4 – 6)

❏ elementary (K – 6)

❏ middle (grades 6 – 8)

❏ high (grades 9 – 12)

❏ secondary (grades 6 – 12)

❏ K – 12

❏ other

38. Which best describes the educational level you have attained?

❏ B.A. or B.S.

❏ M.A. or M.S.

❏ Specialist

❏ Ed.D.

❏ Ph.D.

39. Including the current year, how many years of experience do you have as a classroom

teacher?

❏ 1 – 5 years

❏ 6 – 10 years

❏ 11 – 15 years

❏ 16 – 20 years

❏ 21 – 25 years

❏ 26 – 30 years

❏ more than 30 years

40. To the best of your knowledge, did you take a standalone course in classroom assessment as

part of your undergraduate teacher preparation?

❏ yes

❏ no

41. Which of the following best describes your perception of the level of preparation for the

overall job of being a classroom teacher that resulted from your undergraduate teacher

preparation program?

❏ very unprepared

❏ somewhat unprepared

❏ somewhat prepared

❏ very prepared


42. Which of the following best describes your perception of the level of preparation for

assessing student performance that resulted from your undergraduate teacher preparation

program?

❏ very unprepared

❏ somewhat unprepared

❏ somewhat prepared

❏ very prepared


Appendix D

Intervention Module Design

Home Screen

Weekly Modules & Prompts

[Screenshots of the Google Classroom home screen and the weekly modules and prompts appeared here in the original document.]

Appendix E

Prescreening and Informed Consent Questionnaire & Informed Consent

Thank you for expressing interest in participating in this research study. The following

questionnaire includes three sections:

(1) A pre-screening question to ensure that you are eligible to participate in this study.

(2) A description of the study followed by informed consent documentation. If you are

interested in participating in this study, you must provide informed consent and an email address

that you access regularly.

(3) A series of demographic and background questions (11 items).

This questionnaire should take you no longer than five minutes to complete. At the end, you will

be notified about which condition you have been randomly assigned to (i.e., the intervention

group or the control group). If you are randomly assigned to the intervention group, you will also

receive information about how to register for the four-week online professional development. If

you are assigned to the control group but would still like to receive the four-week online

professional development, I will offer the course again after the data collection period has ended

for this study upon request.

If you have any questions or concerns, please contact me via email at

[email protected] or via phone at (443) 235-0957.

***

Q8 Are you currently a music teacher in a PK-12 classroom in the United States?

● Yes (1)

● No (2)

[Skip To: End of Block If Are you currently a music teacher in a PK-12 classroom in the United

States? = Yes]

Unfortunately you are not eligible to participate in this study at this time. Thank you for your

interest, and best wishes to you in your future teaching endeavors. You may exit out of this

window at any time.

[Skip To: End of Survey If Unfortunately you are not eligible to participate in this study at this

time. Thank you for your... Is Displayed]


***

Q32 You are eligible to participate in this study.

Please continue to the next screen to read a description of the study and provide informed

consent.

***

Q7 Permission to Take Part in a Human Research Study

Title of research study: Music Teachers’ Assessment Literacy, Beliefs, & Practices: A Mixed Methods Intervention Study

IRB Protocol Number: 20-0054

Investigator: Jocelyn W. Armes

Purpose of the Study

The purpose of the study is to examine the effectiveness of an online professional development intervention for music teachers in changing assessment literacy, beliefs, and practices. A second purpose of this study is to explore music teachers’ beliefs about assessment. Although researchers in general education have examined assessment literacy, beliefs, and practices separately, no one, to date, has examined these concepts at the same time. In addition, no one has used an intervention to enhance music teachers’ assessment literacy, beliefs, and practices. Very little is known about music teachers’ assessment literacy and beliefs; however, there is reason to suspect that music teachers’ assessment practices and beliefs are informed by conflicting beliefs about assessment and external expectations.

I expect that you will be in this research study for six weeks, from March 8, 2020 until April 13, 2020. I expect about 200 people will be in this research study.

Explanation of Procedures

If you decide to participate in this study, you can expect the following: Because this is an intervention study, and I am trying to measure the change in music teachers’ assessment literacy, beliefs, and practices, you will be randomly assigned to either the intervention group or a control group. Your group will be chosen by chance, like flipping a coin. You will have an equal chance of being assigned to either group. Both groups will take a questionnaire-type survey with questions related to assessment knowledge, practices, and beliefs.

Intervention Group. If you are in the intervention group, you will be sent an additional email with information about how to register for the online assessment workshop. The online assessment workshop will take place in a virtual classroom via Google Classroom. You will have access to the course from March 8, 2020 until April 13, 2020. Each week you will be asked to complete a module. Each module is expected to take about two hours to complete. In each module, you will read materials about assessment. Next, you will complete a discussion activity with other participants in the intervention. Then, you will complete a task related to designing an assessment for your classroom. Modules will stay open for the entire course. However, you should complete the modules in order, and strive to complete every module within a week.

Control Group. For this study, the control group is a true control; that is, the participants in the control group will not take part in the online professional development. If you, however, would like to receive the online professional development after the experiment is over, you will be permitted to do so, upon request. At the conclusion of the four-week online assessment workshop, both groups will be sent a final email with the link to the final questionnaire. This will help me measure any changes in music teachers’ assessment knowledge, practices, and beliefs.

Voluntary Participation and Withdrawal

Whether or not you take part in this research is your choice. You can leave the research at any time and it will not be held against you.

Potential Benefits

We cannot promise any benefits to you or others from your taking part in this research. However, you may find that participating in online professional development about assessment increases your knowledge and skill in this area of your teaching practice. You may also find that collaborating with your music teacher peers provides you with fresh perspectives related to assessment. Finally, your students may benefit from any increase in your knowledge or skill in this area.

Confidentiality

Information obtained about and from you for this study will be kept confidential to the extent allowed by law. Research information that identifies you may be shared with the University of Colorado Boulder Institutional Review Board (IRB) and others who are responsible for ensuring compliance with laws and regulations related to research, including people on behalf of the Office for Human Research Protections. The information from this research may be published for scientific purposes; however, your identity will not be given out.

Payment for Participation

You will not be paid to be in this study.

Questions

If you have questions, concerns, or complaints, or think the research has hurt you, you can contact me at [email protected], or my advisor, James Austin, at [email protected]. This research has been reviewed and approved by an IRB. You may talk to them at (303) 735-3702 or [email protected] if:

• Your questions, concerns, or complaints are not being answered by the research team.

• You cannot reach the research team.

• You want to talk to someone besides the research team.

• You have questions about your rights as a research subject.

• You want to get information or provide input about this research.

Q3 Informed Consent Documentation

I have read through the invitation to participate in this intervention study, am aware of the

potential risks and benefits of participation, and understand that being in this study is voluntary

and that my responses are confidential and private.

● I agree to participate

● I do not agree to participate

Q5 By providing your email in the space below, you acknowledge that (a) you have read through

the invitation to participate in this intervention study, (b) you are aware of the potential risks and

benefits of participation, and (c) you understand that being in this study is voluntary and that

your responses will be kept confidential and private.

________________________________________________________________

***

Demographic Information

Please provide the basic demographic information requested below. This information will be

kept confidential. You will not be personally identifiable in any documents or reports generated

from this study.


Q15 With what gender do you identify?

● Female (1)

● Male (2)

● Trans or Nonbinary (3)

Q16 What is your race?

● Caucasian or Non-Hispanic (1)

● Black or African American (2)

● Hispanic or Latinx (3)

● American Indian or Alaska Native (4)

● Asian (5)

● Native American or Pacific Islander (6)

● Biracial or multi-racial (7)

Q17 Which of the following grade levels do you teach? (Check all that apply.)

❏ Pre-Kindergarten (1)

❏ Kindergarten (5)

❏ 1st (6)

❏ 2nd (7)

❏ 3rd (8)

❏ 4th (9)

❏ 5th (10)

❏ 6th (11)

❏ 7th (12)

❏ 8th (13)

❏ 9th (14)

❏ 10th (15)

❏ 11th (16)

❏ 12th (17)

Q18 What courses do you teach? (Check all that apply.)

❏ Chorus/Vocal (1)

❏ Instrumental Band (2)

❏ Instrumental Orchestra (3)

❏ Instrumental Other (Jazz, marching, guitar, etc.) (4)

❏ General Music (5)

❏ Music Appreciation (6)

❏ Music Theory (7)

❏ Visual & Performing Arts (8)


Q19 Which best describes the educational level you have attained?

● Bachelor's degree (1)

● Master's degree (2)

● Master's +30 credits (3)

● Doctoral degree (4)

Q20 Including the current year, how many years of experience do you have as a classroom

teacher? (Answer as a number; e.g., 11.)

Q21 To the best of your knowledge, did you take a standalone course in classroom assessment as

part of your undergraduate teacher preparation?

● Yes (1)

● No (2)

Q22 Coming out of your undergraduate teacher preparation program, how prepared were you

for the job of being a music teacher?

● Very unprepared

● Unprepared

● Somewhat unprepared

● Somewhat prepared

● Prepared

● Very prepared

Q23 Coming out of your undergraduate teacher preparation program, how prepared were you to

assess students' learning?

● Very unprepared

● Unprepared

● Somewhat unprepared

● Somewhat prepared

● Prepared

● Very prepared

Q24 Have you ever taken a workshop (a few days or less) in which the only topic was

assessment?

● Yes (1)

● No (2)

Q25 Since completing your undergraduate degree, have you ever taken a course (a few weeks or

more) focused only on assessment?


● Yes (1)

● No (2)

***

[RANDOM ASSIGNMENT TO CONTROL OR INTERVENTION]

Q34 Thank you for your interest in participating in this study and completing the Informed

Consent documentation. You have been randomly assigned to the Control Group.

This means that you will not receive the online professional development from March 2020 -

April 2020. However, you will be asked to take a questionnaire about your assessment

knowledge, practices, and beliefs in March 2020 and April 2020. Your response to these

questionnaires is still important for comparison purposes.

If you would still like to receive the free online professional development in April 2020, please

select "Yes" below.

You should expect to receive a link to the study questionnaire from Jocelyn Armes, the Principal

Investigator, within an hour at the email address you provided. If you have further questions,

please contact Jocelyn Armes at [email protected] or (443) 235-0957.

● Yes, I would like to receive the free online professional development in April 2020, after

I complete the questionnaires in March and April. (4)

● No, I would not like to receive the free online professional development in April 2020,

after I complete the questionnaires in March and April. (5)

[OR]

Q35 Thank you for your interest in participating in this study and completing the Informed

Consent documentation. You have been randomly assigned to the Intervention Group.

This means that you will receive the online professional development from March 2020 - April

2020.

You should expect to receive a link to the study questionnaire from Jocelyn Armes, the Principal

Investigator, within an hour at the email address you provided. If you have further questions,

please contact Jocelyn Armes at [email protected] or (443) 235-0957.


Appendix F

Adapted Classroom Assessment Literacy Inventory (CALI)

Classroom Assessment Literacy Inventory

In this section, you will be asked 20 multiple choice questions about assessment. Choose the

best answer to each question. Even if you are not sure of your choice, mark that response.

[Adapted from the Classroom Assessment Literacy Inventory (2000), by C. Mertler, Bowling

Green State University]

***

What is the most important consideration in choosing a method for assessing student

achievement?

o The ease of scoring the assessment.

o The ease of preparing the assessment.

o The accuracy of assessing whether or not instructional objectives were attained.

o The acceptance by the school administration.

When scores from a standardized test are said to be “reliable,” what does it imply?

o Student scores from the test can be used for a large number of educational decisions.

o If a student retook the same test, he or she would get a similar score on each retake.

o The test score is a more valid measure than teacher judgments.

o The test score accurately reflects the content of what was taught.

Mrs. Bruce wished to assess her students' understanding of key signature identification. Which

assessment strategy below would be most valid?

o Select a music theory text that has a "teacher's guide" with a test developed by the

authors.

o Develop an assessment consistent with an outline of what she has taught in class.

o Select a standardized test that provides a score on problem solving skills.

o Select an instrument that measures students' attitudes about problem solving strategies.

What is the most effective use a teacher can make of an assessment that requires students to

show their work (e.g., the way they arrived at a solution to a problem or the logic used to arrive

at a conclusion)?

o Assigning grades for a unit of instruction on problem solving.

o Providing instructional feedback to individual students.

o Motivating students to attempt innovative ways to solve problems.

o None of the above.

Ms. Green, the principal, was evaluating the teaching performance of Mr. Williams, the

elementary general music teacher. One of the things Ms. Green wanted to learn was if the

students were being encouraged to use higher order thinking skills in the class. What

documentation would be the most valid to help Ms. Green to make this decision?


o Mr. Williams’ lesson plans.

o The state curriculum guides for fourth grade music.

o Copies of Mr. Williams’ unit tests or assessment strategies used to assign grades.

o Worksheets completed by Mr. Williams’ students, but not used for grading.

A teacher wants to document the validity of the scores from a classroom assessment strategy she

plans to use for assigning grades on a class unit. What kind of information would provide the

best evidence for this purpose?

o Have other teachers judge whether the assessment strategy covers what was taught.

o Match an outline of the instructional content to the content of the actual assessment.

o Let students in the class indicate if they thought the assessment was valid.

o Ask parents if the assessment reflects important learning outcomes.

Which of the following would most likely increase the reliability of Mrs. Lockwood's multiple-choice end-of-unit examination in middle school band?

o Use a curriculum guide to develop the test questions.

o Change the test format to true-false questions.

o Add more items like those already on the test.

o Add an essay component.

Ms. Gregory wants to assess her students' skills in organizing ideas rather than just repeating

facts. Which words should she use in formulating essay exercises to achieve this goal?

o compare, contrast, criticize

o identify, specify, list

o order, match, select

o define, recall, restate

Mr. Woodruff wanted his students to appreciate the choral works of Craig Hella Johnson. Which

of his test items shown below will best measure his instructional goal?

o What is the name of the collection of Dorothy Water’s poetry that Johnson set to music?

o True or False: Johnson was the first ever Artist in Residence at Texas State University.

o Johnson writes works for: 1. Chorus 2. Soloists 3. Trios 4. All of the above.

o Discuss briefly your view of Johnson’s contribution to choral literature.

Several students in Ms. Atwell's class received low scores on her end-of-unit test covering

counting rhythms in simple duple meters. She wanted to know which students were having

similar problems so she could group them for instruction. Which assessment strategy would be

best for her to use for grouping students?

o Use the test provided in the "teacher's guide."

o Have the students take a test that has separate items for each simple duple meter.

o Look at the student's records to see which topics the students had not performed well on

previously.

o Give students practice worksheets to complete and have them write the counts in under

the examples.


Many teachers score classroom tests using a 100-point percent correct scale. In general, what

does a student's score of 90 on such a scale mean?

o The student answered 90% of the items on this test correctly.

o The student knows 90% of the instructional content of the unit covered by this test.

o The student scored higher than 90% of all the students who took the test.

o The student scored 90% higher than the average student in the class.

Students in Mr. Jakman's music class are required to compose an original song as part of their

end-of-unit grade. Which scoring procedure below will maximize the objectivity of assessing

these student projects?

o When the compositions are turned in, Mr. Jakman identifies the most beautiful

compositions — to his ear — and gives them the highest grades; the next most beautiful receives a lower

grade, and so on.

o Mr. Jakman asks other teachers in the building to rate each composition on a 5-point

scale based on their quality.

o Before the compositions are turned in, Mr. Jakman constructs a scoring key based on

critical features of the projects as identified by the highest performing students in the class.

o Before the compositions are turned in, Mr. Jakman prepares a model of the critical

features of a composition and assigns scoring weights to these features. The compositions with

the highest scores receive the highest grade.

At the close of the first month of school, Mrs. Friend gives her fifth grade students a test she

developed for musical aptitude. Her test is modeled after a standardized aptitude test. It presents

aural examples and then asks questions related to identifying features in the music. When the test

was scored, she noticed that two of her students—who had been performing well in their class

assignments—scored much lower than other students.

Which of the following types of additional information would be most helpful in

interpreting the results of this test?

o The gender of the students.

o The age of the students.

o Reliability data for the standardized test she used as the model.

o Reading comprehension scores for the students.

Frank, a fifth grader taking private violin lessons, received a G. E. (grade equivalent score) of 8.0

on the music theory subtest of a standardized test. This score should be interpreted to mean that:

o Frank can understand 8th grade music theory material.

o Frank scored as well as a typical beginning 8th grader scored on this test.

o Frank is performing in music theory at the 8th grade level.

o Frank will probably reach maximum performance in music theory at the beginning of the 8th

grade.

When the directions indicate each section of a test is timed separately, which of the following is

acceptable test-taking behavior?

o John finishes the vocabulary section early; he then rechecks many of his answers in that

section.


o Mary finishes the vocabulary section early; she checks her answers on the previous test

section.

o Jane finishes the vocabulary section early; she looks ahead at the next test section but

does not mark her answer sheet for any of those items.

o Bob did not finish the vocabulary section; he continues to work on that section when the

testing time is up.

Ms. Camp is starting a new concert cycle with a unit on minor keys in her auditioned choir class.

Before beginning the unit, she gives her students a test on key signature identification. Which of

the following is the most likely reason she gives this test to her students?

o The principal needs to report the results of this assessment to the state testing director.

o Ms. Camp wants to give students practice in taking tests early in the semester.

o Ms. Camp wants to check for prerequisite knowledge in her students before she begins

the unit.

o Ms. Camp wants to measure growth in student achievement of these concepts, and scores

on this test will serve as the students' knowledge baseline.

To evaluate the effectiveness of the Kodaly curriculum for her first graders, Ms. Allen gave them

a standardized test normed for third graders. To decide how well her students performed, Ms.

Allen compared her students' scores to those of the third-grade norm group. Why is this an

incorrect application of standardized test norms?

o The norms are not reliable for first graders.

o The norms are not valid for first graders.

o Third grade music items are too difficult for first graders.

o The time limits are too short for first graders.

When planning classroom instruction for a unit on seventh chord construction, which of these

types of information has more potential to be helpful?

o norm-referenced information: describes each student's performance relative to other

students in a group (e.g., percentile ranks).

o criterion-referenced information: describes each student's performance in terms of status

on specific learning outcomes (e.g., number of items correctly answered for each specific

objective).

o Both types of information are equally useful in helping to plan for instruction.

o Neither; test information is not useful in helping to plan instruction.

Students' scores on standardized tests are sometimes inconsistent with their performances on

classroom assessments (e.g., teacher tests or other in-class activities). Which of the following is

not a reasonable explanation for such discrepancies?

o Some students freeze up on standardized tests, but they do fine on classroom

assessments.

o Students often take standardized tests less seriously than they take classroom

assessments.

o Standardized tests measure only recall of information while classroom assessments

measure more complex thinking.

o Standardized tests may have less curriculum validity than classroom assessment.


Elementary school teachers in the Baker School system collectively designed and developed new

curricula in elementary general music, chorus, and band that are based on locally developed

objectives and objectives in state curriculum guides. The new curricula were not matched

directly to the content of the fourth-grade standardized test. A newspaper reports the fourth-grade

students in Baker Public Schools score among the lowest in the State Assessment

Program.

Which of the following would invalidate the comparison between Baker Public Schools and

other schools in the state?

o The curriculum objectives of the other districts may more closely match those of the State

Assessment.

o Other school systems did not design their curriculum to be consistent with the State

Assessment test.

o Instruction in Baker schools is poor.

o Other school systems have different promotion policies than Baker.


Appendix G

Music Teacher Assessment Implementation Inventory (MTAII)

Music Teacher Assessment Implementation Inventory (MTAII)

In this section, you will be asked to answer two questions related to the forms of assessment you

give, and the purposes for which you use assessment. In both cases, reflect upon a typical class in

your main teaching area.

***

Reflect upon a typical class in your main teaching area.

Within the last four weeks, how often have you used the following forms of assessment?

Never | Less Than Once Per Week | Once Per Week | Several Times Per Week | Nearly Every Day

Written Tests/Quizzes o o o o o

Written Classwork/Homework o o o o o

Group Performances o o o o o

Individual Performances o o o o o

Projects o o o o o

Portfolios o o o o o

Attendance o o o o o

Participation o o o o o

***


Reflect upon a typical class in your main teaching area.

Within the last four weeks, how often have you used assessment for the following purposes?

Never | Less Than Once Per Week | Once Per Week | Several Times Per Week | Nearly Every Day

Summative (i.e., assessments used to provide information about mastery, usually at the end of instruction) o o o o o

Formative (i.e., assessments used to provide students feedback during ongoing instruction) o o o o o

Diagnostic (i.e., assessments used to identify areas of improvement for students) o o o o o

Placement (i.e., assessments used to sort or order students into targeted groupings) o o o o o

Extramusical (i.e., assessments used to motivate or hold students accountable for behaviors) o o o o o


Appendix H

Music Teacher Assessment Beliefs Inventory (MTABI)

In this section, you will be asked to answer questions related to your beliefs about assessment.

Music teachers may adopt different views as to the nature or value of assessment. Please

indicate the extent to which you agree with each statement listed below:

Strongly Disagree | Disagree | Somewhat Disagree | Somewhat Agree | Agree | Strongly Agree

Assessment results are trustworthy o o o o o o

Assessment consistently provides useful information o o o o o o

Assessment results are dependable o o o o o o

Assessment is an important music teacher responsibility o o o o o o

Assessment helps music teachers be more effective o o o o o o

Assessment and instruction can be seamlessly integrated o o o o o o

Assessment forces music teachers to contradict their beliefs o o o o o o

Assessment helps music teachers treat their students fairly o o o o o o

Assessment interferes with teaching o o o o o o

Assessment results are of great use to music teachers o o o o o o

Assessment results are rightfully ignored by most music teachers o o o o o o

Assessment has little impact on music teaching o o o o o o

Assessment results are often inaccurate o o o o o o

Assessment results are prone to error o o o o o o

Assessment typically provides precise information o o o o o o

Assessment causes teachers to be conformists o o o o o o

Assessment reduces music teacher creativity o o o o o o


Appendix I

Invitation to Participate

SUBJECT LINE: Music Teacher Assessment Professional Development

Greetings Music Educators:

I am writing to ask for your help with a study to better understand what music teachers know and

feel about assessment in their classrooms, and to see if professional development created

specifically for music teachers can change those thoughts and feelings. I am sending this

invitation to all members of NAfME who teach music in any PK-12 setting, and I need as many

people as possible to respond.

If you decide to participate in this study, you will complete a brief online questionnaire, which

should take less than 10 minutes. Then, you will be randomly assigned to either an intervention

or control group, and asked to take an online survey about your assessment knowledge, practices,

and beliefs. Those in the intervention group will also receive access to a four-week assessment

workshop, and those in the control group will only be asked to take the online survey. All

participants will also take the same online survey at the end of the study. Your participation is

voluntary, and your responses are confidential. When you click on the screening survey link

below, you will learn more about the study to help you decide whether to participate.

If you have any questions about the study, please contact me via email at

[email protected]. A summary of major research findings will be made available to

interested participants upon request. This research has been reviewed and approved by an IRB.

You may talk to them at (303) 735-3702, or [email protected].

This screening survey will remain open to you for just under three weeks, closing on March 30,

2020. To begin the survey, click on this link or copy and paste the URL into your internet

browser: https://cuboulder.qualtrics.com/jfe/form/SV_0OjuK7q3vBqPPed

Warm Regards,

Jocelyn W. Armes

PhD Candidate in Music Education

University of Colorado Boulder

[email protected]


Appendix J

Trigger Email Correspondence

Intervention Group Pretest Trigger Email

Greetings Music Educators:

Thank you for enrolling in this study about music teacher assessment literacy, beliefs, and

practices. You have been assigned to the intervention group of this study, and will receive the

online professional development from March 23 - April 19, 2020.

In order to enroll in the course, you will need a Gmail account. Some school districts use Google products for teacher productivity; for the purposes of this study, it would be simpler to use a personal Gmail account. If you do not have one, please go to www.gmail.com and create a Google email account to enroll in this course. Instructions for creating a Gmail account can be found by following this link: https://edu.gcfglobal.org/en/gmail/setting-up-a-gmail-account/1/.

After you are logged into your Gmail account, you can enroll in the professional development

course using the following steps:

1. In the same window, type classroom.google.com

2. At the top, click + (Add) and then Join class.

3. Enter the class code: rojrvgn

On the homescreen of the course, you will find further instructions for how to navigate through

our classroom. The professional development course will last for four weeks. The course

calendar is as follows:


Week | Start | Finish
I: Choosing Assessment Methods Appropriate for Instructional Decisions | March 23 | March 29
II: Developing Assessment Methods Appropriate for Instructional Decisions | March 30 | April 5
III: Administering, Scoring, and Interpreting Assessments | April 6 | April 12
IV: Using Assessment Results for Educational Decision Making | April 13 | April 19

After completing the course, you will be sent a link to another questionnaire like the one you

recently completed. Thank you, again, for your participation in this study. If you have any

further questions, please contact me via email at [email protected].

Warm regards,

Jocelyn W. Armes

PhD Candidate in Music Education

University of Colorado Boulder


Appendix K

CALI Item Distractor Analysis


Appendix L

Intervention Participant Feedback

Participants responded to five feedback questions:

1. Was the online professional development relevant to you as a music teacher?
2. Was the online professional development course appropriately challenging or too difficult?
3. What did you like about the online professional development course?
4. What would you have changed about the online professional development course to make it more enjoyable or useful?
5. Would you recommend this online professional development course to other music educators?

Participant 1
1. Yes, it was very relevant for me as a jh and hs choir teacher, and I will be referring back to the chapters as I need
2. I thought overall it was appropriately challenging. There was only one section that I got sort of lost, but it wasn't overly difficult
3. I thought there was a lot of great content in the primary book we were reading and discussing from. I always love to learn of a new to me author or researcher. I also really enjoyed the discussion with other music teachers, hearing some of our similarities and differences, in thinking, practices, and circumstances
4. Overall I wouldn't change anything. I personally struggle with a lot of academic reading and also read very slowly, so that articles and chapters could sometimes be a lot to get through. Having audio chapters or video format information may have been helpful at times, but I know how to work around my reading issues for myself, so it wasn't an absolute need
5. I would, but I am not sure how well received it would be by every music teacher. I think it approaches some negative thinking and stereotypes that I hear and see from other music teachers in a way that is very clear and shows alternative approaches, but it may be too engrained in some people to have a positive outtake from this pd

Participant 2
1. Yes it was.
2. I felt it was appropriately challenging.
3. It forced me to take a look at my own assessment practices (or lack thereof) and decide how to use more assessment in my classroom.
4. I'm not sure if I would have changed anything.
5. Yes I would.

Participant 3
1. Yes
2. It was not too difficult, but perhaps had a little too many texts involved. Some seemed more relevant than others.
3. Reading the texts, especially the Shaw book and Russell article.
4. It seemed to be designed like a college course, which would be fine for college students. However, if the target audience is working teachers, I'd recommend cutting back on some of the reading and activities and streamlining it a bit. It was a lot you were asking people to do.
5. I would highly recommend some of the texts shared, but the development course itself I would probably not recommend.

Participant 4
1. I was able to complete the PD course. What I experienced was relevant to my position as a general music teacher.
2. To easy
3. I appreciated the design of the course. The readings were informative without begin too complex. The assignments seemed focused on real world teaching problems.
4. I would have appreciated the course being offered over a longer period of time. 8 or more weeks would have allowed me the time to participate fully in the course.
5. I would recommend this course. Assessment is a critical element to music teaching which is frequently misunderstood. Well designed PD in assessment is badly needed.

Participant 5
1. Yes. I was only able to complete some activities because my schedule became overloaded with distance learning preparations and implementation. What I participated in was valuable. I wish circumstances had been different so I could have benefited from the training in its entirety.
2. Appropriately challenging for ordinary circumstances. Difficult to manage in what became my current situation.
3. It made me think and presented valuable information on ways assessment can be a valuable tool in the music classroom beyond jumping through district hoops.
4. No covid-19... but you had zero control of that
5. Yes

Participant 6
1. Yes
2. Appropriately challenging
3. The accessibility of an online course was appealing.
4. Nothing, I found it quite useful.
5. Yes

Participant 7
1. Yes.
2. I felt it was appropriately challenging.
3. It used relevant material. I plan on buying at least one of the books that some chapters of the readings were taken from so I can read it in full and mark it up.
4. The "UGH!" comments participants kept leaving. It made them sound like my jr high students.
5. Yes.

Participant 8
1. Yes
2. I think it was appropriately challenging. However, I did not have enough time to fully participate in the course.
3. The reading was insightful
4. Other than it taking place during a global pandemic where I've been engaged in other PD on smartmusic, zoom and other platforms to serve my students, nothing.
5. I think a lot of teachers in my district would benefit from this development.


Appendix M

Teacher-Constructed Task Exemplar


Appendix N

Descriptive Data for CALI, MTAII, and MTABI by Assigned Group

CALI

Values are Mean (SD); Control Group n = 25, Intervention Group n = 18.

Item | Control Pretest | Control Posttest | Intervention Pretest | Intervention Posttest
Standard 1 | 3.60 (0.96) | 3.88 (0.73) | 3.56 (0.86) | 4.06 (0.64)
1.1 | 0.96 (0.20) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
1.2 | 0.68 (0.48) | 0.76 (0.44) | 0.44 (0.51) | 0.78 (0.43)
1.3 | 0.88 (0.33) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
1.4 | 0.80 (0.41) | 0.84 (0.37) | 0.89 (0.32) | 0.94 (0.24)
1.5 | 0.28 (0.46) | 0.28 (0.46) | 0.22 (0.43) | 0.33 (0.49)
Standard 2 | 3.20 (0.96) | 3.28 (0.84) | 3.00 (0.91) | 3.22 (0.81)
2.1 | 0.88 (0.33) | 0.96 (0.20) | 0.89 (0.32) | 0.89 (0.32)
2.2 | 0.24 (0.44) | 0.20 (0.41) | 0.11 (0.32) | 0.22 (0.43)
2.3 | 0.68 (0.48) | 0.60 (0.50) | 0.56 (0.51) | 0.72 (0.46)
2.4 | 0.96 (0.20) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
2.5 | 0.44 (0.51) | 0.52 (0.51) | 0.44 (0.51) | 0.39 (0.50)
Standard 3 | 3.80 (0.96) | 3.80 (0.82) | 3.67 (0.91) | 4.11 (0.83)
3.1 | 0.88 (0.33) | 0.92 (0.28) | 0.72 (0.46) | 0.83 (0.38)
3.2 | 0.96 (0.20) | 0.88 (0.33) | 0.89 (0.32) | 1.00 (0.00)
3.3 | 0.48 (0.51) | 0.44 (0.51) | 0.61 (0.50) | 0.67 (0.49)
3.4 | 0.52 (0.51) | 0.56 (0.51) | 0.50 (0.51) | 0.67 (0.49)
3.5 | 0.96 (0.20) | 1.00 (0.00) | 0.94 (0.24) | 0.94 (0.24)
Standard 4 | 2.24 (0.58) | 2.04 (0.93) | 2.11 (0.58) | 2.28 (1.32)
4.1 | 0.52 (0.51) | 0.48 (0.51) | 0.28 (0.46) | 0.50 (0.51)
4.2 | 0.84 (0.37) | 0.72 (0.46) | 0.89 (0.32) | 0.67 (0.49)
4.3 | 0.40 (0.50) | 0.36 (0.49) | 0.44 (0.51) | 0.61 (0.50)
4.4 | 0.32 (0.48) | 0.32 (0.48) | 0.50 (0.51) | 0.44 (0.51)
4.5 | 0.16 (0.37) | 0.16 (0.37) | 0.00 (0.00) | 0.06 (0.24)
Total Score | 12.84 (2.54) | 13.00 (2.02) | 12.33 (1.24) | 13.67 (1.75)
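The item-level means above read as proportions correct, which suggests 0/1 item scoring; under that assumption, each standard subscore is a count out of 5 and the total is a count out of 20. A minimal scoring sketch with hypothetical item responses:

```python
# Hypothetical 0/1 item scores for one respondent; keys follow the
# item numbering above (five items per standard).
items = {
    "1.1": 1, "1.2": 0, "1.3": 1, "1.4": 1, "1.5": 0,
    "2.1": 1, "2.2": 0, "2.3": 1, "2.4": 1, "2.5": 0,
    "3.1": 1, "3.2": 1, "3.3": 0, "3.4": 1, "3.5": 1,
    "4.1": 0, "4.2": 1, "4.3": 0, "4.4": 0, "4.5": 0,
}

# Subscore per standard (0-5) and total score (0-20).
standards = {s: sum(v for k, v in items.items() if k.startswith(s + "."))
             for s in ("1", "2", "3", "4")}
total = sum(standards.values())
print(standards, total)  # {'1': 3, '2': 3, '3': 4, '4': 1} 11
```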


MTAII

Values are Mean (SD); Control Group n = 25, Intervention Group n = 18.

Category | Item | Control Pretest | Control Posttest | Intervention Pretest | Intervention Posttest
Practices | Written Tests & Quizzes | 1.76 (0.72) | 2.08 (0.64) | 1.72 (0.67) | 1.72 (0.67)
Practices | Written Classwork | 1.88 (0.83) | 2.56 (1.04) | 2.06 (0.94) | 2.67 (0.91)
Practices | Group Performances | 3.68 (1.22) | 3.08 (1.61) | 3.89 (1.23) | 2.61 (1.50)
Practices | Individual Performances | 3.16 (1.18) | 2.92 (1.04) | 2.78 (1.00) | 2.61 (1.20)
Practices | Projects | 2.20 (1.16) | 1.96 (0.84) | 2.11 (1.32) | 2.22 (0.88)
Practices | Portfolios | 1.32 (0.56) | 1.44 (1.12) | 1.17 (0.38) | 1.50 (0.79)
Practices | Attendance | 2.68 (1.84) | 2.12 (1.62) | 3.28 (1.93) | 2.89 (1.84)
Practices | Participation | 3.76 (1.62) | 3.56 (1.50) | 4.44 (1.34) | 4.00 (1.33)
Purposes | Summative | 2.08 (0.82) | 2.12 (0.73) | 2.33 (0.84) | 2.00 (0.77)
Purposes | Formative | 3.76 (1.17) | 3.60 (1.26) | 3.78 (1.48) | 4.06 (1.21)
Purposes | Diagnostic | 3.16 (1.52) | 2.84 (1.31) | 3.22 (1.56) | 3.28 (1.36)
Purposes | Placement | 1.40 (0.87) | 1.44 (0.77) | 1.67 (1.03) | 1.56 (0.71)
Purposes | Extramusical | 2.56 (1.53) | 2.12 (1.20) | 2.22 (1.40) | 2.06 (1.06)

*Frequency scale from 1-5 (Never to Nearly Every Day), matching the five MTAII response anchors in Appendix G.
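Under that 1-5 coding, computing a respondent's mean frequency of use is simple arithmetic. A minimal sketch, with hypothetical responses:

```python
# Coding the MTAII frequency anchors as 1-5 and averaging one respondent's
# "purposes" ratings; the responses here are hypothetical.
ANCHORS = {
    "Never": 1,
    "Less Than Once Per Week": 2,
    "Once Per Week": 3,
    "Several Times Per Week": 4,
    "Nearly Every Day": 5,
}

responses = {
    "Summative": "Once Per Week",
    "Formative": "Nearly Every Day",
    "Diagnostic": "Several Times Per Week",
}
codes = {purpose: ANCHORS[label] for purpose, label in responses.items()}
mean_use = sum(codes.values()) / len(codes)
print(codes, round(mean_use, 2))  # {'Summative': 3, 'Formative': 5, 'Diagnostic': 4} 4.0
```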


MTABI

Values are Mean (SD); Control Group n = 25, Intervention Group n = 18.

Item | Control Pretest | Control Posttest | Intervention Pretest | Intervention Posttest
Assessment is an important music teacher responsibility | 5.16 (0.75) | 5.20 (0.76) | 5.05 (0.98) | 5.16 (0.75)
Assessment and instruction can be seamlessly integrated | 4.96 (0.94) | 5.20 (0.91) | 4.86 (0.99) | 5.12 (0.91)
Assessment helps music teachers to be effective | 5.08 (0.81) | 5.00 (0.91) | 5.00 (0.87) | 5.00 (0.85)
Assessment has little impact on music teaching⍑ | 4.80 (1.26) | 4.76 (0.97) | 4.86 (1.21) | 5.00 (0.87)
Assessment forces music teachers to contradict their beliefs⍑ | 4.20 (1.47) | 4.80 (0.91) | 4.30 (1.32) | 4.81 (0.98)
Assessment consistently provides useful information | 4.84 (0.85) | 4.72 (0.79) | 4.72 (0.93) | 4.70 (0.74)
Assessment results are rightfully ignored by most music teachers⍑ | 4.60 (0.91) | 4.60 (1.08) | 4.56 (1.03) | 4.67 (1.06)
Assessment reduces music teacher creativity⍑ | 4.48 (1.30) | 4.56 (1.04) | 4.47 (1.33) | 4.65 (1.02)
Assessment causes music teachers to be conformists⍑ | 4.04 (1.20) | 4.44 (1.04) | 4.19 (1.26) | 4.56 (1.05)
Assessment results are of great use to music teachers | 4.04 (1.54) | 4.40 (1.26) | 4.30 (1.41) | 4.51 (1.06)
Assessment interferes with teaching⍑ | 4.04 (1.31) | 4.20 (1.16) | 3.98 (1.32) | 4.47 (1.10)
Assessment results are dependable | 4.32 (0.63) | 4.48 (0.82) | 4.09 (0.81) | 4.37 (0.79)
Assessment results are trustworthy | 4.24 (0.83) | 4.36 (0.86) | 4.12 (0.79) | 4.35 (0.78)
Assessment helps music teachers treat their students fairly | 4.08 (1.58) | 4.16 (1.38) | 4.02 (1.32) | 4.26 (1.22)
Assessment results are often inaccurate⍑ | 4.28 (1.10) | 4.08 (0.95) | 4.09 (1.13) | 4.05 (0.87)
Assessment typically provides precise information | 3.80 (1.08) | 3.96 (1.06) | 3.74 (0.98) | 3.95 (0.95)
Assessment results are prone to error⍑ | 4.00 (1.12) | 3.96 (1.06) | 3.70 (1.06) | 3.86 (0.94)
Total | 74.96 (10.78) | 76.88 (9.57) | 74.05 (11.59) | 77.49 (9.30)

*Agreement scale from 1-6.
**Posttest means shown in bold in the original table indicated an increase from pretest means.
⍑ Negatively phrased items that were reverse coded after data collection.
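The ⍑ footnote implies the usual reverse-coding arithmetic: on the 1-6 agreement scale, a response r to a negatively phrased item becomes 7 - r before items are summed. A minimal sketch; the item set shown is only a subset of the flagged items, for illustration:

```python
# Reverse coding negatively phrased MTABI items: on the 1-6 agreement
# scale, a response r becomes 7 - r (1 <-> 6, 2 <-> 5, 3 <-> 4).
# Only a subset of the flagged items is listed here, for illustration.
NEGATIVE_ITEMS = {
    "Assessment has little impact on music teaching",
    "Assessment interferes with teaching",
}

def reverse_code(item, response, scale_max=6):
    # Flip the response only when the item is negatively phrased.
    return (scale_max + 1 - response) if item in NEGATIVE_ITEMS else response

print(reverse_code("Assessment interferes with teaching", 2))  # -> 5
```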