Psychometric Review
1
RUNNING HEAD: Psychometric Review

Psychometric Review of Speech Tests for Preschool Children: 25 Years Later

A Senior Honors Thesis Presented in Partial Fulfillment of the Requirements for Graduation with Research Distinction in Speech and Hearing Science in the Undergraduate Colleges of The Ohio State University

By Brook N. Mathias

The Ohio State University

June, 2010

Project Advisor: Dr. Rebecca McCauley, Department of Speech and Hearing Science
Abstract
The purposes of this study were to (a) examine the psychometric quality of 9 tests that
help speech-language pathologists assess the speech sound production of preschool
children and (b) compare these results with those obtained in a similar study
conducted 25 years ago (McCauley & Swisher, 1984). To accomplish this, manuals for
these 9 tests were examined using 11 criteria considered important for standardized
norm-referenced tests. All of the current tests met at least 4 of the criteria, and 8 of the
9 tests met at least 5. On the other hand, the reliability criteria were not met by any of
the tests. Compared with the earlier results, some improvement over the past 25 years
can be seen. Nonetheless, more needs to be done to provide the best possible tools for
identification of speech sound disorders in young children.
Acknowledgements

I would like to thank my advisor, Rebecca J. McCauley, for her constant support and
guidance throughout the entire thesis process. Her dedication and experience have proven
invaluable, and I have grown professionally and academically as a result. I would also
like to thank Michelle Bourgeois for all of her time, encouragement, and support throughout
the year. Furthermore, I would like to thank Abby Kinsey and Melanie Wollam for their
help in the accuracy procedures as well as their collaboration in the review process.
This project was supported by the Arts and Sciences Undergraduate Scholarship
and the Social and Behavioral Sciences Undergraduate Research Scholarship.
Table of Contents

Abstract
Acknowledgements
Chapter 1: Introduction and Literature Review
Chapter 2: Methods
Chapter 3: Results
Chapter 4: Discussion
Chapter 5: Summary
References
Appendix. Reviewed Tests
Tables
List of Figures
Figures
Psychometric Review of Speech Tests for Preschool Children: 25 Years Later
Tests of speech sound development have been useful to clinicians in identifying
problems and patterns in a person's speech and in creating an intervention program
tailored to that individual's needs. However, the clinician must determine which test is
appropriate for a given purpose, a decision that requires knowledge both of the test's
materials and of the child to whom the test will be administered. Therefore, it is
necessary for a clinician to be familiar with the skills and tools that a test requires
(Andersson, 2005).
In the last 25 years, the development of new or updated speech and language tests
has widened the range of tests from which the clinician can choose. Often the clinician
will choose to administer a norm-referenced test because of its ability to compare a
client's performance with levels expected of children within a similar age group or
category, that is, with a set of data on similar children (i.e., a set of norms). Based on
the standards established by the test developers, a speech-language pathologist is able
to answer the question of how the client compares to the average (Shipley & McAfee,
2009). This provides a basis for determining whether the client falls within or outside
the range of the average children in that group. If the child's scores fall outside the
range of the normative sample, the child may be said to exhibit significant speech or
language differences (McCauley & Swisher, 1984).
Through the use of norm-referenced standardized tests, professionals are able to
calculate scores without excessive influence from external contributors such as setting
or errors in test instruction. Thus, to maximize validity and reliability, test users must
follow the administration directions as they are given in the test manual (Shipley &
McAfee, 2009). As a follow-up to the Psychometric Review of Language and Articulation
Tests for Preschool Children (McCauley & Swisher, 1984), this review is intended to
revisit the results of McCauley and Swisher's original study and to reexamine currently
available speech sound production tests using similar criteria. Because McCauley and
Swisher (1984) found that the 30 language and articulation tests they reviewed failed to
exhibit many of the psychometric characteristics required of well-standardized norm-
referenced tests, replicating their study makes it possible to evaluate whether this
situation has improved.
Since 1984, several reviews of the psychometric characteristics of language tests
have become available. In those reviews, McCauley and Swisher's criteria from the 1984
study are often cited and replicated; researchers have often combined McCauley and
Swisher's criteria with criteria of their own to evaluate specific tests
(Andersson, 2005; Plante & Vance, 1994). In most reviews, however, the researchers
consider only a few tests or do not review each test against a set list of criteria.
Moreover, published psychometric reviews of speech sound production tests in particular
are scarce. Accordingly, this study applies the same 11 criteria to 9 norm-referenced
speech sound production tests for use with preschool children.
The intent of this paper is to revisit the concerns raised by McCauley and Swisher
25 years ago by reviewing test materials created after 1984 (specifically, in 1998 or
later). Discussions stimulated by this paper may benefit speech-language pathologists
who administer speech tests. Clinicians can also use this information to become more
aware of the psychometric data contained in these tests and more knowledgeable about
each test's strengths and weaknesses when making clinical decisions. Test users can
likewise use these data to conduct an appropriate psychometric review before
purchasing a test (McCauley & Swisher, 1984). Nonetheless, this study should be used
not as a definitive review but as a basic guideline for clinicians regarding a test's
psychometrics.
It is important to understand several concepts when reviewing norm-referenced
standardized tests for their psychometric data. Therefore, prior to discussing the criteria
used in this review, it is necessary to address the following: (a) test validity and reliability
and (b) the normative sample.
Background on Basic Psychometric Concepts
Validity and Reliability
“Most standardized tests published for speech language pathologists include
details concerning the technical quality of the instruments including information
concerning the sample to whom the test was administered, the norming procedures, and
the reliability and validity of the test” (Hutchinson, 1996, p. 109). Clinicians can be
confident in using a measurement instrument when that instrument shows evidence of
validity and reliability. To be considered well developed as well as standardized, tests
must be supported by evidence that they function as intended when used as recommended.
Specifically, they must demonstrate evidence of reliability and validity for intended
purposes and populations. Reviewing a test’s psychometric characteristics entails
critically evaluating evidence that the test functions as intended and consistently for the
purposes and populations for which it was originally developed and is currently used
(McCauley & Strand, 2008).
A test is said to be valid to the degree that the instrument measures what it is designed
to measure and does so accurately. For example, a thermometer is a valid measurement
instrument of temperature to the degree that it accurately measures temperature. In the
case of speech sound production tests, the instrument is valid if it accurately measures a
child’s speech production ability as it is intended to be measured. However, test validity
is largely the product of successful test construction and evaluation (McCauley &
Swisher, 1984). Several approaches to gathering validity evidence should be considered
to ensure the overall accuracy of an instrument. Test validity is addressed by obtaining
multiple sources of evidence that support the use of a test for a particular purpose
(Messick, 1989; Plante & Vance, 1994). Therefore, with regard to validity, three kinds
of evidence are considered important to any test that measures behavior and is used to
make inferences about underlying abilities; these sources have sometimes been referred
to as construct, content, and criterion-related validity (APA, 1974).
Construct validity refers to evidence that a test measures a predetermined
theoretical construct, which is an explanation of a behavior or attribute based on
empirical observation (Shipley & McAfee, 2009). For example, the concept that
children’s speech production improves with age is based on speech development studies.
Therefore, when a valid test is administered to preschool-aged children, their scores
should progressively improve as the children grow older. Construct validity
must be shown through evidence that a test can be used reliably, that individual test items
are sensitive to the intended construct and insensitive to extraneous factors, and that
performance on the test provides the intended information about the test taker (Plante &
Vance, 1994). Because construct validity is important to the overall validity of a test, this
concept was added as an eleventh criterion to McCauley and Swisher's (1984) original
ten criteria. It had initially been excluded from the review because of concerns about
difficulties in operationalizing the criterion to be associated with such evidence
(McCauley & Swisher, 1984).
Content validity refers to evidence that a test’s contents are representative of the
content domain of the skill being assessed (Shipley & McAfee, 2009). Content validity is
often the first aspect of validity addressed in a test manual and, in many cases, is the
principal form of validity addressed (Plante & Vance, 1994). In the case of this study,
content validity can be shown if a speech sound production test addresses all of the
phonemes necessary for assessing the entire articulation spectrum. It is important that
the content validity of a test be examined by an individual who possesses expertise in
the behavior being assessed (McCauley & Swisher, 1984).
Criterion-related validity refers to evidence of validity that is established by use
of an external criterion (Shipley & McAfee, 2009). If a new test’s results agree with the
results of tests that users already accept as valid measures of the construct, we can have
increased confidence in the new test. “On the other hand, when the test under review
produces results that are quite different from the accepted test, some analysis and
explanation of the differences in the two tests should be provided” (Hutchinson, 1996, p.
111). Therefore, a valid instrument is supported by a collection of empirical evidence that scores on
the test are related to some other measure of the behavior being assessed. In most cases,
as well as in our review, two types of criterion-related validity are evaluated: concurrent
and predictive validity.
Concurrent validity refers to the validity of a test in comparison to a widely
accepted standard (Shipley & McAfee, 2009). For example, when determining if a
person’s score on an achievement test allows for estimation of that person’s score on a
criterion measure, the score must be compared to other achievement tests presumed to be
valid and/or judgments of achievement by teachers, parents, and students. “If the person’s
score on the achievement test presents evidence of content validity and elicits test scores
corresponding closely to judgments and scores from other valid achievement tests, we
can conclude that there is evidence of that test’s criterion-related validity” (Salvia,
Ysseldyke, & Bolt, 2010, p. 67).
Predictive validity refers to evidence of the test’s ability to predict performance
(the criterion measure) in another situation or at a later time. It implies that there is a
known relationship between the behaviors the test measures and the skills exhibited at
some future time (Shipley & McAfee, 2009). Thus, predictive criterion-related validity
refers to the temporal sequence by which a person’s performance on some criterion
measure is estimated on the basis of that person’s current assessment. For example,
reading progress can be assessed by a reading achievement test (presumed to be valid) or
by teacher judgments of reading ability. If a reading readiness test has content validity
and produces scores that are highly related either to later teacher judgments of readiness
or to validly assessed reading skills, we can conclude that the test is a valid test of
reading readiness and demonstrates predictive validity (Salvia, Ysseldyke, & Bolt, 2010).
Note that it is primarily tests used to anticipate future outcomes for which evidence of
predictive validity is obtained, and for such tests this evidence is of great significance
in evaluating test quality.
Any characteristic of a test that causes it to measure something other than the
behavior of interest decreases the validity of the test. “One of the most important
characteristics of a test that limits its validity is its reliability; when a test is unreliable, it
fails to measure consistently what it is intended to measure and, therefore, it is a less
valid measurement” (McCauley & Swisher, 1984, p. 35). “Unlike validity, which is a
unitary notion (there are different kinds of evidence but not different kinds of validity),
reliability is not unitary. That is, a test's reliability can be classified by the various sources
of error possible: (a) items and subtests, (b) examiners, (c) conditions of time and place,
and (d) test takers and standardization samples” (Hutchinson, 1996, p. 115). Reliability
refers to a test's results being replicable: when administered properly, a reliable test
gives consistent results on repeated administrations or with different interpreters judging
the same administration (Shipley & McAfee, 2009). For example, if a reliable articulation
test is used to measure speech production ability, a child should receive the same scores if
tested at different times on that test in the same day. For the purposes of this review, two
kinds of reliability will be discussed: test-retest and interexaminer reliability.
Test-retest reliability indicates the relationship between a child's scores on a
single test when that test is administered to the same child on two occasions
(Andersson, 2005; Linn & Gronlund, 2000). For example, if the same child is tested
once and receives a standard score of 100 and then is tested 2 weeks later and
receives a standard score of 65, something has compromised the accuracy of the
results. It is necessary to provide a time interval between the two testing periods
long enough to minimize the child's familiarity with the test materials but short
enough that the scores are not affected by maturation or learning (Andersson, 2005;
Linn & Gronlund, 2000). A time interval of 6 months, for example, would probably
allow a child too much time to progress in speech production development and would
most likely result in a disagreement
between scores on the first and second testings. If the child is retested 2 weeks after
the initial testing, however, the scores should remain similar, supporting the accuracy
of the results.
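The relationship described above is typically summarized with a Pearson correlation coefficient. As a minimal sketch, the coefficient can be computed from two administrations and checked against the .90 standard cited elsewhere in this review (Salvia & Ysseldyke, 1981); the paired standard scores below are invented for illustration, not drawn from any reviewed test.

```python
# Minimal sketch of a test-retest reliability check. The paired standard
# scores below are hypothetical, invented purely for illustration.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Standard scores for the same ten children, tested 2 weeks apart.
first  = [100, 92, 85, 110, 98, 105, 88, 95, 102, 90]
second = [98, 94, 83, 112, 97, 103, 90, 96, 100, 91]

r = pearson_r(first, second)
print(f"test-retest r = {r:.2f}")
print("meets the .90 criterion" if r >= 0.90 else "falls below the .90 criterion")
```

A published reliability study would also report the statistical significance of the coefficient; this sketch shows only the correlation itself.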
Interexaminer reliability indicates the relationship between scores received when
two or more independent examiners score the test (Andersson, 2005; Linn & Gronlund,
2000). Therefore, it is important that when the same child is tested by two different
clinicians, the results remain consistent. “Although some variation is expected, large gaps
between two examiners' scores can be a reason for concern. For example, it does little
good to administer a test on which the same child might receive a standard score of 100
or a standard score of 65, depending on the conditions under which the test is
administered (e.g., who administers the test)" (Andersson, 2005, p. 214). If a child's
scores differ when the test is administered by two different clinicians, the clinicians'
administration and scoring techniques do not match, and the mismatch can cause
misinterpretation of the child's speech sound abilities or diagnosis. To ensure adequate
interscorer reliability, it is appropriate for test publishers to provide test users
with proper training in the test procedures prior to administration.
The Normative Sample
The standardization sample consists of the children to whom the test is
administered when the test is under development; it is usually intended to be
representative of the general population of children. “Selection of children for this
sample is important because the test developer uses the results for these children as
a basis for the test’s normative data (i.e., these data determine the correspondence
between the raw scores and the standard scores), allowing clinicians to compare the
performance of one child to that of the child’s peers” (Andersson, 2005, p. 208;
Salvia & Ysseldyke, 2004). “The manual should also provide enough details concerning
the standardization sample that a user can judge if the norms are appropriate for use with
a particular child or types of children to be evaluated. Two important criteria are the
reasonable representativeness and adequate size of the sample” (Hutchinson, 1996, p.
117). Therefore, a larger sample provides the normative data with a higher degree of
accuracy and a smaller degree of error. “This number should be large enough to
guarantee stability. In practice, 100 participants in each age or grade is considered
minimum” (Salvia, Ysseldyke, & Bolt, 2010, p. 50). When a wider range of children are
tested during development of the test, the results show more significance in
corresponding to the actual population represented. However, the use of a wider sample
size can also lead to concerns regarding dialect differences in pronunciation; therefore,
these situations must be noted.
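The claim that a larger sample carries a smaller degree of error can be made concrete with the standard error of the mean, which shrinks with the square root of the sample size. A brief sketch follows; the standard deviation of 15 is simply the conventional standard-score value, not a figure from any reviewed test.

```python
# Illustration of why larger normative subgroups yield more stable norms:
# the standard error of the mean shrinks as the sample size grows.
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n with standard deviation sd."""
    return sd / sqrt(n)

sd = 15.0  # conventional standard-score standard deviation (hypothetical here)
for n in (25, 100, 400):
    print(f"n = {n:>3}: standard error of the mean = {standard_error(sd, n):.2f}")
```

Quadrupling a subgroup from 100 to 400 children halves the error, which is consistent with treating 100 per subgroup as a practical floor rather than an ideal.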
Enough information must be given about the sample for clinicians to judge whether the
instrument is adequate for evaluating a given test taker's speech sound production
abilities. Ideally, the normative data can be used to answer relevant questions about
test takers; unfortunately, however, most tests do not incorporate the information needed
to answer whether an impairment is present (McCauley & Swisher, 1984). Overall, the
normative sample should provide information that aids the test user in making decisions
about where a child's abilities are lacking.
“Normative samples usually include individuals who represent the age and
demographic for whom the test is intended. For psychological tests, information
regarding the age, geographic residence, and socioeconomic status of individuals in the
normative sample should be supplied” (APA, 1974; McCauley & Swisher, 1984, p. 49).
Most often this sample is obtained from the general population; however, specific
populations can be used depending on the intention of the test (Peña, Spaulding, & Plante,
2006). For example, if a test was intended for use with individuals with dementia, the
normative sample would be drawn from individuals with a diagnosis of memory
loss to ensure an appropriate comparison of results. Therefore, when testing a person
for dementia, it is necessary to have a normative sample that represents that population to
create cutoffs to determine the presence of the condition and its severity. Accordingly,
specialized normative samples may potentially be of value for tests aimed at speakers
of particular dialects (e.g., Southern speakers or speakers of African American
Vernacular English).
However, when testing children for speech sound production ability, clinicians
are often looking for whether an impairment exists rather than for help in quantifying
severity. This raises the question of whom the child should be compared with in order to
make this decision. Therefore, it is important to indicate whether a test excludes
individuals on the basis of disability or nonnormal language ability. Although such
exclusions may seem to
make sense when one wants to ask whether a child’s performance is similar or dissimilar
to the performance of “normal” or “normal speaking” children, this approach to choosing
the normative sample presents some difficulties. One such problem is that even the most
deviant scores contained in the normative sample represent normal performance.
Historically, some authors (McCauley & Swisher, 1984, p. 49; Salvia & Ysseldyke, 1981)
have suggested that a child receiving a score below the lowest score received by the
normative sample has received a score with an unknown probability and may or may not
represent nonnormal performance if all of the participants in the norming study were
thought to have typical speech or language. “In the field of communication disorders, it is
often interpreted to mean that tests should include children with documented language
impairment (LI) in the normative group” (Peña, Spaulding, & Plante, 2006, p. 248).
The desire to examine recent speech sound production tests led to the adoption of
methodology addressing major issues examined in the earlier study. Thus, the purposes of
this study were to (a) examine the psychometric quality of 9 tests that help speech-
language pathologists assess the speech sound production of preschool children and (b)
compare these results with those obtained in a similar study conducted 25 years ago
(McCauley & Swisher, 1984).
Methods
Test Search Strategy
Initially, a list of 376 language and speech sound production tests was compiled
as possible candidates for the review. These tests were gathered from two primary
sources: the Buros Institute's Test Reviews Online (Buros Institute, 2009) and the
American Speech-Language-Hearing Association's Online Catalogue (ASHA, 2009).
Publisher websites were also examined. Key terms used in searching included all of the
following: articulation, phonology, speech, language, apraxia, and oral motor.
We selected tests for review if they (a) were norm-referenced instruments; (b)
included preschool-aged children (i.e., children at or below the age of 4 years); (c)
were not considered screening instruments; (d) were available in March or April 2010
through a commercial source; and (e) were published in or after 1998. From these, we
excluded tests if more detailed examination revealed any of the following: (a) their
material focused on oral motor skills or apraxia, (b) the test was designed for non-native
English speakers, or (c) the test was designed for the hearing impaired. These criteria
were used in order to focus on tests most likely to help us answer questions related to the
psychometric characteristics of speech sound production tests for preschool children.
Psychometric Criteria for Review
“The criteria in this review consist of a set of characteristics that should be
considered when a clinician chooses or makes use of a norm-referenced test to evaluate a
child’s speech status. They were chosen due to their recognized importance and relevance
to tests of articulation, and because they could be translated into relatively objective
decision rules” (APA, 1974; McCauley & Swisher, 1984, p. 37; Salvia & Ysseldyke,
1981). The failure of a test to meet these criteria or provide information on them has
serious effects on the merits of a test, no matter how well the content of a given
test harmonizes with the test user's concept of the skill being tested.
Because this study is a replication of McCauley & Swisher (1984), the first 10
criteria are based exactly on what was used in the 1984 study; however, the eleventh
criterion was added because of its significance in identifying an instrument as a valid measure.
In fact, these criteria and the consequences of their not being met are stated exactly as
they were in McCauley and Swisher (1984) and in McCauley and Strand (2008):
Criterion 1. The test manual should clearly define the standardization sample so
that the test user can examine its appropriateness for a particular test taker (APA,
1974, pp. 20-21; McCauley & Swisher, 1984; Weiner & Hoock, 1973). To pass
this criterion, a test needed to give three pieces of information considered
important for speech testing: (a) the normative sample’s geographic residence, (b)
socioeconomic status (family income), and (c) the "normalcy" of subjects in the sample,
including the number of individuals excluded because they exhibited nonnormal
speech or nonnormal development. If any of the information stated prior was not
present in the test manual, that test did not meet the criterion. It was necessary
that all parts of the standardization sample information be present to meet this
criterion.
Consequences if unmet. Without this information, the test user cannot tell whether
the normative sample is representative of the test author's intended population and
whether it is the population against which the test taker's performance should be
compared. For example, the way in which the standardization sample was chosen
may have resulted in the use of children with a very different language
background from that of the test taker. In such a case, the comparison of the test
taker to the norms would be of little value if the clinician were interested in the
normalcy of the test taker's language, but would be of considerable value if the
clinician were interested in whether the test taker's language differed from that of
the normative group.
Criterion 2. For each subgroup examined during the standardization of the test, an
adequate sample size should be used. In order to pass this criterion, a test needed
to have subgroups with a sample size of 100 or more. This particular value is
consistently referred to by authorities as the lower limit for adequate sample size
(APA, 1974, pp. 27-28, 37; McCauley & Swisher, 1984). Furthermore, it was
necessary to examine if the age intervals in the standardization sample were
consistent with the age intervals in the raw score conversion data in the appendix
of the test manual. If these age intervals did not match and could not be proven to
provide each subgroup with at least 100 subjects, this criterion was considered
unmet.
Consequences if unmet. If a small sample size is used, the norms are likely to be
less reliable, that is, less stable; therefore, use of a different group of children
might have resulted in different norms. In addition, relatively rare individuals
(e.g., language- or articulation-impaired children) might not have been included in
the sample because of the small number of subjects. This possibility makes the
interpretation of the scores of possibly impaired children difficult. The smaller the
sample size, the greater are the problems (Salvia & Ysseldyke, 1981, p. 123;
Weiner & Hoock, 1973).
Criterion 3. The reliability and validity of the test should be promoted through the
use of systematic item analysis during item construction and selection (Anastasi,
1976, p. 198; McCauley & Swisher, 1984). To pass this criterion, the test manual
needed to report evidence that quantitative methods were used to study and
control item difficulty, item validity, or both. Therefore, content validity had to be
demonstrated, showing item relevance as well as content coverage.
Consequences if unmet. Although the consequences would be less important if all
other criteria to be discussed here were met by a test, that is rarely the case. Few
norm-referenced tests provide adequate evidence that the test measures accurately
what it purports to measure. This criterion, then, serves as an alternative
indication that the test may possess validity and reliability. According to Anastasi
(1976, p. 198), "high reliability and validity can be built into a test in advance
through item analysis."
Criterion 4. Measures of central tendency and variability of test scores should be
reported in the manual for relevant subgroups examined during the objective
evaluation of the tests (APA, 1974, p. 22; McCauley & Swisher, 1984). To pass
this criterion, both the mean and the standard deviation had to be given for the
total raw scores of all relevant subgroups.
Consequences if unmet. The mean is the average score received by members of
the normative subgroup. The standard deviation gives the test user an estimate of
how much variation was shown by the scores received by subgroup members.
Because these pieces of information can also serve as the basis for other ways of
presenting the norms (e.g., z scores), their absence robs the test user of flexibility
in the use of the test norms.
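The role of the reported mean and standard deviation can be sketched briefly: once both are known, any raw score can be re-expressed as a z score or rescaled to another standard-score convention. The normative values below are hypothetical, chosen only for illustration.

```python
# Hypothetical example: converting a raw score to a z score and to a
# standard score, given a subgroup's reported mean and standard deviation.
def z_score(raw, mean, sd):
    """Number of standard deviations a raw score lies from the subgroup mean."""
    return (raw - mean) / sd

def standard_score(raw, mean, sd, new_mean=100, new_sd=15):
    """Re-express a raw score on a conventional scale (default M = 100, SD = 15)."""
    return new_mean + z_score(raw, mean, sd) * new_sd

# Suppose a subgroup of 4-year-olds had a raw-score mean of 42 and an SD of 6.
print(z_score(36, 42, 6))         # -> -1.0
print(standard_score(36, 42, 6))  # -> 85.0
```

Without the subgroup mean and standard deviation, neither conversion is possible, which is exactly the loss of flexibility the criterion describes.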
Criterion 5. Evidence of concurrent validity should be supplied in the test manual
(APA, 1974, pp. 26-27; McCauley & Swisher, 1984). To pass this criterion, the
test manual needed to provide empirical evidence that categorizations of children
as normal or impaired obtained using the test agree closely with categorizations
obtained by other methods that can be considered valid, for example, clinician
judgments or scores on other validated tests.
Consequences if unmet. The absence of this kind of evidence calls into question a
test's ability to help with assessment questions related to the existence of
impairment. Because the reason for using a norm-referenced test of language or
articulation is to enable the test user to compare a child's score against the scores
of other children as an aid in the determination of normalcy, the failure of a test
on this criterion should cause the test user to question the usefulness of that test.
Criterion 6. Evidence of predictive validity should be supplied in the test manual
(APA, 1974, pp. 26-27; McCauley & Swisher, 1984). To pass this criterion, a test
manual needed to include empirical evidence that it could be used to predict later
performance on another, valid criterion of the speech behavior addressed by the
test in question.
Consequences if unmet. Evidence of this kind of predictive validity enables the
test user to use a test to make assessment decisions related to the need for therapy.
Therefore, its absence means that other possibly invalid sources of information
will be weighed more heavily in the decision process.
Criterion 7. An estimate of test-retest reliability for relevant subgroups should be
supplied in the test manual (APA, 1974, pp. 50, 54; McCauley & Swisher, 1984).
To pass this criterion, the test manual needed to supply empirical evidence of test-
retest reliability, including a correlation coefficient of .90 or better (Salvia &
Ysseldyke, 1981, p. 98) that was statistically significant at or beyond the .05 level
(Anastasi, 1976, pp. 108-109). If statistical significance was not addressed in the
test manual, this criterion was marked as unmet.
Consequences if unmet. Without this information, the test user does not know to
what extent test results are stable and to what extent they will fluctuate over time.
If this kind of reliability is low, the test should be viewed with the suspicion one
would have of a rubber ruler. A correlation coefficient of .90 is considered a
minimum standard of reliability when a test is used to make decisions regarding
the existence of a problem (Salvia & Ysseldyke, 1981, p. 98). If a relatively high
correlation coefficient is not also statistically significant, the high value may
reflect the specific characteristics of the individuals chosen for the normative
sample and might not be duplicated if the standardization process were repeated
with other, similar groups.
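For illustration, the two-part numeric check in Criterion 7 (a correlation coefficient of at least .90 that is also statistically significant) can be sketched in a few lines of Python. This is not part of the review procedure itself, only a minimal worked example; the score lists and the hard-coded critical t value are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def meets_criterion_7(time1, time2, t_critical):
    """Check the two-part standard: r >= .90, and r significant as judged
    by comparing t = r * sqrt((n - 2) / (1 - r**2)) to a critical value."""
    r = pearson_r(time1, time2)
    n = len(time1)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return r >= 0.90 and abs(t) > t_critical

# Hypothetical test and retest scores for 6 children; 2.776 is the
# two-tailed critical t at alpha = .05 with df = n - 2 = 4.
print(meets_criterion_7([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 7], 2.776))  # prints True
```

In practice a statistics package would report an exact p value rather than a table lookup, but the sketch shows why a high coefficient from a very small sample can still fail the significance half of the criterion.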
Criterion 8. Empirical evidence of interexaminer reliability should be given in the
test manual (APA, 1974, p. 50; McCauley & Swisher, 1984). To pass this
criterion, a test manual needed to report evidence of interexaminer reliability that
included a correlation coefficient of .90 or better (Salvia & Ysseldyke, 1981, p.
98) that was statistically significant at or beyond the .05 level (Anastasi, 1976, pp.
108-109). If statistical significance was not addressed in the test manual, this
criterion was marked as unmet.
Consequences if unmet. Without an estimate of interexaminer reliability, the test
user does not know the degree to which a test taker is likely to receive similar
scores if the test is given again by different individuals or if the same test is
scored by different individuals. Neither can it be known whether the tester is
likely to affect the scores of test takers in a way that improperly penalizes or
favors them.
Criterion 9. Test administration procedures should be described in sufficient
detail to enable the test user to duplicate the administration and scoring
procedures used during test standardization (APA, 1974, p. 18; McCauley &
Swisher, 1984). To pass this criterion, the test manual needed to provide sufficient
description so that, after reading the test manual, the reviewer believed she could
administer and score the test without grave doubts about correct procedures.
Consequences if unmet. Unless the administration process is described in detail,
the test user does not know whether it is reasonable to compare the test taker's
performance to the norms. Information supplied in the test manual should include
specific instructions on how to administer, score, and interpret the test and on the
setting in which testing should be conducted. If a test is administered without
duplicating the procedures followed during standardization, the test taker may be
given an unfair advantage or may be unfairly penalized by differences in
instructions, surroundings, and so forth.
Criterion 10. The test manual should supply information about the special
qualifications required of the test administrator or scorer (APA, 1974, p. 15;
McCauley & Swisher, 1984; Salvia & Ysseldyke, 1981, p.18). To pass this
criterion, the test manual needed to state both general and specialized training
required for administrators and scorers.
Consequences if unmet. Information of this kind should be given for all tests
because the administration, scoring, and interpretation of test results should only
be done by a qualified person. Moreover, for rating scales this information is
particularly crucial because such scales use observations as test data. Without this
information, it is impossible for the test user to judge the quality of the data
obtained with the test.
Criterion 11. Evidence of construct validity should be supplied in the test
manual (McCauley & Strand, 2008). To pass this criterion, the test manual needed
to address any of the following to meet the operational definition: (a) evidence
from a factor analytic study confirming expectations of the test's internal
structure, (b) evidence that test performance improves with age, or (c) evidence
that groups predicted to differ in test performance actually do so. In addition,
the evidence needed to come from a study in which statistical methods and
participants were described.
Consequences if unmet. "Almost any kind of information about a test can
contribute to an understanding of its construct validity, but the contribution
becomes stronger if the degree of fit of the information with the theoretical
rationale underlying score interpretation is explicitly evaluated" (Hutchinson,
1996, p. 111; Messick, 1989, p. 17). Therefore, without evidence of construct
validity, a test is less able to give clinicians an instrument they can feel
comfortable with and view as accurate.
The review of each test manual was first carried out by the author using a scoring
sheet that reiterated each of the operational criteria described above. Before beginning
the review process, the author and a second reviewer, also an undergraduate, practiced
by scoring 2 norm-referenced tests that did not address speech sound development, by
which time they had achieved 91% agreement.
Once all of the available tests were reviewed by the author, the second reviewer
randomly selected 20 percent (n = 2) of the tests for review to check accuracy. The
second reviewer was familiar with the criteria and review process of this study because
she was conducting a similar study in which she reviewed 16 norm-referenced language
tests for preschool children before reviewing the speech sound production tests for
agreement.
Results
Only 9 tests were identified through the selection process.
The second examiner’s ratings agreed with those of the first examiner on 95.5%
of the 22 ratings (2 tests x 11 criteria). Agreement was 10 of 11 criteria (91%)
for one test and 11 of 11 criteria (100%) for the other. For each rating of
pass or fail on which the examiners disagreed, the test manual was consulted and the
disagreement was reconciled.
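The point-by-point agreement figures reported above are simple to compute: matching ratings divided by total ratings. A minimal sketch follows; the pass/fail vectors are hypothetical, arranged so that 21 of 22 ratings match, as in this study.

```python
def percent_agreement(ratings_a, ratings_b):
    """Point-by-point agreement between two raters, as a percentage."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)

# 2 tests x 11 criteria = 22 pass/fail ratings; the raters disagree on one.
rater1 = ["pass"] * 22
rater2 = ["pass"] * 21 + ["fail"]
print(round(percent_agreement(rater1, rater2), 1))  # prints 95.5
```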
In Table 1, the number of tests meeting each criterion is summarized for the final
rating results. Results for the 9 speech tests were combined to calculate the
number and percentage of tests meeting that criterion. The majority of the tests met over
half of the 11 psychometric criteria. The criteria that were most frequently met were
description of test procedures (Criterion 9), description of examiner qualifications
(Criterion 10), and evidence of construct validity (Criterion 11). All of the tests met these
criteria. The two criteria related to evidence of content validity (Criterion 3) and reporting
means and standard deviations (Criterion 4) were met by 8 of the reviewed tests.
The criteria that were most frequently unmet dealt with evidence of test-retest
reliability (Criterion 7) and evidence of interexaminer reliability (Criterion 8). None of
the tests met these criteria based on the operational definitions for test-retest and
interexaminer reliability. Although the majority of the tests reported a correlation
coefficient of at least .90, they did not address statistical significance, resulting in
failure on those criteria.
The criteria dealing with description of the normative sample (Criterion 1) and
evidence of predictive validity (Criterion 6) were each met by only one of the tests.
Because information about the sample's geographic region, socioeconomic status, and
normalcy was needed to pass Criterion 1, tests that provided information on only one
of those areas failed. Many of the reviewed tests used the mother's educational level
as the index of socioeconomic status; however, the family's income level was
needed to meet the requirement for socioeconomic status in this review.
Table 2 lists the names of tests meeting each criterion. Although certain names
appear more often than others, the pattern of met criteria observed in Table 1 is not due
to a few tests that met almost all criteria while most met none. This fact is
illustrated graphically in Figure 1, which records the number of tests that met one or
more criteria, two or more criteria, and so on. The largest number of criteria met by any
test was 7, achieved by 5 tests. The lowest number met by any test was 4. The median
number of criteria met was 6.
Results Compared to Results from McCauley & Swisher (1984)
When comparing the changes that have occurred in the past 25 years, as reflected
in the results of this study and those of McCauley and Swisher (1984), one can feel
encouraged by the progress in speech sound production tests' inclusion of psychometric
information. Whereas in 1984 only 5 tests were available for review, that number has
nearly doubled to 9 in 2010, so clinicians now have several more choices available to
them when selecting an instrument to assess a child.
The number of criteria met by tests was minimal in the 1984 (McCauley &
Swisher) review; however, these numbers have strengthened according to the results of
this study. Of the 5 articulation tests for preschool children reviewed by McCauley and
Swisher (1984), the highest number of criteria met by a test was 3 of the 10 (30%).
Otherwise, 2 of the tests met at least 2 of the criteria and 4 of the tests met at least 1 of
the criteria. In contrast, in this study, the highest number of criteria met by a test was
7 of the 11 (64%), a number reached by 5 of the tests. In addition, all of the tests met
at least 4 of the criteria and 8 of the tests met at least 5. These numbers can be
interpreted as good progress in the inclusion of psychometric data in speech sound
production tests over the last 25 years.
Discussion
Readers can compare findings across more than two decades by examining
Figure 1, which shows the number of reviewed tests that met at least n criteria in
McCauley and Swisher (1984) and in this study.
The number of criteria addressed by the tests reviewed has increased markedly
between the 1984 (McCauley & Swisher) study and this one. In the 1984 (McCauley &
Swisher) review, 7 of the 10 (70%) criteria were not met by any test; in this study, that
number decreased to 2 of the 11 (18%) criteria remaining unmet. Accordingly, only 3
of 10 (30%) criteria were met by tests in McCauley & Swisher's (1984) review, whereas
9 of 11 (82%) were met by tests in this review. The only criteria met in 1984
(McCauley & Swisher) were those dealing with content validity and item analysis
(Criterion 3), means and standard deviations (Criterion 4), and description of test
procedures (Criterion 9). In this study, those criteria were again among the most
frequently met by the tests reviewed. However, as stated previously, the criteria related
to reliability are those that still require improvement in speech tests for preschool
children. Tables 1 and 3 compare the number (and percentage) of tests meeting each
criterion in McCauley and Swisher (1984) and in this study.
The pattern that emerges from this review may be somewhat discouraging.
However, in comparison with the results of the 1984 study, there has been substantial
improvement in the inclusion of these psychometric criteria over the past 25 years.
Similar to the finding of McCauley & Swisher (1984), the two criteria that would
require the smallest amount of time and financial resources were met most often:
Criterion 9 (description of test procedures) and Criterion 10 (description of examiner
qualifications). This is somewhat predictable given the knowledge, time, and money
required to address psychometric data dealing with reliability and validity, which were
the criteria most frequently unmet (McCauley & Swisher, 1984, p. 40).
On the other hand, it can be reassuring to speech-language pathologists that
speech sound production test publishers are making an effort to address the psychometric
information appropriate to a norm-referenced test. Although Criterion 11 (evidence of
construct validity) was added to the 10 criteria used in the original study, all of the tests
were able to meet the requirements of that criterion. Because the operational definition
of construct validity could be satisfied by several kinds of evidence, all of the tests
reviewed showed some form of empirical evidence of validity. As for the continuing
failure on reliability, statistical significance may be taken for granted when calculating
test-retest and interexaminer reliability for tests using large standardization samples
and, therefore, may simply not be reported as part of the results. However, because this
cannot be confirmed by the information given, these criteria were still left unmet for all
tests reviewed.
Summary
All of the current 9 commercially available norm-referenced tests of speech sound
production met at least 4 of the 11 criteria examined. Five of them (about half) met 7 of
the 11 criteria used in this study. Criteria that tended to be met by all (n = 9) or almost all
(n = 8) of the currently available tests related to the description of test procedures,
examiner qualifications, and the normative data (reporting of means and standard
deviations), as well as to evidence of validity (content and construct). However, test-retest
and inter-examiner reliability were not adequately reported by any of the tests in the
current group.
Whereas in 1984 only 5 tests were available for review, that number has nearly
doubled to 9 in 2010. The number of criteria met also essentially doubled: whereas in
1984 the maximum percentage of criteria met was 30% (3 of 10), achieved by only one
test, for the current group the maximum was 7 of 11 (64%), and more than half of the
reviewed tests met that number. A continuing weakness of tests in this area was the
absence of adequate evidence of test-retest and inter-examiner reliability; however, one
aspect of the criteria used to review these psychometric characteristics may have been
responsible for this outcome and will require further investigation.
In short, although speech sound production tests appear to be improving in their
psychometric quality, more can still be done to produce the best possible tools for use
with preschool children.
References
American Psychological Association. (1974). Standards for educational and
psychological tests. Washington, DC: APA.
American Speech-Language-Hearing Association. (2009). Directory of Speech-Language
Pathology Assessment Instruments. Retrieved November 14, 2009, from
www.asha.org/SLP/assessment/.
Anastasi, A. (1976). Psychological testing (4th ed.). New York: Macmillan.
Andersson, L. (2005). Determining the adequacy of tests of children’s language.
Communication Disorders Quarterly, 26(4), 207-225.
Buros Institute. (2009). Test Reviews Online. Retrieved October, 16, 2009, from
www.unl.edu/buros.
Hutchinson, T. A. (1996). What to look for in the technical manual: Twenty questions for
users. Language, Speech, and Hearing Services in Schools, 27, 109-121.
Linn, R., & Gronlund, N. (2000). Measurement and assessment in teaching.
Upper Saddle River, NJ: Merrill/Prentice Hall.
McCauley, R. J., & Strand, E. A., (2008). A review of standardized tests of nonverbal
oral and speech motor performance in children. American Journal of Speech-
Language Pathology, 17, 81-91.
McCauley, R. J., & Swisher, L. (1984). Psychometric review of language and articulation
tests for preschool children. Journal of Speech and Hearing Disorders, 49, 34-42.
Messick, S. (1989). Meanings and values in test validation: The science and ethics of
assessment. Educational Researcher, 18, 5-11.
Peña, E. D., Spaulding, T. J., & Plante, E. (2006). The composition of normative groups
and diagnostic decision making: Shooting ourselves in the foot. American Journal
of Speech-Language Pathology, 15, 247-254.
Plante, E., & Vance, R. (1994). Selection of preschool language tests: A data based
approach. Language, Speech, and Hearing Services in Schools, 25, 19-24.
Salvia, J., & Ysseldyke, J. (1981). Assessments in special and remedial education (2nd
ed.). Boston: Houghton Mifflin.
Salvia, J., & Ysseldyke, J. E. (2004). Assessment: In special and inclusive education
(9th ed.). Boston: Houghton Mifflin.
Salvia, J., Ysseldyke, J.E., & Bolt, S. (2010). Assessment in special and inclusive
education (11th Edition). Belmont, CA: Wadsworth Cengage Learning.
Shipley, K. G., & McAfee, J. G. (2009). Assessment in speech-language pathology: A
resource manual (4th ed). New York: Delmar Cengage Learning.
Weiner, P., & Hoock, W. (1973). The standardization of tests: Criteria and criticisms.
Journal of Speech and Hearing Research, 16, 616-626.
Appendix. Reviewed Tests
Dawson, J.I., & Tattersall, P.J. (2001). Structured Photographic Articulation Test II
(SPAT-D II). DeKalb, IL: Janelle.
Dodd, B., Hua, Z., Crosbie, S., Holm, A., & Ozanne, A. (2006). Diagnostic Evaluation of
Articulation and Phonology (DEAP). San Antonio, TX: Pearson Assessments.
Fudala, J. B., & Reynolds, W. (2000). Arizona Articulation Proficiency Scales- Third
Revision (Arizona-3). Los Angeles: Western Psychological Services.
Goldman, R., & Fristoe, M. (2000). Goldman-Fristoe Test of Articulation- Second
Edition (GFTA-2). Circle Pines, MN: AGS.
Hodson, B. W. (2003). Hodson Assessment of Phonological Patterns- Third Edition
(HAPP-3). Austin, TX: Pro-Ed.
Khan, L. M., & Lewis, N.P. (2002). Khan-Lewis Phonological Analysis- Second Edition
(KLPA-2). Circle Pines, MN: AGS.
Lowe, R. J. (2000). Assessment Link between Phonology and Articulation, Revised
(ALPHA-R). Mifflinville, PA: ALPHA Speech & Language Resources.
Secord, W., & Donohue, J. (1998). Clinical Assessment of Articulation and Phonology
(CAAP). Greenville, SC: Super Duper.
Wilcox, K., & Morris, S. (1999). Children’s Speech Intelligibility Measures (CSIM). San
Antonio, TX: The Psychological Corporation.
Table 1.
The number of tests meeting each of the 11 psychometric criteria. (Percentage of tests
meeting each criterion is given in parentheses.)

Criterion                                      Number (Percentage) of Tests Meeting Criterion
1. Description of normative sample             1 (11%)
2. Sample size                                 3 (33%)
3. Content validity (item analysis)            8 (89%)
4. Means and standard deviations               8 (89%)
5. Concurrent validity                         6 (67%)
6. Predictive validity                         1 (11%)
7. Test-retest reliability                     0 (0%)
8. Inter-examiner reliability                  0 (0%)
9. Description of test procedures              9 (100%)
10. Description of examiner qualifications     9 (100%)
11. Construct validity                         9 (100%)
Table 2.
Tests meeting each criterion. Test name abbreviations are those given in the Appendix.

Criterion                                   No.  Tests
1. Description of normative sample          1    HAPP-3
2. Sample size                              3    Arizona-3, CAAP, SPAT-D-II
3. Content validity (item analysis)         8    Arizona-3, CAAP, CSIM, DEAP, GFTA-2, HAPP-3, KLPA-2, SPAT-D-II
4. Means and standard deviations            8    ALPHA-R, Arizona-3, CAAP, DEAP, GFTA-2, HAPP-3, KLPA-2, SPAT-D-II
5. Concurrent validity                      6    Arizona-3, CAAP, CSIM, DEAP, HAPP-3, SPAT-D-II
6. Predictive validity                      1    DEAP
7. Test-retest reliability                  0    -
8. Inter-examiner reliability               0    -
9. Description of test procedures           9    ALPHA-R, Arizona-3, CAAP, CSIM, DEAP, GFTA-2, HAPP-3, KLPA-2, SPAT-D-II
10. Description of examiner qualifications  9    ALPHA-R, Arizona-3, CAAP, CSIM, DEAP, GFTA-2, HAPP-3, KLPA-2, SPAT-D-II
11. Construct validity                      9    ALPHA-R, Arizona-3, CAAP, CSIM, DEAP, GFTA-2, HAPP-3, KLPA-2, SPAT-D-II
Table 3.
The number of tests meeting each of the 11 psychometric criteria in comparison
with the McCauley & Swisher (1984) results. (Percentage of tests meeting each criterion
is given in parentheses.)

Criterion                                   McCauley & Swisher (1984)   Mathias (2010)
1. Description of normative sample          0 (0%)                      1 (11%)
2. Sample size                              0 (0%)                      3 (33%)
3. Content validity (item analysis)         1 (20%)                     8 (89%)
4. Means and standard deviations            2 (40%)                     8 (89%)
5. Concurrent validity                      0 (0%)                      6 (67%)
6. Predictive validity                      0 (0%)                      1 (11%)
7. Test-retest reliability                  0 (0%)                      0 (0%)
8. Inter-examiner reliability               0 (0%)                      0 (0%)
9. Description of test procedures           4 (80%)                     9 (100%)
10. Description of examiner qualifications  0 (0%)                      9 (100%)
List of Figures
Figure 1. The number of 9 reviewed tests that met at least n criteria (Mathias, 2010) in
comparison with the number of 5 reviewed tests that met at least n criteria (McCauley &
Swisher, 1984).