Recommended citation: Boyle, G. J. (1996). Psychometric limitations of the Personality Assessment Inventory: A reply to Morey's (1995) rejoinder. Journal of Psychopathology and Behavioral Assessment, 18(2), 197-203. ISSN 1573-3505.
http://epublications.bond.edu.au/hss_pubs/791
Psychometric Limitations of the Personality Assessment
Inventory: A Reply to Morey's (1995) Rejoinder
Gregory J. Boyle
Department of Psychology
Bond University
Note. Morey referred to the PAI instrument as a "test." However, it is not really
appropriate to refer to self-report personality scales as "tests," particularly when
there are no right or wrong answers, as such. Only objective, performance
measures (T-data) can rightly claim the title "test" (cf. Cattell & Kline, 1977).
Some psychometric problems with the Personality Assessment Inventory
(PAI) were observed by Boyle and Lennon (1994). However, Morey (1995)
asserted there were "several methodological and conceptual limits" and alternative
explanations of the Boyle and Lennon data. Although Morey asserted that age and
clinical status were confounded, Boyle and Lennon statistically partialled out
variance due to age, using ANCOVA procedures. Morey's description of Boyle
and Lennon's sample as "unusual" was strange: schizophrenic and alcoholic
patients in a psychiatric hospital comprised the clinical groups, and the PAI was
designed specifically to assess psychopathology in such patients. Although Morey
claimed that alpha coefficients were misinterpreted, Boyle and Lennon based their
conclusions solely on the obtained coefficients. Morey's attempt to downplay the
finding of suboptimal stability for several PAI scales also runs counter to the
empirical results actually observed. Finally, Morey attempted to minimize the role
of factor analysis in investigating construct validity, apparently to deflect attention
from deficiencies in the factor analysis of a clinical sample reported in the PAI
manual.
Morey (1995) criticized Boyle and Lennon's (1994) finding that the
median test-retest reliability coefficient for the Personality Assessment Inventory
(PAI) scales measured over a 28-day interval was only .73. Morey claimed that
there was a restriction in the range of PAI scores, so that the less than optimal
reliability was largely a statistical artifact. Yet perusal of actual scores obtained by
a large subsample of 70 normal individuals from the general adult population at
large (in whom personality characteristics would be expected to be relatively
stable across an interval of only 4 weeks) revealed a wide range of PAI scores,
despite Morey's suggestion to the contrary (mean scale scores alone ranged from
2.80 to 22.16, and standard deviations ranged from 2.75 to 9.67). In any event,
restriction of variance was not a particularly problematic issue in Boyle and
Lennon's study, because the primary concern was whether or not PAI scale scores
were ranked similarly or dissimilarly by the normal group on each measurement
occasion. Had the PAI scales been more stable, a greater similarity of rankings
would have been observed than was the case. Although Morey regarded his
argument as "self-explanatory," the alleged "restriction of range" of PAI scores
within Boyle and Lennon's normal sample was not supported by the actual
empirical evidence.
Morey argued about minimum reliability "cutoff" points, despite the
absence of any such claim by Boyle and Lennon. Ideally, if perfectly reliable, PAI
scales would have exhibited stability coefficients of 1.00. A median coefficient of
.73 indicates that half the observed stability coefficients were actually lower than
this value, a clearly undesirable finding for a personality trait inventory. Indeed,
only 53% of the variance was common across both measurement occasions (i.e., at
least 47% of the measurement variance was associated with dissimilar rankings of
PAI scores).
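The variance figures above follow from squaring the retest coefficient. The following is a minimal Python sketch of that arithmetic, assuming (as the text does) that the squared test-retest correlation is read as the proportion of variance common to both occasions. The function name `shared_variance` is illustrative only and does not come from Boyle and Lennon (1994).

```python
def shared_variance(retest_r: float) -> float:
    """Proportion of variance common to both occasions (r squared)."""
    if not -1.0 <= retest_r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return retest_r ** 2


median_r = 0.73  # median 28-day retest coefficient reported in the text
common = shared_variance(median_r)
discrepant = 1.0 - common  # remaining variance, attributed to dissimilar rankings
print(f"common: {common:.0%}, discrepant: {discrepant:.0%}")
```

Rounded to whole percentages, this reproduces the 53% and 47% figures quoted in the paragraph.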
In calculating test-retest coefficients, Morey stated, "admittedly, in the
psychopathology field this is rather difficult, since ethical considerations preclude
withholding treatment for purposes of generating reliability estimates." It was
precisely because of this dilemma that Boyle and Lennon chose to assess stability
in a normal sample, which was expected to produce upper-bound estimates of
reliability. There is no a priori reason not to investigate the psychometric
properties of a clinical personality instrument in more readily accessible,
non-psychiatric samples. For Morey to claim that "this artifact must be recognized as a
complication" seems strange. This assertion was not based on empirical evidence,
but was merely presumed. Morey set up "a straw man" and then proceeded to
"knock it down." Scientific discourse should be based on concrete evidence, rather
than on unsupported allegations.
Morey stated that, "the PAI includes measures of anxiety, depressed affect,
and suicidal ideation, features that might be expected to fluctuate widely over the
course of one month." That might be true for certain clinical samples but would
not be expected among psychologically stable normal individuals. Indeed, the
proportion of variance common across both measurement occasions for the
anxiety scale was only 38%, so that no less than 62% of variance was due to
dissimilar rankings of scores, even among normal, psychologically stable
individuals. Likewise, for the Paranoia (PAR) and Antisocial Features (ANT)
scales, 58% and 60% of the variance, respectively, was discrepant; i.e., more of
the variance was unreliable than reliable!
Using normal samples himself (despite his criticism of Boyle & Lennon),
Morey (1991) reported a stability coefficient of .90 for the Antisocial Features
(ANT) scale, thereby contradicting his assertion that normal samples are
inappropriate for evaluating the psychometric properties of the PAI instrument
and, by implication, contradicting his (1995) claim that there is undue restriction
of range in scores for normal samples. Evidently, Morey's argument about
"restriction of range" was refuted by his own empirical findings! In stating that the
WAIS-R scales do not all exhibit retest coefficients of .80, Morey referred to
intelligence as "a more stable construct" but failed to acknowledge that intellectual
functioning fluctuates markedly in response to biological, neuropsychological, and
situational factors (see Stankov, Boyle, & Cattell, 1995). In any event, suboptimal
stability of WAIS-R scales would suggest that the instrument could stand
considerable refinement, especially in light of the work showing that such models
of intelligence are overly narrow (e.g., Boyle, 1995a; Cattell, 1987; Gardner,
1993; and Sternberg, 1994).
The finding of less than optimal test-retest coefficients led Boyle and Lennon to
point to the possibility that the PAI scales exhibit only relative stability, somewhat
akin to dynamic traits (cf. Cattell & Child, 1975), rather than more enduring
personality traits. Although Morey claimed that Boyle and Lennon had confused
dynamic traits with states, Morey himself failed to make the necessary conceptual
discriminations. As Boyle had pointed out in many previous papers (e.g., Boyle,
1979, 1983, 1985, 1988; Boyle & Cattell, 1984), stable trait measures should
exhibit high immediate test-retest (dependability) coefficients, as well as high
longer-term retest (stability) coefficients, whereas transitory state measures should
exhibit high dependability but considerably lower stability, if the measures are
truly sensitive to situational fluctuations on different measurement occasions (cf.
Fernandez, 1990; Fernandez, Nygren, & Thorn, 1991). Dynamic traits [including
motivational dynamic traits such as those measured in the objective IPAT
Motivation Analysis Test (see Cattell, 1992)] theoretically should exhibit test-retest
coefficients intermediate in magnitude between those observed for stable traits, on
the one hand, and transitory states, on the other [see Cattell (1973, p. 354) for a
detailed account of consistency coefficients and their implications for state,
dynamic trait, and enduring trait scales]. Without considering the complex issues
regarding test-retest reliability, as a function of the state-dynamic-trait continuum,
Morey's denial of the observed instability of several of the PAI scales cannot be
sustained logically.
With regard to item homogeneity estimates, Morey claimed that Boyle and
Lennon interpreted high alpha coefficients as "bad." Having made this rather
dogmatic assertion, Morey then contradicted himself by admitting that "internal
consistency can be too high." In fact, as Boyle (1991) has discussed, both "internal
consistency" and "item redundancy" are value judgments. Cronbach alpha
coefficients merely indicate the level of item homogeneity of a scale. Morey
further claimed that Boyle and Lennon's findings only raised the possibility of
excessive item redundancy in the PAI scales. Yet the obtained median coefficient
was no less than .83, indicating that for half of the PAI scales, alpha coefficients
were exceptionally high, exceeding this figure (cf. Boyle, 1991). Morey asserted
that, "There are simply too many influences upon the alpha coefficient to warrant
the conclusion that high alphas invariably indicate problems." However, Boyle
and Lennon only stated (p. 182) that the observed high alpha coefficients
"suggested the possibility of rather narrow scales, with excessive item
redundancy" (cf. Boyle, 1991).
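The statement that alpha merely indexes item homogeneity can be illustrated numerically. The following is a minimal Python sketch using the standard formula for Cronbach's alpha; the two small data sets are hypothetical (they are not PAI data), and the function name `cronbach_alpha` is an illustrative choice. One scale is built from near-duplicate items and the other from more varied items.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha; `items` holds one list of respondent scores per item."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # scale score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical data: six respondents, four items per scale.
# Items that are near-duplicates of one another:
narrow_scale = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 5],
    [2, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 6, 6],
]
# Items that tap more varied aspects of a construct:
broad_scale = [
    [1, 2, 3, 4, 5, 6],
    [3, 1, 4, 2, 6, 5],
    [2, 4, 1, 5, 3, 6],
    [4, 3, 2, 6, 1, 5],
]

narrow_alpha = cronbach_alpha(narrow_scale)
broad_alpha = cronbach_alpha(broad_scale)
print(f"narrow: {narrow_alpha:.2f}, broad: {broad_alpha:.2f}")
```

In this toy example the near-duplicate items yield an alpha close to 1, whereas the more varied items yield a clearly lower value. A very high alpha is therefore compatible with item redundancy, although it does not by itself demonstrate it.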
Morey subsequently asserted that "high internal consistency is obviously
not a problem if a scale can be validated against external criteria." However,
Morey has not considered the implications of the breadth of measurement of a
construct. As Boyle (1991) pointed out, high alpha coefficients can be obtained if
all items in a scale are merely paraphrases of each other, but the construct is
then being measured in a very limited, narrow fashion, and many of the items
are redundant and could be dispensed with. Just because an instrument is popular
(Morey referred to the Beck Depression Inventory and the Hamilton Rating Scale
as good measures against which to assess the external or concurrent validity of the
PAI), it does not follow that such instruments necessarily make good external
criterion measures, especially since both the BDI and the HRS also have some
psychometric deficiencies (see Boyle, 1985).
As for prevalence rates of alcohol problems, it is unlikely that there are
major differences between the United States and Australia. Thus, 18% would be
the comparable figure for Queensland. Nevertheless, Morey was correct in
highlighting the particular composition of the non-psychiatric sample, since the
sample composition may have contributed to the apparent number of false
positives on the Alcohol Problems (ALC) scale. In these circumstances, it seems
likely that the ALC scale did not seriously overestimate the number of alcoholic
cases.
Morey questioned Boyle and Lennon's finding that several PAI scales did
not discriminate between schizophrenic and alcoholic clinical samples, and
asserted that there is no need for all PAI scales to exhibit discriminative validity.
However, it would seem reasonable to expect that schizophrenic and alcoholic
patients should differ significantly on the PAI scales labelled Schizophrenia
(SCZ), Dominance (DOM), and Warmth (WRM). That these scales failed to
differentiate between the two clinical groups raises questions concerning their
validity. Alluding to the inadequate discriminative validity of the MMPI clinical
scales (see the comprehensive critique by Helmes & Reddon, 1993) does not
detract from the poor discriminative validity of several of the PAI scales (and of
abnormal personality instruments, in general). Furthermore, Morey's suggestion
about removing protocols with high Negative Impression (NIM) scores may not
necessarily improve PAI scale discriminative validities.
Morey's criticism of the possible confounding effect of age differences in
the clinical and non-psychiatric groups ignored the very careful consideration
given to the statistical handling of this variable. According to Boyle and Lennon (p.
178), "MANCOVAs were carried out on the data with gender and age as
covariates, in order to correct statistically for distorting effects due to those
variables. The main effect across groups was still significant ... after the effects of
age and gender were partialled out." Partialling out the variance of demographic
variables by treating them statistically as covariates clearly refutes Morey's claim
that demography and clinical status were confounded.
Morey also asserted that differing numbers of subjects in the cells of the
design were problematic for the discriminative validity of PAI scales. However,
Boyle and Lennon (p. 178) specifically pointed out that "in order to counteract the
effects of heterogeneity of variance due to unequal group sizes, a more stringent
criterion of statistical significance was used .... In addition . . . Bonferroni
corrections were applied to minimize finding significant between-group
differences due to chance alone." In any event, heterogeneity of variance due to
unequal group sizes should have increased the likelihood of finding significant
differences on the PAI scales. Since several of the scales still failed to discriminate
between the clinical groups, this further suggests inadequate discriminative
validities.
Morey attempted to downplay the finding that his factor analytic solution
for the clinical subjects was poorly replicated when identical "Little Jiffy"
procedures were employed (cf. Comrey & Lee, 1992). He suggested that a minor
typographical error in a single correlation coefficient was responsible for the
discrepant results. Thus, Morey suggested that, "I would encourage them [Boyle &
Lennon] to rerun their analyses correcting this value and I suspect they will
replicate the solution provided in the manual." Accordingly, the suggested minor
correction was made and the factor analysis rerun. This minor change made no
significant difference to the outcome (factor loadings, in general, differed only at
the third or fourth decimal place from the previous analysis by Boyle and Lennon).
Again, SPSS gave a warning message that the correlation matrix was "ill-
conditioned," that it was "not positive definite," and again, the Bartlett test of
sphericity could not be calculated, indicating that the matrix for the clinical sample
reported in Table 10.1 of the PAI manual fails to satisfy multivariate normality
assumptions required for a valid factor analysis.
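The reason Bartlett's test cannot be computed for such a matrix can be sketched in code. The test statistic depends on the logarithm of the determinant of the correlation matrix, which is undefined when the determinant is zero or negative. The following is a minimal Python sketch under stated assumptions: it uses a Cholesky factorization to check positive definiteness, it does not reproduce the internal routines of SPSS, and the example matrices are hypothetical rather than taken from Table 10.1 of the PAI manual. The function names are illustrative.

```python
import math
from typing import Optional


def cholesky_factor(matrix: list[list[float]]) -> Optional[list[list[float]]]:
    """Return the lower-triangular Cholesky factor, or None if the
    matrix is not positive definite."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 1e-12:  # zero or negative pivot: not positive definite
                    return None
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def bartlett_sphericity(corr: list[list[float]], n_subjects: int):
    """Return (chi_square, df) for Bartlett's test of sphericity, or None
    when the determinant is not positive and its logarithm is undefined."""
    lower = cholesky_factor(corr)
    if lower is None:
        return None
    p = len(corr)
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(p))
    chi_square = -(n_subjects - 1 - (2 * p + 5) / 6) * log_det
    return chi_square, p * (p - 1) // 2


# Hypothetical 3 x 3 matrices, for illustration only.
well_behaved = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
not_positive_definite = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]

print(bartlett_sphericity(well_behaved, 100))           # a (chi_square, df) pair
print(bartlett_sphericity(not_positive_definite, 100))  # None
```

For the second matrix the factorization fails, so no chi-square value can be returned. This is the kind of situation in which a statistical package reports that the matrix is not positive definite.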
Morey criticized Boyle and Lennon's factor analytic methodology, but his
own procedures and results were to some extent deficient. It is irrelevant how
many other studies Morey cites to support his higher-order factor solutions (e.g.,
Deisinger, 1995). Philosophical debate about theory and factor analysis and
seeking support from Costa and McCrae (1985) do not overcome the problem.
Indeed, the factor analytic work of Costa and McCrae itself leaves much to be
desired from a methodological and psychometric standpoint (see detailed critiques
by Block, 1995; and Boyle, Stankov, & Cattell, 1995, pp. 431-433).
Morey stated that "using a principal components-varimax factor
technique… does not imply that I believe psychopathological constructs are
orthogonal…" Nevertheless, construct validation of an instrument such as the PAI
is a detailed, extensive process, part of which includes consideration of its factor
analytic validity [see Grossarth-Maticek, Eysenck, & Boyle (1995) for a technical
discussion of construct validity in relation to personality instruments]. Morey's
higher-order factoring procedure unnecessarily precluded the possibility of
checking on the PAI factor validity.
Factor analytic procedures have been well discussed in several authoritative
publications (e.g., Cattell, 1978; Comrey & Lee, 1992; Gorsuch, 1983; McDonald,
1985). Consideration of what constitutes appropriate factor analytic methodology
is critically important if valid factor solutions are to be obtained (cf. Boyle, 1988,
1993; Boyle & Stanley, 1986). For example, Morey recommended factor analyses
at the item-subscale level, yet it is well documented that item correlations are
notoriously unreliable. That is why Comrey (e.g., Hahn & Comrey, 1994) has
advocated the use of factored homogeneous item dimensions (FHIDs), Cattell
(1978) has recommended the use of item parcels, and Marsh (see Boyle, 1994) has
employed item-dyads as his smallest correlational units. As for Morey's call for
non-linear factor analytic procedures, the comments of Gorsuch (1983, pp. 118-120) and
McDonald (1981) seem germane.
Finally, Morey concluded that "an unreflective application of criteria such
as ‘coefficient alpha is bad’ or ‘simple structure is good’ can be quite misleading
depending on the issue in question." Yet Boyle and Lennon never made such
dogmatic claims. It appears that Morey has misread the evidence in attempting to
deflect attention away from the empirically observed shortcomings highlighted in
Boyle and Lennon's study (cf. Boyle, 1995b; and Boyle, Ward, & Steindl, 1994).
Discussion grounded more firmly in the empirical evidence, rather than on
supposition or semantics about the definition of terms such as moderator variables,
would do greater justice to the scientific issues.
References
Block, J. (1995). A contrarian view of the five-factor approach to personality
description. Psychological Bulletin, 117, 187-215.
Boyle, G. J. (1979). Delimitation of state-trait curiosity in relation to state anxiety
and learning task performance. Australian Journal of Education, 23, 70-82.
Boyle, G. J. (1983). Effects on academic learning of manipulating emotional states
and motivational dynamics. British Journal of Educational Psychology, 53,
347-357.
Boyle, G. J. (1985). Self-report measures of depression: Some psychometric
considerations. British Journal of Clinical Psychology, 24, 45-59.
Boyle, G. J. (1988). Exploratory factor analytic principles in motivation research.
In J. R. Nesselroade & R. B. Cattell (Eds.), Handbook of multivariate
experimental psychology (pp. 742-745). New York: Plenum.
Boyle, G. J. (1991). Does item homogeneity indicate internal consistency or item
redundancy in psychometric scales? Personality and Individual
Differences, 12, 291-294.
Boyle, G. J. (1993). Special review: Evaluation of the exploratory factor analysis
programs provided in SPSSX and SPSS/PC+. Multivariate Experimental
Clinical Research, 10, 129-135.
Boyle, G. J. (1994). Self-Description Questionnaire II. In D. J. Keyser & R. C.
Sweetland (Eds.), Test Critiques (Vol. 10, pp. 632-643). Kansas City, MO:
Test Corporation of America.
Boyle, G. J. (1995a). Measurement of intelligence and personality within the
Cattellian psychometric model. Multivariate Experimental Clinical
Research, 11, 47-59.
Boyle, G. J. (1995b). Review of the Personality Assessment Inventory. In J. C.
Conoley & J. Impara (Eds.), Twelfth mental measurements yearbook.
Lincoln, NE: Buros Institute of Mental Measurements.
Boyle, G. J., & Cattell, R. B. (1984). Proof of situational sensitivity of mood states
and dynamic traits-ergs and sentiments-to disturbing stimuli. Personality
and Individual Differences, 5, 541-548.
Boyle, G. J., & Lennon, T. J. (1994). Examination of the reliability and validity of
the Personality Assessment Inventory. Journal of Psychopathology and
Behavioral Assessment, 16, 173-187.
Boyle, G. J., & Stanley, G. V. (1986). Application of factor analysis in
psychological research: Improvement of simple structure by computer
assisted graphic oblique transformation: A brief note. Multivariate
Experimental Clinical Research, 8, 175-182.
Boyle, G. J., Ward, J., & Lennon, T. J. (1994). Personality assessment inventory:
A confirmatory factor analysis. Perceptual and Motor Skills, 79, 1441-
1442.
Boyle, G. J., Stankov, L., & Cattell, R. B. (1995). Measurement and statistical
models in the study of personality and intelligence. In D. H. Saklofske &
M. Zeidner (Eds.), International handbook of personality and intelligence.
New York: Plenum.
Cattell, R. B. (1973). Personality and mood by questionnaire. San Francisco:
Jossey-Bass.
Cattell, R. B. (1978). The scientific use of factor analysis in behavioral and life
sciences. New York: Plenum.
Cattell, R. B. (1987). Intelligence: Its structure, growth and action. Amsterdam:
North Holland.
Cattell, R. B. (1992). Human motivation objectively, experimentally analyzed.
British Journal of Medical Psychology, 65, 237-243.
Cattell, R. B., & Child, D. (1975). Motivation and dynamic structure. London:
Academic.
Cattell, R. B., & Kline, P. (1977). The scientific analysis of personality and
motivation. New York: Academic.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.).
Hillsdale, NJ: Erlbaum.
Costa, P. T., & McCrae, R. R. (1985). The NEO Personality Inventory manual.
Odessa, FL: Psychological Assessment Resources.
Deisinger, J. A. (1995). Exploring the factor structure of the Personality
Assessment Inventory. Assessment, 2, 173-179.
Fernandez, E. (1990). Artifact in pain ratings, its implications for test-retest
reliability, and correction by a new scaling procedure. Journal of
Psychopathology and Behavioral Assessment, 12, 1-15.
Fernandez, E., Nygren, T. E., & Thorn, B. E. (1991). An open-transformed scale
for correcting ceiling effects and enhancing retest reliability: The example
of pain. Perception and Psychophysics, 49, 572-578.
Gardner, H. O. (1993). Intelligence and intelligences: Universal principles and
individual differences. Archives de Psychologie, 61, 169-172.
Gorsuch, R. L. (1983). Factor analysis (rev. 2nd ed.). Hillsdale, NJ: Erlbaum.
Grossarth-Maticek, R., Eysenck, H. J., & Boyle, G. J. (1995). Method of test
administration as a factor in test validity: The use of a personality
questionnaire in the prediction of cancer and coronary heart disease.
Behaviour Research and Therapy, 33, 705-710.
Hahn, R., & Comrey, A. L. (1994). Factor analysis of the NEO-PI and the Comrey
Personality Scales. Psychological Reports, 75, 355-365.
Helmes, E., & Reddon, J. R. (1993). A perspective on developments in assessing
psychopathology: A critical review of the MMPI and MMPI-2.
Psychological Bulletin, 113, 453-471.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of
Mathematical and Statistical Psychology, 34, 110-117.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ:
Erlbaum.
Morey, L. C. (1991). The Personality Assessment Inventory Professional Manual.
Odessa, FL: Psychological Assessment Resources.
Morey, L. C. (1995). Critical issues in construct validation: Comment on Boyle
and Lennon (1994). Journal of Psychopathology and Behavioral
Assessment, 17, 393-401.
Stankov, L., Boyle, G. J., & Cattell, R. B. (1995). Models and paradigms in
personality and intelligence research. In D. H. Saklofske & M. Zeidner
(Eds.), International handbook of personality and intelligence. New York:
Plenum.
Sternberg, R. J. (1994). Experimental approaches to human intelligence. European
Journal of Psychological Assessment, 10, 153-161.