
Journal of Applied Psychology, 1998, Vol. 83, No. 2, 164-178. Copyright 1998 by the American Psychological Association, Inc. 0021-9010/98/$3.00

On the Robustness, Bias, and Stability of Statistics From Meta-Analysis of Correlation Coefficients: Some Initial Monte Carlo Findings

Frederick L. Oswald University of Minnesota, Twin Cities Campus

Jeff W. Johnson Personnel Decisions Research Institutes

Each of several Monte Carlo simulations generated 100 sets of observed study correlations based on normal, heteroscedastic, or slightly nonlinear bivariate distributions, with one population correlation coefficient and true variance of 0. A version of J. E. Hunter and F. L. Schmidt's (1990b) meta-analysis was applied to each study set. Within simulations, ρ̂ was accurate on average. σ̂ρ² was biased; one would correctly conclude more than half the time that no moderator effects existed. However, cases of variation in ρ̂ and especially in σ̂ρ² indicated that results from individual meta-analyses could deviate substantially from what was found on average. Findings for these no-moderator cases offer applied psychologists some guidelines and cautions when drawing conclusions about true population correlations and true moderator effects (e.g., situational specificity, validity generalization) from meta-analytic results.

Like the correlation coefficient, odds ratio, or standardized effect size d, meta-analytic results are statistics. A major difference between them is that the former types of statistics are based on a set of measurements at the individual level, whereas meta-analytic statistics are based on a set of statistics at the study level. Because meta-analytic statistics are based on larger sample sizes than the individual statistics that go into a meta-analysis, they are popularly assumed in the scientific community to have great accuracy and precision in estimating population parameters. We investigated the robustness, bias, and stability of meta-analytic results for restricted cases in which meta-analytic data have no moderator effects across studies and some of the individual study data sets deviate from bivariate normality.

Hunter and Schmidt's (1990b) approaches to meta-analyzing correlation coefficients are widely adopted in the social sciences, particularly in applied psychology.

Frederick L. Oswald, Department of Psychology, University of Minnesota, Twin Cities Campus; Jeff W. Johnson, Personnel Decisions Research Institutes, Minneapolis, MN.

An earlier version of this article was presented at the 11th Annual Conference of the Society for Industrial and Organizational Psychology, San Diego, CA, April 1996. We acknowledge and thank John Campbell, Paul Sackett, and Frank Schmidt for their helpful comments on drafts of this article.

Correspondence concerning this article should be addressed to Frederick L. Oswald, Department of Psychology, University of Minnesota, N218 Elliott Hall, 75 East River Road, Minneapolis, Minnesota 55455. Electronic mail may be sent to oswa0019@gold.tc.umn.edu.


Whether study correlations are corrected by using statistical artifact distributions or are corrected individually, a meta-analysis of correlation coefficients generates two statistics of primary import: ρ̂ is the estimate of the population correlation coefficient, where the population is represented by all samples across studies; σ̂ρ² is an estimate of the true variance associated with ρ and is somewhat more difficult to interpret. When σ̂ρ² is "large enough" (as supported by the chi-square test of homogeneity, judgments of the size of credibility intervals, or the "75% rule"; cf. Hunter & Schmidt, 1990b; Koslowsky & Sagie, 1993), the resulting inference is that true variance remains in the variance of the observed sample correlations, above and beyond that explained by statistical artifacts such as sampling error, measurement unreliability, and range restriction. This true variance then leads to the inference that some subsets of studies have to-be-determined characteristics that affect the true population correlation between x and y. When σ̂ρ² is "small enough," on the other hand, the inference is simpler: All variation in observed correlations can be attributed to statistical artifacts. In other words, the conclusion is that differences between studies and measures do not appear to have any influence on the overall population correlation coefficient once the effects of statistical artifacts are considered, a strong inference indeed.
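To make these two statistics concrete, the sketch below computes them for the simplest, bare-bones case, in which study correlations are not corrected for any artifacts: ρ̂ is the sample-size-weighted mean correlation, and σ̂ρ² is the weighted observed variance minus the sampling-error variance expected under a single population correlation. This is a minimal illustration in Python (the article's own computations were in SAS), the function name is ours, and the full Hunter-Schmidt procedure used in the article additionally corrects each correlation for unreliability and range restriction.

```python
import numpy as np

def bare_bones_meta(r, n):
    """Bare-bones Hunter-Schmidt meta-analysis of correlations.

    r, n: observed study correlations and their sample sizes.
    Returns (rho_hat, var_rho_hat): the N-weighted mean correlation and
    the residual ("true") variance after subtracting the expected
    sampling-error variance.
    """
    r = np.asarray(r, dtype=float)
    n = np.asarray(n, dtype=float)
    rho_hat = np.sum(n * r) / np.sum(n)                    # N-weighted mean r
    var_obs = np.sum(n * (r - rho_hat) ** 2) / np.sum(n)   # weighted observed variance
    # Expected sampling-error variance of r (Hunter & Schmidt, 1990b)
    var_err = (1 - rho_hat ** 2) ** 2 / (np.mean(n) - 1)
    return rho_hat, var_obs - var_err                      # second value may be negative
```

A negative value of the returned variance estimate corresponds to the negative σ̂ρ² values discussed in the Results section and is typically interpreted as 0 (no moderator variance).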

Given the many meta-analyses in psychology that report ρ̂ and σ̂ρ² values and draw conclusions from them, it is important to understand how these statistics behave in a variety of situations that are likely to exist in real-world data. As with other parametric statistical procedures, meta-analysis requires making assumptions about data, which can in turn affect the accuracy of ρ̂ and σ̂ρ².


Two assumptions we focus on in this article are those of linearity and homoscedasticity. We examined the robustness of ρ̂ and σ̂ρ² by simulating meta-analyses based on data that violated linearity or homoscedasticity assumptions. We compared these meta-analyses with simulation results in the bivariate normal case, where the statistical assumptions of meta-analysis (or the statistical correction components thereof) were met. Then, to gauge the performance of these estimators across conditions, we determined the bias and stability of these estimators. Robustness, bias, and stability are defined and discussed below, followed by the details of our simulations.

Robustness

In psychological data, assumptions of linearity, homoscedasticity, and normality are frequently unmet (Ghiselli, 1966; Greener & Osburn, 1980; Guion, 1965; Linn, 1968; Lord & Novick, 1968; however, see Coward & Sackett, 1990). For example, Lee and Foley (1986) reported, on the basis of a sample of 68,672 Navy recruits, that none of the six bivariate distributions they tested satisfied the assumptions of linearity or homoscedasticity. Micceri (1989), in his statistical analysis of 440 large-sample distributions of ability, interest, personality, and criterion-mastery tests, found none of the univariate distributions to satisfy tests of normality. Multimodality and asymmetry characterized the distributions of these data, and the extent of non-normality appeared to have a practical significance that exceeded differences detectable under the statistical power of large-sample studies. The failure of these data sets to satisfy tests of univariate normality makes it highly unlikely that stricter tests of bivariate normality would hold if applied to studies that correlate any two of these variables in a meta-analysis.

The frequency and extent of non-normality in psychological research data are not known with much precision. However, a sample of the extant theoretical and applied literature that directly examined the issue does describe nonlinear and heteroscedastic distributions (Fisher, 1959; Gross & Fleischman, 1983, 1987; Guion, 1965; Lee & Foley, 1986; Linn, 1968; Micceri, 1989; U.S. Department of Labor, 1979). In this article, three realistic types of such distributions were simulated: fan, football, and sigmoid distributions. Study data were based on these distributions, and then study correlations were computed for subsequent meta-analysis.

Fan and football distributions describe heteroscedastic x-y relationships. The conditional mean of y given x is linear, but the conditional variance of y differs systematically across levels of x. For the football distribution, there is more variance in y at the midrange of x (i.e., greater accuracy in predicting y for very high and very low values of x and less accuracy around the midpoint).

Football distributions are found when test score data fit a binomial model better than a normal model (Lord & Novick, 1968). For the fan distribution, the variance of y increases toward the right tail of x (i.e., greater accuracy of prediction for lower values of x). S. H. Brown, Stout, Dalessio, and Crosby (1988) found that, by computing the variances of a sales production measure within deciles along an aptitude measure, a fan distribution described their data (i.e., the variances at the high end of x were greatest).

Sigmoid distributions describe slight nonlinearity in the tails of the x-y distribution but are homoscedastic (i.e., constant conditional variances of y given x across the range of x). For instance, there may be a certain level of general conscientiousness that is associated with maximally effective overall performance on the job, beyond which there is no added performance benefit. Conversely, a low level of general conscientiousness may be associated with minimally effective or ineffective overall job performance, even though lower levels of general conscientiousness are measurable. In between these levels is a range of predictive measurement. In applied psychological data, a slight flattening at the tails of an x-y distribution is not uncommon, and the sigmoid distribution is one example of this phenomenon.

Along with the fan, football, and sigmoid distributions, we also simulated the assumption-satisfying bivariate normal case to provide a benchmark with which to compare the other distributions.

Roth and Sackett (1991) suggested that, compared with other inferential statistical procedures, meta-analysis may actually be more negatively affected by assumption violations such as those created by the fan, football, and sigmoid distributions. This may be because in meta-analysis many statistical corrections are applied to a set of indexes of association (e.g., correlations, effect sizes) across studies, and each correction has its own set of assumptions (e.g., linearity, homoscedasticity). Because of these compound corrections, it is important to understand specific circumstances in which meta-analytic estimation procedures are robust, or are not robust, to the normality assumptions underlying them. In discussing the sampling error variance formula in meta-analysis, Hunter and Schmidt (1990b) broadly stated that "the statistics literature has found this formula to be fairly robust in the face of departures from normality" (p. 104). However, keep in mind that describing meta-analysis, or aspects thereof, as robust does not recommend indiscriminate application of a meta-analytic method to any data set; robustness implies parameter insensitivity to a particular set or sets of conditions (Bradley, 1978; Wainer, 1982).

We investigated how particular within-study violations of distributional assumptions affect meta-analytic findings under particular study and population conditions. Simulations were initially framed within a personnel selection context (e.g., range restriction effects were simulated), though the simulations potentially have wider application to other research domains in psychology where studies contributing to a meta-analysis violate linearity or homoscedasticity assumptions in the manner simulated herein.



Bias and Stability

After simulating meta-analyses under several realistic distributional conditions, it is important to gauge the performance of the resulting estimators. The performance of ρ̂ and σ̂ρ² was evaluated in two ways in this study. First, bias is the amount by which an estimator deviates from its true value on average. An unbiased estimator is equal to the population value on average (e.g., E(ρ̂) = ρ). Second, stability measures the variability of an estimator around its average (and possibly biased) value. There are various ways to measure stability; we report the standard deviation of the estimators across 100 meta-analytic replications within each condition. Several studies that have averaged meta-analytic results over replications did not report standard deviations or empirically derived standard errors of their estimates across replications (e.g., Callender, Osburn, Greener, & Ashworth, 1982; Hunter & Schmidt, 1994; Law, Schmidt, & Hunter, 1994; Mendoza & Reinhardt, 1991), though some have (e.g., Cornwell & Ladd, 1993; Osburn & Callender, 1992; Switzer, Paese, & Drasgow, 1992). The standard deviation provides an important additional piece of information: an estimator may be unbiased, where its expectation is equal to the population value, but the estimator may also be unstable, where in any given sample it has a nontrivial chance of being highly discrepant from the population value.

Reporting bias and stability independently gives more information about the accuracy of estimators than would reporting the mean squared error, which is a composite of the two (cf. Lindgren, 1993). The frequency with which ρ̂ and σ̂ρ² estimates fall outside different intervals around their true population values is also reported (e.g., how often is ρ̂ outside of the interval ρ ± .05?). This frequency count combines both bias and stability concerns, as would testing for the incidence of Type I or Type II errors or calculating mean squared error. However, it also shows how often meta-analytic estimators are discrepant from their true values by practically meaningful, and not just statistically significant, amounts (cf. Cohen, 1994). Our results give an idea of some critical values for the distributions of ρ̂ and σ̂ρ² under the null hypothesis that no moderator effects are operating across study correlations. Such critical values have been used, but not reported, in prior studies of meta-analysis.
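The three performance measures just described (bias, stability, and the rate of practically meaningful discrepancy) are simple to compute from a set of replicated estimates. The following Python sketch is ours, not the article's SAS code; the .05 tolerance mirrors the ρ ± .05 interval mentioned above.

```python
import numpy as np

def summarize_estimator(estimates, true_value, tol=0.05):
    """Bias, stability, and practical-discrepancy rate of an estimator
    across replicated meta-analyses (the article used 100 replications
    per condition)."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value           # average deviation from the parameter
    stability = est.std(ddof=1)              # SD of the estimator across replications
    pct_outside = 100.0 * np.mean(np.abs(est - true_value) > tol)
    return bias, stability, pct_outside
```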

Monte Carlo Simulations

The present Monte Carlo simulations offer the advantage of controlling several desired statistical properties. First, all constituent single-study subpopulation correlation coefficients, ρᵢ (where i = 1, ..., k), were equal within each meta-analysis. In other words, subpopulation correlations were not differentially influenced by moderating characteristics across studies; therefore the overall population correlation coefficient, ρ, had an associated true variance, σρ², of 0. Second, violations of distributional assumptions in within-study data were induced, not inferred, providing a direct test of their effects on meta-analysis. Finally, statistical artifacts were selected and incorporated into the study samples premeditatedly, based on statistical artifact distributions that have been used in a number of other studies (cf. Kemery, Mossholder, & Roth, 1987; Koslowsky & Sagie, 1994; Pearlman et al., 1980; Raju, Burke, Normand, & Langlois, 1991). Artifacts that were clearly not present in this study were also known (e.g., dichotomization of the independent variable, imperfect construct validity, transcription errors; Hunter & Schmidt, 1990b).

The present research contributes to the meta-analysis literature in the following ways:

1. We address the call of Mendoza and Reinhardt (1991) and Spector and Levine (1987) to examine the robustness of meta-analytic estimators. We shall risk stating the obvious because of previous misinterpretations of the meaning of robustness: Results must be qualified by our particular simulation conditions. This is a limitation only in that all possibilities regarding robustness could not be examined. We selected boundary conditions that, for the purposes of our study, were realistic, conservative, or otherwise reasonable.

2. We adopt Hunter and Schmidt's (1990b) meta-analysis method for correcting individual correlations instead of the method for correcting the mean and variance of the distribution of observed correlations with the mean and variance of the distribution of available statistical artifacts. Many meta-analysis simulations have adopted the latter method (e.g., Callender et al., 1982; Cornwell & Ladd, 1993; Law et al., 1994; Mendoza & Reinhardt, 1991; Roth & Sackett, 1991), and it is likely used more often in real-world meta-analyses, because the mean and variance of the distribution of study correlations are corrected by the accumulated distributions of reliability coefficients and range restriction information that is reported and available across studies. In our Monte Carlo simulations, however, all studies have known statistical artifact information, so all study correlations can be corrected individually, a more accurate approach to meta-analysis that is less confounded with our research questions on robustness.


3. This study is a first step in describing properties of particular null distributions for ρ̂ and σ̂ρ² based on bivariate normal data. Regarding ρ̂, there is virtually uniform agreement that these estimates are accurate on average. But few studies have documented the variation in ρ̂ that occurs across similar meta-analyses. Regarding σ̂ρ², several meta-analytic simulation studies have reported Type I and Type II error rates for deciding whether moderator effects exist under different meta-analytic methods and conditions (Cornwell & Ladd, 1993; Kemery et al., 1987; Osburn, Callender, Greener, & Ashworth, 1983; Roth & Sackett, 1991; Sackett, Harris, & Orr, 1986; Sagie & Koslowsky, 1993; Spector & Levine, 1987), but no studies provide the actual cutoff values at particular points on the null distribution of σ̂ρ² across replications (e.g., at the 95th percentile). As with reference tables for the Student's t distribution, distributions for ρ̂ and σ̂ρ² could be determined and tabled, giving ranges of values and their associated probability of occurrence under different null hypotheses, to help determine the extent and severity of Type I or Type II errors under different statistical tests and rules of thumb. From a practical standpoint, we may learn from simulating meta-analyses that the empirical range of probable values for ρ̂ or σ̂ρ² is either (a) so small, because of the combining of sample sizes across studies, that certain null hypotheses are rarely supported even when the average value of ρ̂ or σ̂ρ² (depending on the null hypothesis) is close to 0, or (b) so large that making a Type II error is almost inevitable, for example, when inferring that moderator effects are not present in a set of studies when in fact they are. Existing meta-analytic tests tend to make yes-no inferences about the presence of moderators (e.g., the Q statistic, the credibility interval of ρ containing zero, or Schmidt and Hunter's "75% rule"; cf. Koslowsky & Sagie, 1993) and consequently downplay, from both a statistical and practical standpoint, how severe the resulting Type I or Type II errors of inference may be. In the present study we report information on the empirical ranges of values for ρ̂ and σ̂ρ² within different meta-analytic conditions, a relatively clear and understandable way to begin addressing the severity-of-errors issue.

Method

Methodologically, this study used the random variable approach to populations. This asymptotic situation does not differ from the case of having a very large pool of potential job applicants (e.g., the set of all possible applicants who could take an ability test as part of a personnel selection process). Four types of bivariate distributions were simulated for each of three levels of true score correlations indexing the latent predictor-criterion relationship in a large population of potential job applicants (ρ = .25, .50, and .75). All study data were scaled so that μx = μy = 50 and σx = σy = 10. We obtained x from a random-number generator with a univariate normal distribution. For each of the four distributions, y was computed with a different expected conditional mean, μy|x, and conditional standard deviation, σy|x. We adapted the following formulas from Greener and Osburn (1980).

Bivariate Normal Distribution

The bivariate normal distribution was generated by setting (a) μy|x = μy + ρ(x − μx); (b) σy|x = σy(1 − ρ²)^(1/2). The bivariate normal distribution satisfied all distributional assumptions required of these meta-analyses (or formulas thereof) and served as a basis of comparison for the three distributions that follow.

Fan Distribution

The fan distribution was generated by setting (a) μy|x = μy + ρ(x − μx); (b) σy|x = σy(1 − ρ²)^(1/2)[0.5 + (j − 1)/9], where j, ranging from 1 to 10, denotes the jth interval of x. Therefore, the conditional standard deviation of y on x ranges from 0.5 to 1.5 times the unconditional standard deviation of y. The fan distribution is linear but violates homoscedasticity assumptions by increasing the conditional variance of y such that the data "fan out" as x increases.

Football Distribution

The football distribution was generated by setting (a) μy|x = μy + ρ(x − μx); (b) σy|x = σy(1 − ρ²)^(1/2)[0.5 + 0.25(j − 1) − (j > 5)(0.5(j − 6) + 0.25)], where j, ranging from 1 to 10, denotes the jth interval of x, and (j > 5) is a logical statement that equals 1 if true and 0 if false. The conditional standard deviation of y given x ranges from 0.5 to 1.5 times the unconditional standard deviation of y. The football distribution is also heteroscedastic but, like a football shape, the conditional variance of y is greatest at the mean of x, tapering off at both ends.

Sigmoid Distribution

The sigmoid distribution was generated by setting (a) μy|x = μy + β(x − μx) + A(x − μx)³, where β = ρ + 0.05 and A = −0.05/(3σx²); (b) σy|x = σy(1 − β²)^(1/2). The sigmoid distribution is homoscedastic but departs from linearity assumptions by flattening the tails of the bivariate distribution. A very slight S shape describes the function of the expected mean value of y given x. Scatterplots of study data under these four distributions are available from Frederick L. Oswald.
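As an illustration of these four generating models, the Python sketch below draws (x, y) true-score pairs using the conditional means and standard deviations above. It is our own minimal re-expression (the article's computations were in SAS), and the assignment of the decile index j from sample deciles of x is our assumption, because the article does not spell out the interval boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
MU, SIG = 50.0, 10.0

def simulate_xy(n, rho, dist="normal"):
    """Draw n true-score (x, y) pairs under one of the four bivariate
    distributions described above (adapted from Greener & Osburn, 1980)."""
    x = rng.normal(MU, SIG, n)
    j = np.ceil(10 * (x.argsort().argsort() + 1) / n)   # decile index of x, 1..10
    dev = x - MU
    if dist == "sigmoid":
        beta = rho + 0.05
        A = -0.05 / (3 * SIG**2)                 # cubic coefficient, flattens the tails
        mu_y = MU + beta * dev + A * dev**3
        sd_y = SIG * np.sqrt(1 - beta**2) * np.ones(n)
    else:
        mu_y = MU + rho * dev                    # linear conditional mean
        base = SIG * np.sqrt(1 - rho**2)
        if dist == "normal":
            sd_y = base * np.ones(n)
        elif dist == "fan":
            sd_y = base * (0.5 + (j - 1) / 9)    # SD grows from 0.5 to 1.5 times base
        else:  # football: SD greatest at midrange of x
            sd_y = base * (0.5 + 0.25*(j - 1) - (j > 5)*(0.5*(j - 6) + 0.25))
    return x, rng.normal(mu_y, sd_y)
```

For example, x, y = simulate_xy(500, .50, "fan") would produce one study's worth of true scores before any statistical artifacts are applied.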

For each of 12 conditions (4 bivariate distributions × 3 ρ levels), we generated and meta-analyzed four sets of simulated studies (k = 10, 20, 50, 100). Applicants were drawn from the population such that the desired number of incumbents for each study was obtained after direct range restriction was applied (i.e., N = 50, 100, 200, or 500; see Table 1).


Table 1
Distributions of Predictor and Criterion Reliability, Range Restriction, and Sample Size

Predictor reliability   Criterion reliability   Range restriction (a)   Sample size
r_xx    p               r_yy    p               u       SR     p        N (b)   p
.90     .1875           .90     .03             1.000   1.00   .05      50      .25
.85     .3750           .85     .04             .701    .70    .11      100     .35
.80     .1875           .80     .06             .649    .60    .16      200     .25
.75     .1250           .75     .08             .603    .50    .18      500     .15
.70     .0500           .70     .10             .559    .40    .18
.60     .0500           .65     .12             .515    .30    .16
.50     .0250           .60     .14             .468    .20    .11
                        .55     .12             .411    .10    .05
                        .50     .10
                        .45     .08
                        .40     .06
                        .35     .04
                        .30     .03
p-weighted mean: .8088          .6000           .5945 (.48)             172.5

Note. p denotes the probability of being sampled from the artifact distribution. SR = selection ratio associated with each u value. Adapted from "Validity Generalization Results for Tests Used to Predict Job Proficiency and Training Success in Clerical Occupations," by K. Pearlman, F. L. Schmidt, and J. E. Hunter, 1980, Journal of Applied Psychology, 65, Tables 1-3, pp. 375-376. In the public domain.
(a) Number in parentheses denotes the u-equivalent selection ratio, where u is the ratio of estimated unrestricted to range-restricted standard deviations of x.
(b) Sample size of incumbents, obtained after direct range restriction on x.

The expected median sample size for these simulations was 100, larger than the 68 often cited as the median sample size in empirical validation studies (Lent, Aurbach, & Levin, 1971), reflecting the hope that contemporary researchers in applied psychology will become more aware of the impact of sampling error when designing criterion-related validation studies, being interested in obtaining both statistical power and publishable findings (N. Schmitt, personal communication, February 1996). This hope may be less well founded in other specialties within industrial and organizational psychology (Mone, Mueller, & Mauland, 1996). In any case, results from the present simulations should be evaluated with these larger study sample sizes in mind. Meta-analytic results are less stable for meta-analyses based on smaller sample size studies, all else being the same.

All simulated x and y scores were independent across studies within each meta-analysis. By sampling independently from the predictor and criterion reliability distributions of Table 1, random error scores in the predictor and criterion were added to the true scores x_t and y_t for each group of incumbents:

x = 50(1 − √r_xx) + x_t√r_xx, with SD of 10√(1 − r_xx), (1)

y = 50(1 − √r_yy) + y_t√r_yy, with SD of 10√(1 − r_yy). (2)
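In code, these two artifact steps (fallible observed scores per Equations 1 and 2, followed by direct range restriction on x) might look as follows. This is a hedged Python sketch rather than the article's SAS program; the function and parameter names are ours, and top-down selection on observed x is our reading of "direct range restriction" here.

```python
import numpy as np

rng = np.random.default_rng(1)

def apply_artifacts(x_t, y_t, rxx, ryy, n_keep):
    """Add measurement error per Equations 1-2, then directly
    range-restrict on x by keeping the top n_keep observed scorers."""
    x = 50*(1 - np.sqrt(rxx)) + x_t*np.sqrt(rxx) \
        + rng.normal(0, 10*np.sqrt(1 - rxx), x_t.size)   # Equation 1
    y = 50*(1 - np.sqrt(ryy)) + y_t*np.sqrt(ryy) \
        + rng.normal(0, 10*np.sqrt(1 - ryy), y_t.size)   # Equation 2
    keep = np.argsort(x)[-n_keep:]                       # top scorers on x
    return x[keep], y[keep]
```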

Reliability values were essentially based on very large populations, as one might find in a test manual. Sample-specific reliabilities may be more appropriate when one suspects moderators, but that issue is not taken up here, where no-moderator conditions are examined and population reliabilities give more accurate results (Roth & Sackett, 1991).

After the sampling error variance, measurement unreliability, and direct range restriction artifacts had been applied to true score job incumbent data sets from one of the four distributions above, we used the Hunter and Schmidt meta-analytic procedure for correcting correlations individually (Hunter & Schmidt, 1990b, Ch. 3). As an important aside, validities were corrected for measurement error in the predictor. Personnel psychologists, though, often want to know the operational validity of a selection measure. Operational validities reflect that latent criterion constructs of interest (e.g., overall job performance) are predicted with fallible predictor measures used in selection. Given a particular selection measure, its operational validity is estimated from a meta-analysis by attenuating the estimated true or latent population validity by that measure's reliability.
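For readers who want the flavor of the individual-correction step, the sketch below corrects one observed study correlation for direct range restriction and then for unreliability in both measures. It is a simplified illustration under our assumptions (u taken as the restricted-to-unrestricted SD ratio of x, as the values in Table 1 suggest), not a reproduction of the full Hunter and Schmidt (1990b, Ch. 3) procedure, which is more careful about the order of corrections and about whether reliabilities are restricted- or unrestricted-group values.

```python
import numpy as np

def correct_r(r_obs, rxx, ryy, u):
    """Correct one study r for direct range restriction on x
    (Thorndike Case 2), then disattenuate for predictor and
    criterion unreliability."""
    U = 1.0 / u                                            # unrestricted/restricted SD ratio
    r_rr = U * r_obs / np.sqrt(1 + (U**2 - 1) * r_obs**2)  # range restriction correction
    return r_rr / np.sqrt(rxx * ryy)                       # unreliability correction
```

The corrected correlations (and their adjusted sampling-error variances) would then feed the two estimators ρ̂ and σ̂ρ² discussed in the introduction.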

Resulting estimates of interest were the estimated population correlation between the true scores, ρ̂, and the estimated "true" variance of the population correlation, σ̂ρ². These estimates are accurate to the extent that ρ̂ equals the actual value ρ and σ̂ρ² equals 0. That is, the no-moderator case was simulated, where all observed study correlation coefficients have corresponding subpopulation correlations ρᵢ equal to the overall population correlation ρ. The subpopulation-population distinction should always be made but is obviously more important when a meta-analysis detects moderator effects. Only by the benefit of simulation is it known a priori that moderator effects do not exist.

We replicated meta-analyses 100 times for each ρ and k condition within each of the four distributions to better understand the bias and stability of ρ̂ and σ̂ρ². The average ρ̂ and σ̂ρ² across replications indicate the bias of these estimators (i.e., the average accuracy across meta-analyses within each condition), and the standard deviations indicate their stability (i.e., the dispersion of obtained values across meta-analyses within each condition). We calculated additional variability measures for the bivariate normal case, to understand more about the accuracy of ρ̂ and σ̂ρ² when all distributional assumptions are satisfied.


These measures were calculated not only for the cases in which sample sizes varied across studies, as in the previous simulations (N̄ = 172.5), but also for additional meta-analyses in which the sample size per study was constant (all Ns = 50, 100, 200, or 500). This way, the influence of increasing the sample size was separated from the influence of increasing the number of studies. All computations were in the SAS (1995) programming language.
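Putting the pieces together, one replication-loop skeleton for a single condition might look like the Python sketch below. It uses the hypothetical helpers sketched earlier (simulate_xy, apply_artifacts, correct_r, bare_bones_meta), samples predictor reliability from the Table 1 distribution, and, for brevity, glosses over details the article's SAS program would handle, such as sampling the remaining artifacts per study and adjusting the sampling-error terms for corrected correlations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Predictor reliability distribution from Table 1 (criterion reliability,
# range restriction, and N would be sampled from Table 1 in the same way).
RXX = np.array([.90, .85, .80, .75, .70, .60, .50])
RXX_P = np.array([.1875, .3750, .1875, .1250, .0500, .0500, .0250])

def run_condition(rho, k, n_keep, dist, reps=100):
    """One simulation condition: `reps` replicated meta-analyses of k
    studies each; returns the distributions of rho_hat and var_hat."""
    rho_hats, var_hats = [], []
    for _ in range(reps):
        rs, ns = [], []
        for _ in range(k):
            x_t, y_t = simulate_xy(4 * n_keep, rho, dist)  # applicant pool
            rxx = rng.choice(RXX, p=RXX_P)                 # sampled artifact value
            ryy = 0.60                                     # Table 1 weighted mean, for brevity
            x, y = apply_artifacts(x_t, y_t, rxx, ryy, n_keep)
            u = x.std() / 10.0                             # restricted/unrestricted SD of x
            rs.append(correct_r(np.corrcoef(x, y)[0, 1], rxx, ryy, u))
            ns.append(n_keep)
        est = bare_bones_meta(rs, ns)
        rho_hats.append(est[0])
        var_hats.append(est[1])
    return np.array(rho_hats), np.array(var_hats)
```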

Results

Bias of ρ̂ and σ̂ρ²

For the bivariate normal case, the Hunter and Schmidt (1990b) meta-analytic method estimated ρ accurately on average (see Table 2). For each ρ and k condition, the magnitude of the bias in ρ̂ was no larger than .007 across 100 meta-analyses. The true value of ρ was always captured in 95% confidence intervals constructed around the mean ρ̂. (The Appendix provides justification for using confidence intervals.) Results were accurate for the fan and football distributions as well (see Tables 3 and 4), although the trend in the bias was a very slight but consistent underestimation of ρ. Bias ranged from -.004 to -.018 across conditions. For the sigmoid distribution, the bias was slightly greater in absolute terms, ranging from -.019 to -.034 across conditions, with an average bias of -.026. For all conditions across these four distributional types, ρ̂ always had considerably less bias than the difference between the uncorrected mean correlation, r̄, and ρ (cf. Greener & Osburn, 1980; Hunter & Schmidt, 1990b). Bias in ρ̂ appeared to be independent of the number of studies a meta-analysis contained.

Regarding σ̂ρ², its true value was 0 across all meta-analytic conditions because there were no moderator effects. Therefore, the average σ̂ρ² across 100 replications represented the overall bias in σ̂ρ²; if the average σ̂ρ² was 0, then the bias was 0. For all four distributions, the bias was negative across most conditions. Ninety-five percent confidence intervals around these negative values did not usually capture the true value, and the pattern in which this occurred did not appear to depend on distributional type, ρ, or k. It should be noted that these negative average variance estimates are not miscalculations; they are analogous to the negative variance estimates sometimes obtained in analysis of variance models (cf. Hunter & Schmidt, 1990b). When the negative values were set to 0, as is sometimes done in practice, bias became positive, and stability always increased (see Tables 2-5).

To understand more clearly the extent of the negative bias found (or the extent to which negative values were set to 0), we report in Table 6 the percentage of negative values of σ̂ρ² for ρ = .50, with all meta-analyses based on bivariate normal study data with fixed N and k. The last row summarizes meta-analyses using sample size distributions from Table 1, where sample sizes were not fixed across studies and average N was 172.5. Across all conditions, σ̂ρ² was negative more than the 50% one would expect if there were no bias. In Table 6 the mean value was about 66%, and the range was 53-78%. The percentage of negative values did not appear jointly or marginally dependent on N and k, but percentages did increase with larger values of ρ, averaging about 75% when ρ equaled .75.

("4

r~

t~

I

I

~ ' O ~

I

¢e) 1.~ t-, q

I

I

I

~ e q ~ I"

I

I'

I

I

I

I

I

I

r-:. . ~ . . I I

~'q t'q

I I

V I

~'3 r¢3

I

~q

I

I

I I

I I

~ e q

I I

%

I I

_= r.

eq~

o

G~

O

i . ~o

>

..=

e~

8

~e

O e~

O

%

o

t ~

a

r~

r ~

I

I

I

I"

I

i¢~ tr~ c-4

I

I

I ] V

I I

~ LP~

~ t '~

I I

I I

I I

I

~b

p.

8

C~

'S

p.

.=.

0

c~

o

p. e-

£

~J

a

c~

~D

I

I

I

I

[

I

I

I

I

I

I

[

I

I

~ " o ~ I

~§~

c-,t t"q

I

I

~-~ ~o

I

[

I

I

c~ [

c b

p-

.=.

8

'S

- i

c~

i .

q~

8

0

e- Q



Stability of ρ̂ and σ̂ρ²

Insight into the stability of ρ̂ and σ̂ρ² is gained by calculating and comparing across conditions their standard deviations based on 100 meta-analytic replications. Available computer time allowed 100 meta-analyses to be run within each ρ and k condition, so the standard deviations are based on 100 particular meta-analytic values of ρ̂ and σ̂ρ². The precision of the standard deviation should be high enough to allow rough assessment of the relative differences in the stability of both estimators (a) between distributions, (b) between levels of the population correlation, and (c) between the numbers of studies, k.

Stability results were similar across all four distributions (see Tables 2-5). The stability of both ρ̂ and σ̂ρ² improved with increasing ρ; higher ρ values led to more accurate results for any individual meta-analysis. Stability also improved when more studies were added to a meta-analysis, though the benefit was not as great when a meta-analysis had many studies to begin with (i.e., the standard deviation of the estimators marginally decreased as k increased). Across all four distributions, ρ̂ was much less stable than the uncorrected r̄, as a rule. The standard deviation of ρ̂ was at least twice as large as the standard deviation of the uncorrected r̄ when ρ = .25, about 50% greater when ρ = .50, and approximately equal when ρ = .75. In other words, there is a psychometric price paid for correcting the data for statistical artifacts to decrease systematic error in r̄: The unsystematic error increases, although it is proportionately less than the decreased bias of ρ̂ (Bobko, 1983; Hunter & Schmidt, 1990b).

Taking the bivariate normal case as a straightforward example, ρ̂ was virtually unbiased but was less stable than the uncorrected r̄. ρ̂ had greater variability but was centered on the true value ρ, whereas r̄, though with less associated variability, was centered on a gross underestimate of ρ. r̄ was more stable, but its value was less than ρ across all conditions, to be expected because of the statistical artifacts introduced into the true score study data. The stability of ρ̂ tended to converge with the stability of r̄ with increasing values of ρ, with more studies in a meta-analysis, and with less statistical artifact correction applied to the data. Table 2 shows that in the bivariate normal simulations, where all statistical assumptions were met, any ρ̂ from an individual meta-analysis could have been considerably off the mark in estimating ρ, especially for low values of ρ and k.

[Table 5, which reports the bias and stability of ρ̂ and σ̂ρ² for the sigmoid distribution, appears here in the original; its values are not legible in this transcript.]


Table 6
Percentages of Meta-Analyses With Values of σ̂ρ² < 0 for ρ = .50, Bivariate Normal Data

N            k = 10   k = 20   k = 50   k = 100
50           70       53       72       62
100          68       75       64       58
200          69       66       66       64
500          64       60       62       78
N̄ = 172.5    71       61       62       67

Note. True value of σρ² = 0 across meta-analytic conditions. Percentages are based on 100 replications for each (N, k) condition with bivariate normal study data.


Tables 2 - 4 show that stability characteristics for &2 were quite similar to those of ~ across distributions, sug- gesting robustness across the particular bivariate distribu- tions we simulated. The notable exception was the stabil- ity of ~2 for the sigmoid distribution (Table 5), where its standard deviation across replications was much larger than in the other three distributions, both in relative and in absolute terms. Although the sigmoid exhibits slight departures from linearity that are often not detectable, either by an F test for linearity of regression (Greener & Osburn, 1980) or by a scatterplot, the sigmoid distribution resulted in much less stability in ~p ^ 2 than for the bivariate normal, fan, and football cases. In the next section, stabil- ity issues are further discussed for the bivariate normal case alone, although it should be recognized that even slight nonlinearities engender greater stability problems for ~ 2 values.

Combining Bias With Stability: The Absolute Accuracy of ρ̂ and σ̂ρ² in the Bivariate Normal Case

Tables 7 and 8 focus on the absolute accuracy of ρ̂ and σ̂ρ² for the assumption-satisfying bivariate normal case with no moderators and ρ = .50.¹ Because substantive psychological theory and application help determine how far meta-analytic estimates can deviate from their true values before it becomes a real problem, information at different levels of accuracy is provided. Table 7 shows the percentage of meta-analyses within each (N, k) condition that fell outside the ranges ρ ± .025, ρ ± .05, and ρ ± .10 (i.e., spans of .05, .10, and .20 correlation units, respectively). For example, with k = 20 and N = 100, 57% of meta-analyses had values of ρ̂ that were beyond the ρ ± .025 range; 22% had values beyond the ρ ± .05 range. More liberal ranges, of course, had lower percentages of outliers.

In general, higher values of both N and k led to greater absolute accuracy. In fact, N × k served as a good informal standard for acceptable accuracy levels for ρ̂, given our simulations with no moderator effects. (The average-N cases had results that were in line with the fixed-N cases; therefore, average N could be used in place of N.) For example, only one study was outside the range of ρ ± .05 when N × k ≥ 10,000. On the other hand, when N × k ≤ 1,000, the variation in ρ̂ was relatively large.

Again, we replicated each meta-analytic condition 100 times, creating a distribution of ρ̂ for each condition. The lower section of Table 7 lists the symmetric intervals between which 95% of the ρ̂ values fell. These values, or more extreme values, would be obtained under similar meta-analytic conditions about 5% of the time. The 95% intervals were larger with smaller k and N. When N × k < 2,500, the intervals were wider than .15, fairly wide by most standards. In contrast, the range of the interval was .04 when k = 100 and each study N was 500.
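The endpoints in the lower panel of Table 7 are simply empirical percentiles of the 100 replicated ρ̂ values. A minimal Python sketch (ours, assuming the symmetric 2.5% trimming described in the table note) is:

```python
import numpy as np

def empirical_interval(estimates, coverage=0.95):
    """Symmetric empirical interval across replicated estimates:
    cut (1 - coverage)/2 of the meta-analyses from each tail."""
    tail = 100 * (1 - coverage) / 2
    return np.percentile(estimates, [tail, 100 - tail])
```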

Similar trends for σ̂ρ² are found in Table 8. σ̂ρ² is used in meta-analysis to establish credibility intervals: intervals built around ρ̂ to describe the range, and by inference the extent, of moderator effects not due to statistical artifacts. Wide credibility intervals suggest that moderator effects exist, but what is "wide enough" has largely remained unspecified in the literature, except to say that generalizability may not hold when the credibility interval contains a correlation value of 0 (cf. Whitener, 1990). Statistical tests simply indicating that moderators exist do not speak to the magnitude of the moderator effects the way the magnitude of a credibility interval width would. All of our simulations were based on one overall population correlation with no moderator effects, so the true value of σρ² = 0, and the 90% credibility "interval" is of zero width. σ̂ρ² was negative in more than 50% of the cases within meta-analytic conditions, and in those cases a zero credibility interval width would be correctly inferred from the statistic. However, Table 8 gives additional information beyond what happens on average. In most of the bivariate normal conditions with ρ = .50, about 20%-40% of meta-analyses have credibility intervals greater than .10 correlation units wide, and about 10%-30% of meta-analyses have credibility intervals greater than .20 correlation units wide. These percentages increased when N × k was smaller. They also increased when ρ = .25 and k was small, and they decreased when ρ = .75 and k was large.
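The article does not print the credibility interval formula, but the conventional construction (cf. Whitener, 1990) places a normal-approximation interval around ρ̂. The sketch below assumes that form, with negative variance estimates set to 0 so that the interval collapses to zero width.

```python
import numpy as np

def credibility_interval_90(rho_hat, var_rho_hat):
    """Conventional 90% credibility interval around rho_hat (±1.645 SD);
    a negative variance estimate is treated as 0, giving a zero-width
    interval."""
    sd = np.sqrt(max(var_rho_hat, 0.0))
    return rho_hat - 1.645 * sd, rho_hat + 1.645 * sd
```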

The lower section of Table 8 presents the values at the upper fifth percentile of the distribution of 90% credibility interval widths, widths based on σ̂ρ², for 100 meta-analytic replications within bivariate normal conditions and ρ = .50. Although the true credibility interval width was 0 across all conditions, the upper 5% values ranged between .21 and .83 correlation units, except for the two extreme meta-analytic conditions with k ≥ 50 and N = 500 per study. The average interval width was .35 correlation units across all conditions. In other words, 5% of the time the null distributions of σ̂ρ² and their associated credibility intervals could be quite large. We report the upper 5% values, but perhaps one would rather have a higher limit than 95% to make the strong inference that no moderators are operating in the study data, given the policy and research implications of such a conclusion. The 99th percentile of the σ̂ρ² distribution would have resulted in even larger credibility interval widths across meta-analytic conditions.

¹ Space limitations permit presentation of tabled results on accuracy only for ρ = .50; it is noted in the text when results systematically differ for other values of ρ. More detailed simulation results are available from Frederick L. Oswald.


Table 7
Variability of ρ̂ Across Meta-Analyses for ρ = .50, Bivariate Normal Data

Percentages of meta-analyses with ρ̂ outside the ranges ρ ± .025, ρ ± .05, and ρ ± .10 (ranges = .05, .10, and .20)

            k = 10          k = 20          k = 50          k = 100
N           .05  .10  .20   .05  .10  .20   .05  .10  .20   .05  .10  .20
50           73   51   28    76   48   10    47   23    2    34    7    0
100          69   45    9    57   22    1    31    7    0    25    0    0
200          45   21    2    41   14    1    20    0    0    11    0    0
500          37    8    0    18    1    0     5    0    0     2    0    0
N̄ = 172.5    61   26    4    49   22    1    26    1    0    16    1    0

Values of ρ̂ at the endpoints of the empirical 95% interval

            k = 10      k = 20      k = 50      k = 100
N           LB    UB    LB    UB    LB    UB    LB    UB
50          .26   .67   .36   .62   .40   .59   .44   .55
100         .32   .65   .43   .59   .45   .56   .45   .54
200         .41   .57   .43   .57   .46   .53   .47   .53
500         .44   .56   .46   .54   .48   .53   .48   .52
N̄ = 172.5   .40   .62   .42   .58   .45   .54   .47   .54

Note. 95% empirical intervals are symmetric, excluding 2.5% of meta-analyses at the upper and lower tails of the ρ̂ distribution within conditions. Percentages are based on 100 replications for each (N, k) condition with bivariate normal study data. LB = lower bound of the empirical 95% confidence interval; UB = upper bound of the empirical 95% confidence interval.


Discussion

The following discussion is in the context of the Hunter and Schmidt (1990b) meta-analytic method for correcting study correlations individually. Other methods under similar simulation conditions could have yielded meta-analytic estimators that have different robustness, bias, and stability characteristics. In the present no-moderator simulations of the fan, football, and sigmoid distributions, meta-analytic population correlations (ρ̂) were rather robust but slightly biased downward. Across all distributions, including the bivariate normal distribution, the subpopulation variance estimates (σ̂ρ²) were consistently negatively biased (i.e., less than 0), or were set to 0 more than 50% of the time. Consistent with other meta-analysis studies (e.g., Kemery et al., 1987), this leads one to suspect increased Type II errors in conditions where moderator effects exist.

More important, across all distributions, including the bivariate normal case, we found that individual meta-analyses often yielded estimates of the population correlation that did not equal the true value and estimates of the true subpopulation variance that did not equal 0. Sometimes these discrepancies were large in absolute terms, especially with lower values of N × k. The variation in ρ̂ and σ̂ρ² was also related to the particular statistical artifact values for range restriction and measurement unreliability that were randomly sampled from the artifact distributions. Relevant psychological theory, methodological design and quality of the meta-analysis, and the practical implications drawn from meta-analytic results all would help determine whether the ρ̂ and σ̂ρ² values are meaningful and useful and how much discrepancy in the estimators is tolerable.


Table 8
Variability of σ̂ρ² Across Meta-Analyses for ρ = .50, Bivariate Normal Data

Percentages of meta-analyses with 90% credibility intervals greater than ρ ± .05 and ρ ± .10 (ranges = .10 and .20)

            k = 10      k = 20      k = 50      k = 100
N           .10   .20   .10   .20   .10   .20   .10   .20
50           20    29    43    38    27    21    37    30
100          28    21    23    20    33    21    36    18
200          29    17    28    17    24    11    25     7
500          25    11    29     7    19     0     6     0
N̄ = 172.5    28    18    36    20    30    12    24     3

90% credibility interval range at the upper fifth percentile of the σ̂ρ² distribution

N           k = 10   k = 20   k = 50   k = 100
50          .83      .71      .40      .43
100         .57      .42      .32      .30
200         .30      .29      .23      .22
500         .22      .21      .16      .11
N̄ = 172.5   .46      .36      .28      .20

Note. Ranges are in correlation units. The true credibility interval range was 0 across meta-analytic conditions because the true value of σρ² was 0 (i.e., no moderator effects). Percentages are based on 100 replications for each (N, k) condition with bivariate normal study data.


Regarding the stability of meta-analytic results, consider the bivariate normal case, where linearity and homoscedasticity assumptions were met. The average value of ρ̂ across 100 meta-analyses was accurate in these one-population cases but had more associated variability than the uncorrected r̄. Any single meta-analysis could still have resulted in a value of ρ̂ that was fairly discrepant from the population value. The standard deviation of ρ̂ increased as the number of studies in each meta-analysis decreased. The potential for inaccuracy in ρ̂ would have been even more pronounced for meta-analyses in which N × k was smaller than simulated here, all else being the same.

The high percentage of negative values of σ̂ρ² under no-moderator conditions generally leads one to the correct conclusion: No moderator effects exist in the meta-analytic data. The negative bias in σ̂ρ² may also hold when moderators exist, however, increasing the incidence of Type II error (i.e., falsely concluding that the true variance is 0 and moderators do not exist in a set of meta-analytic data). On the basis of our null-hypothesis findings, the 90% credibility interval (cf. Whitener, 1990) obtained in a meta-analysis of correlations with nonzero moderator effects may have to be quite large to infer with a reasonable level of confidence that there are moderator effects operating.

The variability found in meta-analytic estimators leads to two conclusions. First, for N × k ≤ 5,000, it appears that the use of credibility intervals may not be appropriate in cases where there are no moderator effects. An accurately estimated credibility interval "width" of 0 implies that subpopulation correlations could not deviate from the estimate of the population correlation, yet the estimate of the population correlation itself could be inaccurate. Credibility intervals, when accurate, may be more useful when built around population correlations whose accuracy is contained well within the interval itself. The present simulations had a nontrivial chance of obtaining very large nonzero values of σ̂ρ², making the credibility interval highly inaccurate, although conservative in a sense. Unlike most hypothesis testing in the social sciences, it is our opinion that Type II errors are more serious than Type I errors in the search for moderator effects in meta-analysis. If the σ̂ρ² value leads an investigator to search for substantive moderators to explain variation across studies, this is better for applied psychology than thinking that no substantive moderators could possibly change the results from a meta-analysis. Problems with σ̂ρ² and the use and theoretical interpretation of credibility intervals would likely be more complicated in cases where there was nonartifactual variance across subpopulation correlations (i.e., the heterogeneous case; Hunter & Schmidt, 1990b) because of greater sampling variability within, and nonrepresentative sampling across, the study subpopulations.

In summary, any omnibus statement about the existence or nonexistence of moderator effects on an x-y relationship should, in most cases, be influenced by the theoretical and practical implications of making such a statement and not quantitatively by a credibility interval. No empirical data could justify an omnibus statement about moderator effects; no data set can explore every possibility. Rather than relying entirely on credibility intervals or homogeneity tests to dictate the search for moderators, substantive moderators of interest should be directly investigated and the results interpreted in light of the relevant psychological context. Credibility intervals may not prove useful, except possibly to indicate that some studies within a meta-analysis differ in their true subpopulation correlation values in some unspecified manner. Future research on the use of credibility intervals should be compared with our findings to test this assertion further.

Second, variability in meta-analytic estimators was much greater when N × k was small, so meta-analytic conclusions should be tempered with the realization that such estimates may be inaccurate in these situations.


Having a small number of studies in a meta-analysis is not uncommon, for example, after stratifying a set of studies into smaller subgroups on the basis of moderator variables. An informal search of the Journal of Applied Psychology, Personnel Psychology, and Psychological Bulletin in the past year and a half found 137 meta-analytic correlations based on fewer than 20 study correlations and a smaller average sample size than the average of 172.5 in our robustness simulations (Tables 2-5). Another case in which having a small number of studies per meta-analytic correlation is a distinct possibility is when many meta-analytic population correlations are combined to create a correlation matrix for the purpose of fitting an extended confirmatory factor analysis model or path analysis model (cf. Shadish, 1996; Viswesvaran & Ones, 1995). Many published studies that have attempted this forward-thinking methodological approach have used meta-analytic correlations based on fewer than 20 study correlations (e.g., S. P. Brown & Peterson, 1993; Harris & Rosenthal, 1985; Hom, Caranikas-Walker, Prussia, & Griffeth, 1992; Peters, Hartke, & Pohlmann, 1985; Premack & Hunter, 1988; Schmidt, Hunter, & Outerbridge, 1986), many with average sample sizes smaller than our mixed sample size cases with N̄ = 172.5. When the ρ̂s in the meta-analytic correlation matrix are based on a small number of studies and small sample sizes with N̄ < 172.5, then estimates can deviate from the desired population values to a greater extent than was reported in Table 2 and Tables 7 and 8 for bivariate normal data. Conducting subsequent factor analyses or path analyses with meta-analytic correlations would therefore be difficult or risky. Additional problems in dealing with meta-analytic correlation matrices include obtaining nonpositive definite correlation matrices (Bock & Petersen, 1975), estimating communalities, and estimating the standard errors of factor loadings or path coefficients (Viswesvaran & Ones, 1995).

Additionally, if nonzero population variance exists in any of the ρ̂s (i.e., if there are moderator effects), then the direct paths involving those ρ̂s must be misspecified. One suggested remedy is to identify and include moderators in the model that subdivide studies so that the meta-analytic correlations in each subset have a true variance of 0 (Viswesvaran & Ones, 1995). This may be impractical or infeasible, however, because the moderator effects across studies may not be shared, or cannot be identified, or both, because of insufficient study information. Overall, these problems may be viewed more as challenges than as impediments in the use of meta-analysis for testing theories by means of factor analysis or path analysis. At present, however, calculating correlations based on large samples from well-defined populations is the most desirable option.

Limitations

The many replications of simulated meta-analyses within conditions give one some confidence in the findings presented, but a few caveats are in order. The Hunter and Schmidt (1990b) method for correcting correlations individually was not the only one we could have used to pursue our research question, although it is one of the few procedures that explicitly address all the statistical artifacts in which we were interested. Because researchers have used Hunter and Schmidt procedures in numerous meta-analytic studies, it was sensible for us to do the same. Even if our results are procedure dependent, they nonetheless are relevant to a large body of meta-analytic research. Alternative meta-analytic methods could be pitted against one another to further our understanding of the robustness of meta-analysis results and the extent to which different methods yield unbiased and stable results.

In addition, although these simulated studies were generated from statistical artifact distributions that were meant to be realistic (e.g., higher modal predictor reliabilities than criterion reliabilities), these distributions may be modified. Other distributions may be possible or even probable, especially in light of the particular psychological domain being meta-analyzed. For instance, current general cognitive ability measures are typically more reliable than are current leadership measures, so if leadership measures were used as predictors, then the mean value of that reliability distribution would be lower. Different statistical artifact distributions do not appear to affect the bias in results from various meta-analysis methods (cf. Mendoza & Reinhardt, 1991), unless the assumed distributions are incongruent with the actual distributions across studies (Paese & Switzer, 1988; Raju, Pappas, & Williams, 1989).
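As an illustration of how artifact distributions of this general kind enter a simulation, the sketch below draws predictor and criterion reliabilities from distributions whose modes differ in the direction just described and attenuates an assumed true correlation accordingly. The beta-distribution parameters are arbitrary placeholders, not the distributions used in our simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.30   # assumed true population correlation
k = 20       # number of simulated studies

# Placeholder artifact distributions: predictor reliabilities (rxx) with a
# higher mode than criterion reliabilities (ryy), as discussed in the text.
rxx = 0.5 + 0.5 * rng.beta(9, 2, size=k)   # roughly .5 to 1.0, mode high
ryy = 0.4 + 0.5 * rng.beta(5, 3, size=k)   # roughly .4 to .9, mode lower

# Classical attenuation: each study's operational population correlation.
rho_attenuated = rho * np.sqrt(rxx * ryy)
print(np.round(rho_attenuated, 3))
```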

Future Research

For each simulated study that contributed to a meta-analysis, statistical artifacts were allowed to vary in a specified manner independently of one another. Though this design appears to reflect the behavior of real-world study data, future researchers may wish to consider and possibly simulate study data in which artifacts are instead correlated with one another (e.g., nonlinearity and heteroscedasticity; Gross, 1982). Additional or alternative statistical artifacts could be introduced into these simulations as well, such as dichotomization of the criterion variable, as discussed by Hunter and Schmidt (1990a). Predictor dichotomization is possible as well but is uninteresting when the range is restricted on the dichotomized variable (i.e., all applicants have the same predictor score) and is unrealistic when it is not (i.e., selection on this variable is completely random). Another common, and therefore important, statistical artifact to consider is the effect of indirect range restriction, or incidental selection, in which a positive bias in estimating ρ often occurs (Johnson & Sager, 1991) and the sampling error variance is often underestimated (Aguinis & Whitehead, 1997).
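The criterion dichotomization artifact mentioned above is easy to demonstrate. In this sketch (our illustration, with an assumed ρ = .40 and a large placeholder sample), splitting the criterion at its median turns the Pearson correlation into a point-biserial correlation, attenuating it by a factor of about √(2/π) ≈ .80.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.40, 100_000  # assumed true correlation, placeholder sample size

xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x, y = xy[:, 0], xy[:, 1]

y_dich = (y > np.median(y)).astype(float)   # median split on the criterion

r_full = np.corrcoef(x, y)[0, 1]
r_dich = np.corrcoef(x, y_dich)[0, 1]
print(f"continuous criterion:   r = {r_full:.3f}")
print(f"dichotomized criterion: r = {r_dich:.3f} "
      f"(about sqrt(2/pi) = {np.sqrt(2 / np.pi):.3f} of the original)")
```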


Future research may be productively directed at the study of other types of nonlinear and heteroscedastic distributions in psychological data. Meta-analytic estimates in the present simulation were variable but only slightly biased downward for study data following the sigmoid distribution. Greater deviations from linearity, combined with direct range restriction, will influence meta-analyses of correlations in unpredictable ways if one does not know beforehand the functional form of the nonlinearity along the entire relevant range of measurement (Gross & Fleischman, 1983, 1987). In their large-sample study relating biodata aptitude ratings to a sales productivity measure, S. H. Brown et al. (1988) commented that

With a sample size of 1,600 instead of the current 16,230, the curvilinearity [in their study] would not have been statistically significant. The impact seems trivial, but when minimal curvilinearity exists along with truncation, the potential for misleading conclusions is substantial. (p. 741)

Their conclusions about the combined effects of range restriction and nonlinearity can be extended to the bias one might find in a meta-analysis of correlations based on nonlinear data. The small amount of bias found for the sigmoid distribution, which departed only slightly from linearity, is in line with this possibility. Therefore, investigating bias in meta-analyses of psychological data characterized by nonlinearity of different functional forms may prove useful.
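The interaction of mild nonlinearity with truncation can also be demonstrated directly. In the sketch below, the tanh functional form, the error variance, and the top-30% selection ratio are all illustrative assumptions; the point is only that a relation that appears nearly linear over the full range behaves differently once the predictor is truncated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
y = np.tanh(1.2 * x) + rng.normal(scale=1.0, size=n)  # mildly sigmoid relation

r_full = np.corrcoef(x, y)[0, 1]

# Direct range restriction: retain only the top 30% on the predictor.
keep = x >= np.quantile(x, 0.70)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"full-range r       = {r_full:.3f}")
print(f"range-restricted r = {r_restricted:.3f}")
# A linear range restriction correction applied to r_restricted assumes
# linearity and homoscedasticity; under the sigmoid form above it can
# over- or undershoot the full-range value depending on the cutoff.
```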

Meta-analyses of studies with nonlinear effects should be conducted and interpreted carefully when the correlation coefficient is the statistic of interest. Nonlinearity may be less of a concern in some domains, such as the relationship between general ability and overall job performance (Coward & Sackett, 1990). It may be more commonplace in nonability domains, such as the relationship between extraversion and effectiveness in interdependent teams, which may be characterized by an inverted U (Murphy, 1996). Correction for direct range restriction in this case would yield an overestimate of the population correlation, but other types of nonlinearity combined with range restriction could yield underestimates. In any case, we recommend that researchers use statistical and graphical methods of exploration at the individual study level whenever possible, reporting the magnitude of nonlinear relationships alongside any correlations or other statistical indexes of linearity (cf. Tukey, 1977). This practice, which is evolving and gaining popularity, should become standard for describing psychological relationships accurately.

In developing methods to meta-analyze nonlinear data, two somewhat antithetical avenues of research present themselves: (a) developing methods for statistical correction or transformation of different types of nonlinear data so that conventional meta-analytic approaches are appropriate and (b) developing meta-analytic methods for accumulating statistics other than the Pearson product-moment correlation and its analogs, such as the nonlinear statistic η², for studies within psychological domains where the type of nonlinearity and the x and y measures are similar. This type of research may pay off in terms of more statistically powerful moderator detection in meta-analysis, a problem that exists even in the bivariate normal case (cf. Sackett et al., 1986).
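As one concrete, if crude, candidate for avenue (b), η² can be estimated by cutting x into equal-count bins and taking the ratio of between-bin variance to total variance in y. The binning rule below is an arbitrary illustrative choice, not a recommended estimator; the purely curvilinear example shows η² detecting a relation that r² misses almost entirely.

```python
import numpy as np

def eta_squared(x, y, bins=20):
    """Crude eta-squared: between-bin sum of squares of y over the total
    sum of squares, with x cut into (roughly) equal-count bins."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    grand_mean = y.mean()
    ss_between = 0.0
    for b in range(bins):
        group = y[idx == b]
        if group.size:
            ss_between += group.size * (group.mean() - grand_mean) ** 2
    return ss_between / (y.size * y.var())

rng = np.random.default_rng(4)
x = rng.normal(size=50_000)
y = x ** 2 + rng.normal(scale=0.5, size=50_000)  # purely curvilinear relation

print(f"r^2   = {np.corrcoef(x, y)[0, 1] ** 2:.3f}")
print(f"eta^2 = {eta_squared(x, y):.3f}")
```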

References

Aguinis, H., & Whitehead, R. (1997). Sampling variance in the correlation coefficient under indirect range restriction: Implications for validity generalization. Journal of Applied Psychology, 82, 528-538.

Bobko, P. (1983). An analysis of correlations corrected for attenuation and range restriction. Journal of Applied Psychology, 68, 584-589.

Bock, D. R., & Petersen, A. C. (1975). A multivariate correction for attenuation. Biometrika, 62, 673-678.

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.

Brown, S. H., Stout, J. D., Dalessio, A. T., & Crosby, M. M. (1988). Stability of validity indices through test score ranges. Journal of Applied Psychology, 73, 736-742.

Brown, S. P., & Peterson, R. A. (1993). Antecedents and consequences of salesperson job satisfaction: Meta-analysis and assessment of causal effects. Journal of Marketing Research, 30, 63-77.

Callender, J. C., Osburn, H. G., Greener, J. M., & Ashworth, S. (1982). Multiplicative validity generalization model: Accuracy of estimates as a function of sample size and mean, variance, and shape of the distribution of true validities. Journal of Applied Psychology, 67, 859-867.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cornwell, J. M., & Ladd, R. T. (1993). Power and accuracy of the Schmidt and Hunter meta-analytic procedures. Educational and Psychological Measurement, 53, 877-895.

Coward, W. M., & Sackett, P. R. (1990). Linearity of ability-performance relationships: A reconfirmation. Journal of Applied Psychology, 75, 297-300.

Fisher, J. (1959). The twisted pear and the prediction of behavior. Journal of Consulting Psychology, 23, 400-405.

Ghiselli, E. E. (1966). The validity of occupational aptitude tests. New York: Wiley.

Greener, J. M., & Osburn, H. G. (1980). Accuracy of corrections for restriction in range due to explicit selection in heteroscedastic and nonlinear distributions. Educational and Psychological Measurement, 40, 337-346.

Gross, A. L. (1982). Relaxing the assumptions underlying corrections for restriction of range. Educational and Psychological Measurement, 42, 795-801.

Gross, A. L., & Fleischman, L. E. (1983). Restriction of range corrections when both distribution and selection assumptions are violated. Applied Psychological Measurement, 7, 227-237.


Gross, A. L., & Fleischman, L. E. (1987). The correction for restriction of range and nonlinear regressions: An analytic study. Applied Psychological Measurement, 11, 211-217.

Guion, R. M. (1965). Personnel testing. New York: McGraw-Hill.

Harris, M. J., & Rosenthal, R. (1985). Mediation of interpersonal expectancy effects: 31 meta-analyses. Psychological Bulletin, 97, 363-386.

Hom, P. W., Caranikas-Walker, F., Prussia, G. E., & Griffeth, R. W. (1992). A meta-analytical structural equations analysis of a model of employee turnover. Journal of Applied Psychology, 77, 890-909.

Hunter, J. E., & Schmidt, F. L. (1990a). Dichotomization of continuous variables: The implications for meta-analysis. Journal of Applied Psychology, 75, 334-349.

Hunter, J. E., & Schmidt, F. L. (1990b). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

Hunter, J. E., & Schmidt, F. L. (1994). Estimation of sampling error variance in the meta-analysis of correlations: Use of average correlation in the homogeneous case. Journal of Applied Psychology, 79, 171-177.

Johnson, J. W., & Sager, C. E. (1991, April). The robustness of range restriction correction due to incidental selection. Poster presented at the Sixth Annual Conference of the Society for Industrial and Organizational Psychology, St. Louis, MO.

Kemery, E. R., Mossholder, K. W., & Roth, L. (1987). The power of the Schmidt and Hunter additive model of validity generalization. Journal of Applied Psychology, 72, 30-37.

Koslowsky, M., & Sagie, A. (1993). Detecting moderators with meta-analysis: An evaluation and comparison of techniques. Personnel Psychology, 46, 629-639.

Koslowsky, M., & Sagie, A. (1994). Components of artifactual variance in meta-analytic research. Personnel Psychology, 47, 561-574.

Law, K. S., Schmidt, F. L., & Hunter, J. E. (1994). A test of two refinements in procedures for meta-analysis. Journal of Applied Psychology, 79, 978-986.

Lee, R., & Foley, P. P. (1986). Is the validity of a test constant throughout the test score range? Journal of Applied Psychology, 71, 641-644.

Lent, R. H., Aurbach, H. A., & Levin, L. S. (1971). Research design and validity assessment. Personnel Psychology, 24, 247-274.

Lindgren, B. W. (1993). Statistical theory (4th ed.). New York: Chapman & Hall.

Linn, R. L. (1968). Range restriction problems in the use of self-selected groups for test validation. Psychological Bulletin, 69, 69-73.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Mendoza, J. L., & Reinhardt, R. N. (1991). Validity generalization procedures using sample-based estimates: A comparison of six procedures. Psychological Bulletin, 110, 596-610.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120.

Murphy, K. R. (1996). Individual differences and behavior in organizations: Much more than g. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 3-30). San Francisco: Jossey-Bass.

Osburn, H. G., & Callender, J. (1992). A note on the sampling variance of the mean uncorrected correlation in meta-analysis and validity generalization. Journal of Applied Psychology, 77, 115-122.

Osburn, H. G., Callender, J. C., Greener, J. M., & Ashworth, S. (1983). Statistical power of tests of the situational specificity hypothesis in validity generalization studies: A cautionary note. Journal of Applied Psychology, 68, 115-122.

Paese, P. W., & Switzer, F. S. (1988). Validity generalization and hypothetical reliability distributions: A test of the Schmidt-Hunter procedure. Journal of Applied Psychology, 73, 267-274.

Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.

Peters, L. H., Hartke, D. D., & Pohlmann, J. T. (1985). Fiedler's contingency theory of leadership: An application of the meta-analysis procedures of Schmidt and Hunter. Psychological Bulletin, 97, 274-285.

Premack, S. L., & Hunter, J. E. (1988). Individual unionization decisions. Psychological Bulletin, 103, 223-234.

Raju, N. S., Burke, M. J., Normand, J., & Langlois, G. M. (1991). A new meta-analytic approach. Journal of Applied Psychology, 76, 432-446.

Raju, N. S., Pappas, S., & Williams, C. P. (1989). An empirical Monte Carlo test of the accuracy of the correlation, covariance, and regression slope models for assessing validity generalization. Journal of Applied Psychology, 74, 901-911.

Roth, L., & Sackett, P. R. (1991). Development and Monte Carlo evaluation of meta-analytic estimators for correlated data. Journal of Applied Psychology, 110, 318-327.

Sackett, P. R., Harris, M. M., & Orr, J. M. (1986). On seeking moderator variables in the meta-analysis of correlational data: A Monte Carlo investigation of statistical power and resistance to Type I error. Journal of Applied Psychology, 71, 302-310.

Sagie, A., & Koslowsky, M. (1993). Detecting moderators with meta-analysis: An evaluation and comparison of techniques. Personnel Psychology, 46, 629-640.

SAS Institute, Inc. (1995). SAS language, Release 6.10 [Computer software]. Cary, NC: Author.

Schmidt, F. L., Hunter, J. E., & Outerbridge, A. N. (1986). Impact of job experience and ability on job knowledge, work sample performance, and supervisory ratings of job performance. Journal of Applied Psychology, 71, 432-439.

Shadish, W. R. (1996). Meta-analysis and the exploration of causal mediating processes: A primer of examples, methods, and issues. Psychological Methods, 1, 47-65.

Spector, P. E., & Levine, E. L. (1987). Meta-analysis for integrating study outcomes: A Monte Carlo study of its susceptibility to Type I and Type II errors. Journal of Applied Psychology, 72, 3-9.

Switzer, F. S., Paese, P. W., & Drasgow, F. (1992). Bootstrap estimates of standard errors in validity generalization. Journal of Applied Psychology, 77, 123-129.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

U.S. Department of Labor. (1979). Manual for the USES General Aptitude Test Battery: Section III: Development. Washington, DC: U.S. Government Printing Office.

Viswesvaran, C., & Ones, D. (1995). Theory testing: Combining psychometric meta-analysis and structural equations modeling. Personnel Psychology, 48, 865-885.

Wainer, H. (1982). Robust statistics: A survey and some prescriptions. In G. Keren (Ed.), Statistical and methodological issues in psychology and social sciences research (pp. 187-214). Hillsdale, NJ: Erlbaum.

Whitener, E. (1990). Confusion of confidence intervals and credibility intervals in meta-analysis. Journal of Applied Psychology, 75, 315-321.

Appendix

The Use of Confidence Intervals

The confidence intervals (CIs) used in these simulations are not to be confused with CIs for a single meta-analysis. CIs in each of the present simulations were calculated on the mean values of ρ̂ and σ̂ρ across 100 meta-analyses based on the same conditions (i.e., the same distributional type, population correlation, number of studies, and statistical artifact distributions). CIs for a single meta-analysis, in contrast, are calculated on the uncorrected sample-weighted mean correlation (Whitener, 1990). All 95% CIs in the present article are based on the formula

$$\mathrm{CI}_{.95} = \bar{X} \pm t_{.025}(\mathit{df} = 99)\,\frac{s_X}{\sqrt{100}},$$

where X̄ is either mean ρ̂ or mean σ̂ρ across the 100 meta-analyses, and s_X is its standard deviation.

Note that ρ̂ and σ̂ρ are themselves weighted means, allowing the Central Limit Theorem to be invoked (cf. Lindgren, 1993). This theorem asserts that the sampling distribution of the mean closely approximates normality even at small to moderate sample sizes and even when nonnormality exists within samples (e.g., skewness in a sample of study correlation coefficients indexing a single population correlation coefficient that is not near the −1 or 1 endpoints). Therefore, symmetric CIs about mean ρ̂ and mean σ̂ρ can be constructed.
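Computationally, the formula amounts to the following sketch (the 100 ρ̂ values are placeholders; scipy supplies the t quantile):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Placeholder: 100 rho-hat values, one per simulated meta-analysis.
rho_hats = rng.normal(loc=0.25, scale=0.03, size=100)

x_bar = rho_hats.mean()
s_x = rho_hats.std(ddof=1)                # sample standard deviation
t_crit = stats.t.ppf(1 - 0.025, df=99)    # t_.025 with df = 99

half_width = t_crit * s_x / np.sqrt(100)
print(f"95% CI on mean rho-hat: "
      f"[{x_bar - half_width:.4f}, {x_bar + half_width:.4f}]")
```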

Received September 16, 1997

Revision received October 16, 1997

Accepted October 20, 1997