Phantom Behavioral Assimilation Effects: Systematic Biases in Social Comparison Choice Studies


Herbert W. Marsh,1 Marjorie Seaton,2 Hans Kuyper,3 Florence Dumas,4,5 Pascal Huguet,4 Isabelle Régner,4,6 Abraham P. Buunk,3 Jean-Marc Monteil,7 and Frederick X. Gibbons8

1 Oxford University; 2 University of Western Sydney; 3 University of Groningen; 4 Aix-Marseille University; 5 Jean Moulin Lyon III University; 6 University of Toulouse; 7 Université Blaise Pascal Clermont-Ferrand; 8 Dartmouth College

ABSTRACT  Consistent with social comparison theory (SCT), Blanton, Buunk, Gibbons, and Kuyper (1999) and Huguet, Dumas, Monteil, and Genestoux (2001) found that students tended to choose comparison targets who slightly outperformed them (i.e., upward comparison choices), and this had a beneficial effect on subsequent performance—a behavioral assimilation effect (BAE). We show (Studies 1 and 2) that this apparent BAE is due, in part, to uncontrolled measurement error in pretest achievement. However, using simulated data (Study 3), these phantom BAEs were eliminated with latent-variable models with multiple indicators. In Studies 4 and 5, latent-variable models were applied to the Blanton et al. and Huguet et al. data, resulting in substantially smaller but still significantly positive BAEs. More generally in personality research based on correlational data, failure to control measurement error in pretest/background variables will positively bias the apparent effects of personality variables of interest, but widely applicable approaches demonstrated here can correct for these biases.

The authors thank Hart Blanton, Ladd Wheeler, Jerry Suls, John Nezlek, Peter Tymms, John Hattie, and Oliver Lüdtke for comments on earlier drafts of this paper.
Correspondence concerning this article should be addressed to Professor Herbert W. Marsh, Department of Education, University of Oxford, 15 Norham Gardens, Oxford OX2 6PY, UK. E-mail: [email protected].

Journal of Personality 78:2, April 2010. © 2010, Copyright the Authors. Journal compilation © 2010, Wiley Periodicals, Inc. DOI: 10.1111/j.1467-6494.2010.00630.x

Our study is a substantive-methodological synergy in which new methodologies provide important new insights into unresolved substantive issues (Marsh & Hau, 2007). This perspective is particularly important in applied personality research, where single infallible indicators are typically not available. Especially when new and evolving statistical procedures are applied to substantively relevant issues, the substantive-methodological synergy has the potential to provide guidelines for, and identify problems in, what has been accepted as best practice. These might include critical issues applied researchers must address, guidance on how these issues should be addressed when there are no clear guidelines or golden rules (or to challenge existing golden rules), and directions that further substantive research should take.

Methodologically, we demonstrate the importance of using a latent variable approach with multiple indicators of all constructs and that failure to do so can result in paradoxical or unexpected results. Substantively, we reconsider findings from classic studies in social comparison theory (SCT) research showing that selection of more able comparison person targets (upward comparisons) results in better subsequent performance (a behavioral assimilation effect [BAE]) and apparent biases in the traditional analytic approaches used to demonstrate this effect. The BAEs reported in some published research are due, at least in part, to an inherent bias in the design and statistical analyses used to test for the effect. Here we demonstrate why this is so, discuss situations under which it is likely to occur, and provide a remedy to the problem.

Although the focus of this presentation is on a particular set of SCT findings, we also illustrate why this type of problem has wide applicability to personality research as well as psychological and social sciences more generally. Thus, for example, whenever personality researchers relate a personality variable to an outcome measure after controlling for pretest scores or background/demographic variables that are not appropriately controlled for measurement error, this type of phenomenon is likely to occur, positively biasing the estimated effect of the personality construct and, in extreme cases, even changing the direction of the effect.


Substantive Issue: Social Comparison Theory and Behavioral Assimilation Effect

People compare themselves to others in all sorts of ways, from comparing the kind of car they drive with their neighbors' cars to comparing their golf handicap with their golfing partner. Sometimes social comparison is forced upon a person by the social environment, but in many situations people have considerable freedom to choose with whom they compare themselves. As social comparison is such a major feature of our everyday lives, there is an abundance of research into the psychological mechanisms underlying this element of human behavior. Festinger's (1954) SCT describes how people use others to obtain a sense of their relative standing in order to evaluate their abilities and opinions. Because it can be difficult to determine objectively the extent of one's ability, comparisons with others can provide this information, establish a frame of reference for making the judgments, reduce uncertainties, and facilitate everyday decision making. Festinger proposed that comparison targets are more likely to be others similar in ability on the dimensions being evaluated, as they will provide the most accurate evaluations—a "similarity hypothesis" (e.g., Hakmiller, 1966; Radloff, 1966; Schachter, 1959; Wheeler, 1966) that has been "the most widely cited aspect of the theory" (Suls, 1977, p. 4). In the case of abilities, Festinger also stated there is a unidirectional drive upward that, combined with the similarity hypothesis, leads individuals to compare preferentially with slightly upward targets.

An important focus of SCT research has been the strategies (e.g., upward and downward comparison strategies) that individuals (the choosers) use to select comparison targets (the chosen) to maximize competing needs such as self-evaluation, self-improvement, and self-enhancement (Dijkstra, Kuyper, van der Werf, Buunk, & van der Zee, 2008; Huguet et al., 2009; Wood, 1989). Thus, upward evaluations might provide a basis of identification by the choosers with more accomplished chosen target persons. This might occur even though such target persons are likely to provide a more demanding basis of comparison for self-evaluations, leading to lower self-evaluations than would downward comparisons. Nevertheless, when asked to choose target persons with whom to compare themselves, SCT research shows that choosers typically select targets who are similar or slightly better than themselves (i.e., slightly upward rather than downward; see Blanton, Buunk, Gibbons, & Kuyper, 1999; Dijkstra et al., 2008; Huguet, Dumas, Monteil, & Genestoux, 2001; Suls & Wheeler, 2000).

Why do people choose to make upward comparisons? Wheeler (1966) proposed that the "comparer is attempting to prove to himself that he is almost as good as the very good ones" (p. 30). Collins (1996, 2000) surmised that choosers who use upward comparisons actually come to see themselves as possessing the same performance levels as their objectively better comparison targets (upward assimilation). Other studies that include both objective (actual performance) and subjective (self-perceptions) measures for both choosers and the chosen suggest that choosers perceive their chosen target to be of equal ability to them even though the chosen comparison targets are objectively somewhat better (Huguet et al., 2009; also see Marsh, Trautwein, Lüdtke, & Köller, 2008). For example, Nosanchuk and Erickson (1985) asked bridge players to indicate whether they preferred comparing their bridge playing performance with someone similar, better, or inferior to them. Participants said they preferred to compare themselves with others of similar ability. However, consistent with this, when participants actually named their chosen comparison targets, the chosen comparison targets were objectively somewhat better. Thus, participants perceived the superior players they chose as comparison targets as being similar to them—apparently reflecting the typical positive bias in self-perceptions. Hence, it is important to assess both objective and subjective measures for both choosers and their chosen comparison targets (also see Marsh, Seaton, et al., 2008).

What are the consequences of upward comparisons? Individuals typically evaluate themselves more negatively after comparing themselves with superior others and more positively after comparing themselves with inferior others—a contrast effect (e.g., Brickman & Bulman, 1977; Collins, 1996, 2000; Major, Testa, & Bylsma, 1991; Marsh, Trautwein, et al., 2008; Seaton, 2007; Suls & Wheeler, 2000; Wills, 1981). This contrast effect is also consistent with a large body of research showing that equally able students have lower self-evaluations when they attend schools where the average ability level is high and higher self-evaluations when they attend schools where school-average ability is low (Buckingham & Alicke, 2002; Marsh, 1987, 1991, 2007; Marsh & Craven, 2002; Marsh & Hau, 2003; Marsh, Seaton, et al., 2008; Marsh, Trautwein, Lüdtke, Baumert, & Köller, 2007). In the Tiger Woods Effect (Brown, 2007), professional golfers played significantly more poorly when Tiger Woods was playing, a contrast effect. However, assimilation is sometimes found when forced comparison is only slightly upward (e.g., Dumas, Huguet, Monteil, & Ayme, 2005; Huguet, Galvaing, Monteil, & Dumas, 1999; Johnson & Stapel, 2007). In free choice studies in classroom settings, assimilation effects are also particularly likely when the outcome is subsequent achievement performance (e.g., Blanton et al., 1999; Dijkstra et al., 2008; Dumas, Huguet, Monteil, Rastoul, & Nezlek, 2005; Huguet et al., 2001, 2009; also see Seaton et al., 2008). In their comprehensive review of social comparison in classroom settings, Dijkstra et al. concluded that upward comparison is more likely to have negative (contrast) effects on affect, self-evaluations, and particularly academic self-concept, but also to lead students to perform better (assimilation) in terms of subsequent school grades and achievement tests—a BAE. Hence, upward comparisons have been described as a "double-edged sword" (Major et al., 1991; see also Diener & Fujita, 1997; Seaton et al., 2008; Suls & Wheeler, 2000).

The focus of the present investigation is on assimilation and contrast in SCT choice studies in which participants are free to choose their comparison targets from among actual people in their everyday lives. Of particular interest for the present investigation are two large free choice studies (Blanton et al., 1999; Huguet et al., 2001) that evaluated predictions from SCT in applied educational settings. In both these SCT studies, high school students' academic performances were more likely to improve if they reported that they compared their exam grades with other students in their classroom who performed better than themselves (participants listed on a questionnaire their usual comparison target in each of seven courses). The substantial correlation between the achievement levels of participants and their chosen comparison targets (about .5 in both studies) was based on objective information from school records, so the result could not be explained in terms of a self-report bias. Consistent with the well-established "similarity hypothesis" combined with the upward drive principle, students typically chose targets who achieved at slightly higher levels. However, not only were the achievement scores of students and their chosen target substantially correlated, but students and targets also tended to be similar on a variety of characteristics not necessarily related to achievement (e.g., gender, age, ethnicity). Huguet et al. (2001) further demonstrated that students compared upward with close friends with whom they identified as a means of self-improvement and that this identification increased with increased academic control, psychological closeness, and (to a lesser extent) importance of academic domains (also see Dijkstra et al., 2008). Both these studies offered direct evidence that upward comparisons are associated with better academic achievement—a BAE. The findings in these two studies were very similar to each other (see also Huguet et al., 2009; Seaton et al., 2008) and consistent with the general pattern of results reported in the Dijkstra et al. review of the effects of upward comparison on achievement in classroom settings.

The Present Investigation: Methodological Issues, Latent Variable Models, and Measurement Error

SCT free choice studies of the BAE typically use manifest variable approaches—multiple regression or analysis of variance in which both independent and dependent measures are based on a single indicator (e.g., Blanton et al., 1999; Huguet et al., 2001; also see Seaton et al., 2008). An implicit assumption in this approach is that all measured variables are measured without measurement error. Although this assumption is generally false, what are the consequences? In our methodological-substantive synergy we address this question, demonstrate that the answer raises some, perhaps unexpected, substantive implications, show how this problem can be resolved with methodologically stronger models, illustrate how this problem and the proposed solutions have broad generalizability to personality research, and offer recommendations for future research.

It is well known that failure to control for measurement error attenuates relations among variables: Correlations are larger if unreliability is controlled. This observation has led researchers to sometimes assume, at least implicitly, that unreliability merely attenuates observed effects so that controlling for unreliability would necessarily result in even stronger effects. However, the consequences of unreliability in structural equation models (SEMs) involving many variables are more complicated. Indeed, failure to control for unreliability can either artificially increase or decrease a given path coefficient—a positive or a negative bias.
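For the simple bivariate case mentioned above, the familiar correction for attenuation (standard psychometric theory, not a formula given in the article) makes the first point explicit; here ρ_xx' and ρ_yy' denote the reliabilities of the two measures:

```latex
% Spearman's correction for attenuation (standard psychometrics, not from the article)
r_{xy} \;=\; r_{T_x T_y}\,\sqrt{\rho_{xx'}\,\rho_{yy'}}
\qquad\Longleftrightarrow\qquad
r_{T_x T_y} \;=\; \frac{r_{xy}}{\sqrt{\rho_{xx'}\,\rho_{yy'}}}
```

With several correlated predictors, however, attenuating one predictor's relations reallocates the shared variance to the other predictors, so an individual path coefficient can be biased upward or downward, as the following paragraph illustrates.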


Although the focus of our study is on SCT studies, the rationale for our predictions and methodological issues are very general and apply to many areas of study. For example, consider the general situation in which there are two predictor variables (X1, X2) and one outcome variable (Y). Assume a typical scenario in which (a) both X1 and X2 are positively correlated to each other and to the outcome variable Y and (b) in the true population (which generally is not known except in simulation studies) there is a true positive effect of X1 and X2 on Y. In this situation, there will generally be some variance that can be uniquely explained by X1, some uniquely explained by X2, and some that can be explained by either X1 or X2. If the relation between X1 and Y is diluted due to measurement error, the estimated effect of X1 will go down (relative to the true population value—a negative bias) and the estimated effect of X2 will increase (relative to the true population value—a positive bias). The effect of the measurement error in X1 will negatively bias the estimated effect of X1. However, the estimated effect of X2 will be positively biased because some of the variance that should be attributed to X1 (if it were not diluted by measurement error) will now be attributed (inappropriately) to X2. In this respect, measurement error in X1 will produce a positive bias in the estimation of X2. However, the size of the bias in X2 will depend to some extent on the size of the correlation between X1 and X2. In particular, if X1 and X2 are exactly uncorrelated, there is no overlap in the variance that can be explained by X1 and X2. In this special case, correction for measurement error in X1 will have no influence on the effect of X2. Importantly, if we control for measurement error in X1, then the estimated effects of X1 and X2 remain unbiased. We can accomplish this by collecting multiple indicators of X1 and using a SEM in which X1 is a latent construct purged of measurement error. In the present investigation, we demonstrate the nature of these effects with simulated data for which the population values are known but also for the real data analyses.
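A minimal simulation sketch of this two-predictor scenario (variable names and parameter values are illustrative assumptions, not taken from the article) shows the effect: error added only to X1 shrinks its estimated coefficient and inflates the coefficient of the correlated, error-free predictor X2.

```python
# Illustrative sketch (assumed parameter values): measurement error in X1 biases
# the regression estimate for a correlated, error-free predictor X2 upward.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# True (error-free) predictors, correlated at r = .5
r = 0.5
x1 = rng.standard_normal(n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)

# Outcome with true effects of .6 (X1) and .2 (X2)
y = 0.6 * x1 + 0.2 * x2 + rng.standard_normal(n)

def slopes(predictors, outcome):
    """OLS slopes (intercept included in the fit but not returned)."""
    X = np.column_stack([np.ones(len(outcome))] + list(predictors))
    return np.linalg.lstsq(X, outcome, rcond=None)[0][1:]

print("error-free X1:", slopes([x1, x2], y))      # approximately [.6, .2]

# Observed X1 contaminated with error (reliability about .5); X2 left intact
x1_obs = x1 + rng.standard_normal(n)
print("fallible X1  :", slopes([x1_obs, x2], y))  # X1 estimate shrinks, X2 estimate inflates
```

Modeling X1 as a latent variable with multiple indicators (the SEM strategy used later in the article) removes this bias; the sketch only shows the manifest-variable problem.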

Based on this general rationale as applied to SCT studies, we hypothesize that at least part of the observed BAE in these studies (e.g., Blanton et al., 1999; Huguet et al., 2001) is actually a systematic bias due to uncontrolled measurement error inherent in what was considered to be "best practice" methodology at the time these studies were conducted and still is the basis of most current research—even studies using multilevel modeling (e.g., Seaton et al., 2008). In the present investigation, we label this as a Phantom BAE (adapting the term "phantom effect" coined by Harker & Tymms, 2004, in a different context) and evaluate this possibility in a set of five studies. The first two studies are reanalyses of the Blanton et al. and Huguet et al. (2001) studies that found BAEs—students who chose more able comparison students subsequently were shown to perform better than students who chose less able comparison students. Because multiple indicators of the constructs—achievement scores or school grades for individual students and their comparison choices in each school subject—were not available, it was not possible to control for unreliability through the use of multiple indicators and test for Phantom BAEs in these results. However, following the strategy of Harker and Tymms, we systematically added random variance to the critical pretest measure of achievement to determine effects of unreliability in the pretest on the BAE. According to predictions based on the Phantom BAE, this should systematically increase the apparent size of the BAE—a positive bias due to the failure to control measurement error that we experimentally introduced in these original studies.

Inherent in the rationale for the Phantom BAE is the assumption that this is a systematic bias in the estimation of a BAE (a) biasing true BAEs in the direction of larger BAEs, (b) producing an apparent BAE even if the true BAE were zero, and (c) possibly resulting in an apparently positive BAE even if the true social comparison effect were negative (i.e., the true social comparison effect resulted in contrast rather than assimilation). The first of these predictions was supported in Studies 1 and 2, whereas the second two are evaluated in Study 3, using simulated data. Finally, in Study 3 we demonstrate how it is possible—under appropriate conditions—to control for the Phantom BAEs with latent variable models. Using simulated data, we tested a simple SEM in which each of the constructs in the typical BAE was based on multiple indicators. In all models the true (known) social comparison effect is exactly zero in the population (because of the way that the population models used to generate the simulated data were constructed) or negative (i.e., behavioral contrast rather than assimilation). In different data sets, we systematically varied aspects of the simulated data (particularly the pretest reliability posited to produce the Phantom BAE). We then tested each of the models using an appropriate SEM and the typical manifest approach (based on scale scores, single indicators of each construct implicitly assumed to be perfectly reliable) traditionally used in SCT choice studies. The critical question is whether the SEMs appropriately capture the true population values (BAE = 0 or -.15, a contrast effect), whereas, as predicted, the typical manifest models produce consistently biased estimates in the direction of the BAE—a Phantom BAE—that varies systematically in size with the unreliability of particularly the pretest measure of achievement. Importantly, we are not claiming that true BAEs are zero or negative in the real world. Rather, our focus is on the nature of the Phantom BAE—the bias in the observed BAE—if this actually was the case.

Finally, in Studies 4 and 5 we demonstrate how the SEM approach with multiple indicators used in Study 3 (simulation) can be adapted to be applicable to the Blanton et al. (1999) and Huguet et al. (2001) studies. Although there was only a single indicator for each school subject in these original studies, separate analyses were performed on multiple school subjects. Hence, by treating the scores from different school subjects as multiple indicators of each construct, we were able to test for Phantom BAEs.

STUDIES 1 AND 2

Studies 1 and 2 are reanalyses of data from the Blanton et al. (1999) and Huguet et al. (2001) studies that were also the basis of Seaton et al. (2008). These studies are described in more detail in these publications, so we provide only a brief overview here. All three studies evaluated BAEs based on manifest indicators that are likely to result in systematic biases in the observed BAEs.

Overview of Blanton et al.’s (1999) Data: Study 1

In the Netherlands, students begin their first year of high school when they are about 12 years old and attend school for between 4 and 6 years. At least for the first year, students attend all courses with the same group of classmates. Participants in the Blanton et al. (1999) study were 876 students (432 boys and 444 girls) in their first year of high school, across four Dutch schools, with a median age of between 12 and 13 years. Grades accessed at three times (T1, T2, and T3) during the academic year were used to determine performance. For each of seven school subjects, comparison-level choice was measured by asking students to nominate the classmate with whom they preferred to compare their grades (participants could also choose not to nominate a comparison target). This student's grade at Time 2 was then used to ascertain comparison direction. For present purposes we only consider three subjects that are consistent across both data sets (Dutch [native language], math, and history) but note that the results were consistent across the different subjects (see Blanton et al., 1999; Seaton et al., 2008).

Overview of Huguet et al.’s (2001) Data: Study 2

Two hundred sixty-four students (129 female), from 11 classes across two French high schools, ranging from 12 to 14 years of age, participated in the Huguet et al. (2001) study. Measures were the same as those used by Blanton et al. (1999), although the academic subjects differed somewhat. However, the data reported by Huguet et al. (2001) were part of a much larger, unpublished data set, consisting of 1,156 students (537 boys) from 51 classes across 12 French high schools that was subsequently used by Seaton et al. (2008) as well as by the present investigation. Participants took three school subjects in common: French (native language), history, and math. Participants in this larger data set were in the first to the fifth year of high school, with a mean age of 13.5 years. Comparison-level choice was measured as in Study 1. Grades, given on a 20-point scale with 20 representing a high grade, were taken from school reports on each of three occasions during the same academic year and were used to determine performance and comparison direction (i.e., upward or downward comparison).

Methods

Statistical Analyses

For present purposes (and consistent with the original studies), multiple regression was used to estimate path coefficients (Figure 1) based on data from each study considered separately. In each case, the analyses were initially conducted on the original raw scores. After conducting these analyses for all of the original scores, three additional conditions were considered in which a random variable (a random error component) was added to each (pretest) T1 achievement test score to produce a systematically less reliable score. In all cases, the random variables were based on a population that had a mean of zero. In Study 1 (where the achievement scores varied between 1 and 5), the SD of the random error component was 0.5, 1, and 2, respectively (where larger error components mean that reliability is smaller). In Study 2 (where achievement scores varied along a 20-point scale) the SD of the random error component was 1, 2, and 4, respectively, so that the proportion of measurement error was comparable to Study 1.
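The error-injection manipulation just described amounts to only a few lines of analysis code. The sketch below is a hedged illustration with assumed array names (t1_ach, t2_ach, t2_comp); the actual grade data are not reproduced here, and the regression is Model 1A on standardized scores.

```python
# Sketch of the Study 1/Study 2 reliability manipulation: add normally distributed
# noise to the T1 (pretest) achievement scores and re-estimate Model 1A
# (T2 achievement regressed on T1 achievement and comparison-target achievement).
import numpy as np

def standardized_betas_model_1a(t1, t2, comp):
    """Standardized OLS estimates for T2Ach regressed on T1Ach and T2Comp."""
    z = lambda v: (v - v.mean()) / v.std()
    X = np.column_stack([np.ones(len(t2)), z(t1), z(comp)])
    return np.linalg.lstsq(X, z(t2), rcond=None)[0][1:]   # [beta_T1Ach, beta_T2Comp]

def add_error(scores, sd, rng):
    """Return a less reliable copy of the scores (random error with mean 0, given SD)."""
    return scores + rng.normal(0.0, sd, size=len(scores))

# Hypothetical usage with arrays t1_ach, t2_ach, t2_comp loaded elsewhere:
# rng = np.random.default_rng(0)
# for sd in (0.5, 1.0, 2.0):            # the Random 1, 2, 3 conditions of Study 1
#     betas = standardized_betas_model_1a(add_error(t1_ach, sd, rng), t2_ach, t2_comp)
#     print(sd, betas)                  # the T2Comp weight (apparent BAE) grows with sd
```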

Results

Study 1 (see Table 1)

We begin with a summary of the results based on the original scores from Blanton et al.'s (1999) study of Dutch high school students. For present purposes, we consider three models that can be fit to data from both studies (see Figure 1, Models 1, 2, and 3) predicting posttest achievement (either T2 or T3) based on pretest (T1) achievement and the achievement level of the comparison target ("comp") chosen by the student. We then consider the same three models after random error was added to the pretest achievement score to simulate increasing levels of unreliability.

[Figure 1 appears here; its panels are labeled Model 1A, Model 1B, Model 1C, Model 2, and Model 3.]
Figure 1. Models 1A–1C: Three models based on single-indicator (manifest) constructs (Studies 1 and 2). Model 2: Three multiple-indicator (latent) constructs based on simulated data (Study 3). Model 3: Three multiple-indicator (latent) constructs based on measures for L = language (Dutch or French), M = mathematics, H = history (Studies 4 and 5). Curved lines between different indicators based on the same school subject represent correlated uniquenesses. For all three models, squares represent measured variables and ovals represent latent constructs. Dark straight lines with arrows at one end represent paths relating the different latent constructs.


Table 1
Study 1 Reanalysis of Blanton et al. (1999) Data: Three Models (M1, M2, M3; See Figure 1) Based on Original Data and Data With Different Amounts of Measurement Error (Random 1, Random 2, Random 3) Added to T1 Achievement Scores

                              Math Achievement          Dutch Achievement         History Achievement
Condition         Predictor   T2Ach   T3Ach   T3Ach     T2Ach   T3Ach   T3Ach     T2Ach   T3Ach   T3Ach
                              (M1A)   (M1B)   (M1C)     (M1A)   (M1B)   (M1C)     (M1A)   (M1B)   (M1C)
Original scores   T1Ach       .74*    .70*    .20*      .72*    .71*    .27*      .62*    .69*    .33*
                  T2Ach                       .68*                      .63*                      .57*
                  T2Comp      .19*    .13*    .00       .16*    .12*    .03       .24*    .17*    .03
                  Mult R-sq   .81*    .60*    .73*      .59*    .60*    .74*      .57*    .59*    .73*
Random 1          T1Ach       .69*    .65*    .15*      .45*    .45*    .10*      .37*    .44*    .18*
                  T2Ach                       .72*                      .77*                      .71*
                  T2Comp      .23*    .17*    .00       .33*    .30*    .04       .41*    .35*    .06*
                  Mult R-sq   .67*    .55*    .72*      .39*    .36*    .72*      .38*    .39*    .70*
Random 2          T1Ach       .55*    .51*    .09*      .34*    .36*    .09*      .35*    .39*    .14*
                  T2Ach                       .78*                      .79*                      .73*
                  T2Comp      .35*    .28*    .03       .37*    .33*    .04       .44*    .39*    .06*
                  Mult R-sq   .56*    .45*    .72*      .31*    .30*    .72*      .36*    .35*    .69*
Random 3          T1Ach       .40*    .36*    .03*      .21*    .21*    .04*      .18*    .22*    .07*
                  T2Ach                       .83*                      .81*                      .78*
                  T2Comp      .45*    .38*    .01       .41*    .38*    .04       .47*    .41*    .05*
                  Mult R-sq   .45*    .33*    .72*      .25*    .22*    .72*      .28*    .25*    .68*

Note. T1, T2, and T3 = Times 1, 2, and 3. Ach = achievement, Comp = achievement of comparison student. For each school subject (math, Dutch, history) three models were estimated predicting T2Ach from T1Ach and T2Comp (Model 1A), T3Ach from T1Ach and T2Comp (Model 1B), and T3Ach from T1Ach, T2Ach, and T2Comp (Model 1C). After conducting these analyses for all of the original scores, three additional conditions were considered—Random 1, Random 2, and Random 3—in which a random error component was added to the T1Ach score that had a SD of 0.5, 1, and 2, respectively. The same three models were tested with each of these three conditions.
*p < .05.


Original scores. In Model 1A, only pretest achievement and comparison student achievement are used to predict T2 achievement. In each of the three school subjects, there were statistically significant positive effects of the comparison target's achievement (.19 math, .16 Dutch, .24 history). Thus, students who chose more able comparison targets scored higher on the subsequent T2 achievement, even after controlling for T1 achievement—an observed BAE. There was also good support for the similarity hypothesis in that each student's own achievement was substantially correlated with that of the comparison target (.54 math, .45 Dutch, .50 history). However, not surprisingly, the best predictor of T2 achievement was T1 pretest achievement. These results are consistent with the original conclusions by Blanton et al. (1999) and Seaton et al. (2008).

In Model 1B, the same analysis was performed, with T3 achievement as the outcome measure, ignoring T2 achievement for now. Although the pattern of results is similar, the sizes of the effects are mostly smaller. In particular, BAEs were all about 30% smaller (.13 math, .12 Dutch, .17 history).

In Model 1C, T3 achievement is again the outcome measure, but it is predicted by the chooser's achievement at T1 and T2 as well as the T2 achievement of the comparison (chosen) student. The effect of the chosen comparison student's T2 achievement is close to zero and not statistically significant for any of the three school subjects; BAEs for T3 achievement were eliminated altogether in this model. However, the interpretation of these results requires further consideration. Coupled with the results of Model 1B, results based on Model 1C might suggest that there were no new social comparison effects (positive or negative) at T3 beyond the positive social comparison effects already experienced at T2. This might suggest that the BAE based on selection of a target person at any particular point in time is a relatively short-term effect with few long-term implications. Alternatively, it might mean that students subsequently chose new comparison targets after the T2 achievement scores. There is at least some support for this proposal from the original Huguet et al. (2001) study, which noted the inconsistency in students' choice of comparison targets over time. This pattern of results led Huguet et al. (2001) to "suggest that the effects of comparison-level choice (i.e., the level typical of persons with whom one chooses to compare) diminish over time" (p. 557) but proposed that "social comparison is in fact a dynamic process which evolves continuously over time" (p. 575). Consistent with this suggestion, Huguet et al. (2001) showed that T3 achievement is positively and significantly predicted by the achievement of the comparison target chosen at T3, after controlling for T2 achievement. Importantly, however, this observed pattern of results is also consistent with interpretations based on the Phantom BAE. T2 achievement scores are likely to be more reliable than T1 achievement scores,1 and clearly the optimal combination of T1 and T2 achievement scores is more reliable than particularly T1 achievement scores considered alone. Hence, the Phantom BAE should be substantially smaller when T3 achievement is controlled for two measures of prior achievement—particularly compared to Model 1A based only on T1 achievement.

Particularly relevant to our interpretation of Phantom BAEs is why the introduction of measurement error in T1 achievement did not result in more substantial Phantom BAEs in Model 1C, a point that we pursue further in Study 3. The answer is that we only introduced unreliability to T1 achievement, not T2 achievement (also see Footnote 1). Inspection of the results shows that T2 achievement was much more predictive of T3 achievement than T1 achievement, even though the T1 and T2 scores were substantially correlated. Hence, the predictability of T3 achievement was not much affected by adding unreliability to T1 achievement. Consistent with this explanation, as more measurement error was added to T1 achievement, the effects of T1 achievement went down, the effects of T2 achievement went up, the BAE remained small, and the multiple R2 values went down only slightly (e.g., from .73 to .72 for math in Study 1).

1. In the original Blanton et al. (1999) and Huguet et al. (2001) studies, the students were in their first year in high school—just having made the sometimes difficult transition to high school. Particularly at T1 (early in the school year), they were in the process of dealing with this transition, so that new study skills are needed, the classroom setting is different, and levels of achievement are likely to be less stable during this potentially traumatic period of adjustment. Also, school grades are based on some combination of teacher perceptions of students based on examination scores, class participation, and class assignments. Particularly at the start of the school year, all of the students are new to the teacher. Hence, there is a less stable basis for making assessments at the start of the year than later in the school year.


Had we also increased the measurement error in T2 achievement, then the Phantom BAE would have increased.

Reliability manipulations. Next we modified the original scores by adding small (Random 1), medium (Random 2), or large (Random 3) amounts of random variance to pretest scores to simulate the effects of increasing unreliability in T1 (pretest) achievement. In each case there were systematic shifts in the parameter estimates that were consistent with a priori predictions. Thus, for example, the test–retest correlations of T1 achievement with T2 and T3 achievement became systematically smaller as the measurement error in T1 achievement increased.

Of particular interest is what happens with the BAE as T1 achievement becomes more unreliable. Perhaps paradoxically (but consistent with a priori predictions), the apparent social comparison effect becomes systematically larger as the reliability of the pretest achievement decreases. For example, for mathematics based on Model 1A, the original social comparison effect (.19) increased to .23, .35, and .45 for the three levels of increasing unreliability, respectively. Because of the nature of the manipulations of the data, we can say unequivocally that these apparent increases in the size of the BAE reflect a systematic bias due completely to the (introduced) unreliability in the T1 achievement. The sizes of these biases are reasonably consistent across Models 1A and 1B applied to each of the three subjects. Hence, for the most unreliable scores, the apparent size of the BAE increased by a factor of 2 or more—a Phantom BAE due to a bias introduced by the additional measurement error added to T1 achievement scores.

In Model 1C there was no BAE in the original data. Although the introduction of unreliability into T1 achievement marginally increased the size of the BAE, it remained nonsignificant for math and Dutch but was statistically significant (but still small) for history. However, the effect of increasing unreliability of T1 achievement is evident in that the weight given to T1 achievement on T3 achievement became smaller as unreliability increased, whereas the weight given to T2 achievement became correspondingly larger.

Study 2 (see Table 2)

The models and presentation of results for Huguet et al.'s (2001) study of French high school students closely parallel those in Study 1.


Table 2
Study 2 Reanalysis of Huguet et al. (2001) Data: Three Models (M1, M2, M3; See Figure 1) Based on Original Data and Data With Different Amounts of Measurement Error (Random 1, Random 2, Random 3) Added to T1 Achievement Scores

                              Math Achievement          French Achievement        History Achievement
Condition         Predictor   T2Ach   T3Ach   T3Ach     T2Ach   T3Ach   T3Ach     T2Ach   T3Ach   T3Ach
                              (M1A)   (M1B)   (M1C)     (M1A)   (M1B)   (M1C)     (M1A)   (M1B)   (M1C)
Original scores   T1Ach       .76*    .63*    .19*      .72*    .71*    .30*      .68*    .64*    .34*
                  T2Ach                       .58*                      .58*                      .44*
                  T2Comp      .15*    .08*    .00       .20*    .15*    .03       .17*    .10*    .03
                  Mult R-sq   .70*    .45*    .55*      .69*    .63*    .73*      .59*    .47*    .55*
Random 1          T1Ach       .73*    .61*    .17*      .67*    .66*    .23*      .63*    .61*    .31*
                  T2Ach                       .60*                      .64*                      .47*
                  T2Comp      .17*    .10*    .00       .24*    .19*    .03       .21*    .13*    .03
                  Mult R-sq   .68*    .43*    .55*      .65*    .58*    .72*      .53*    .45*    .55*
Random 2          T1Ach       .64*    .53*    .12*      .54*    .52*    .12*      .58*    .54*    .23*
                  T2Ach                       .64*                      .73*                      .54*
                  T2Comp      .24*    .16*    .00       .34*    .29*    .04       .23*    .26*    .04
                  Mult R-sq   .59*    .27*    .54*      .54*    .46*    .70*      .49*    .38*    .53*
Random 3          T1Ach       .48*    .39*    .05       .34*    .33*    .07*      .38*    .37*    .14*
                  T2Ach                       .70*                      .78*                      .62*
                  T2Comp      .32*    .23*    .01       .44*    .38*    .04       .34*    .25*    .04
                  Mult R-sq   .45*    .27*    .54*      .39*    .33*    .70*      .33*    .26*    .52*

Note. T1, T2, and T3 = Times 1, 2, and 3. Ach = achievement, Comp = achievement of comparison student. For each school subject (math, French, history) three models were estimated predicting T2Ach from T1Ach and T2Comp (M1A), predicting T3Ach from T1Ach and T2Comp (M1B), and predicting T3Ach from T1Ach, T2Ach, and T2Comp (M1C). After conducting these analyses for all of the original scores, three additional conditions were considered—Random 1, Random 2, and Random 3—in which a random error component was added to the T1Ach score that had an SD of 1, 2, or 4, respectively. The same three models were tested with each of these three conditions.
*p < .05.

Original scores. As in Study 1, on the basis of Model 1A applied to the original scores, there were statistically significant positive effects of the comparison target's achievement for all three subjects (.15 math, .20 French, .17 history). Again, there was good support for the similarity hypothesis in that each student's own achievement was substantially correlated with that of the comparison target (.50 math, .53 French, .45 history). These results are consistent with the original results by Huguet et al. (2001) and Seaton et al. (2008) for these data. In Model 1B (with T3 achievement as the dependent variable but ignoring T2 achievement) the pattern of results is similar, but the sizes of the BAEs are 30%–50% smaller (.08 math, .15 French, .10 history). In Model 1C (with T3 achievement as the dependent variable but including T1 and T2 achievement as predictor variables), the BAEs were no longer statistically significant for any of the school subjects. (Huguet et al., 2001, also reported a nonsignificant finding when predicting the chooser's T3 achievement from the T2 chosen student's achievement while controlling for T2 scores.) In summary, these results are highly consistent with those based on our reanalysis of Blanton et al. (1999) in Study 1 of the present investigation.

Reliability manipulations. When we modified the pretest achievement scores by adding measurement error (random noise) to T1 achievement, the apparent size of the BAE increased substantially in Models 1A and 1B for all three subjects. In each case this bias for the most unreliable T1 achievement scores resulted in BAEs that apparently increased in size by a factor of 2 or 3—a Phantom BAE due to the additional measurement error in T1 achievement scores. For Model 1C, none of the BAEs based on the original scores was statistically significant, and the corresponding estimates after introducing additional measurement error to T1 achievement remained nonsignificant.

Discussion

Studies 1 and 2 provide remarkably consistent results. Based on the original (unmodified) scores, there was consistent support for a BAE based on Models 1A and 1B: Choosing a more able student as a comparison target had positive effects on subsequent achievement even after controlling for prior achievement. However, increasing the unreliability in the pretest achievement score (by adding random noise: measurement error) substantially increased the apparent size of this BAE. Because of the manner in which we manipulated the reliability of the T1 (pretest) achievement, we can say unequivocally that this apparent increase in the BAE was completely due to a bias caused by the unreliability of the pretest scores: a Phantom BAE.

What do our results—particularly those based on Models 1A and 1B—say about the nature of the assimilation effect reported by Blanton et al. (1999) and Huguet et al. (2001) and by the Seaton et al. (2008) reanalysis of these data? Unfortunately, we cannot say a lot without knowing the reliability of the achievement test scores. Because there is clearly some unreliability in the achievement scores, we can be confident in saying that the BAEs in these studies were at least somewhat positively biased—a Phantom BAE. Although it might be possible that much of the BAE could be explained as a Phantom BAE, the nature of the data considered here provides no basis for such an extreme conclusion. On the other hand, if pretest achievement measures were extremely reliable, then the Phantom BAE would be able to explain only a small portion of the observed BAE. In Study 3 we consider these issues further with simulated data.

STUDY 3: A LATENT VARIABLE APPROACH WITH SIMULATED DATA

Study 3 uses simulated data based on the population generating model in Figure 1 (Model 2; also see values used to generate the data in Table 3). Although not frequently used in personality research, simulated data are ideal for answering "what if" questions under a highly constrained situation completely under the control of the researcher who generates the data. Of course, the validity of the interpretations of analyses of simulated data is only as good as the assumptions used to generate the data. Here we ask two particularly provocative questions in relation to BAEs:

1. If the true BAE is actually zero in the population, is there a Phantom BAE such that the observed BAE is significantly positive, and can we eliminate this bias with appropriately constructed SEMs?

2. If the true social comparison effect is actually negative (i.e., contrast rather than assimilation), can the Phantom BAE operate such that the observed BAE is in the direction of assimilation (i.e., shift a true negative effect into a positive one due to this bias), and can we eliminate this bias with appropriately constructed SEMs?


Table 3
Study 3: Results Based on Simulated Data

                                        Models Representing the 12 Simulation Conditions
Condition                               M1    M2    M3    M4    M5    M6    M7    M8    M9    M10   M11   M12

Population parameters
  Factor loadings
    T1Ach                               .50   .75   .90   .50   .75   .90   .50   .75   .90   .50   .75   .90
    Comp                                .75   .75   .75   .75   .75   .75   .90   .90   .90   .90   .90   .90
    T2Ach                               .90   .90   .90   .90   .90   .90   .90   .90   .90   .90   .90   .90
  Path coefficient
    T1Ach–T2Ach                         .70   .70   .70   .70   .70   .70   .70   .70   .70   .70   .70   .70
  Factor correlations
    T1Ach–T1Ach                         .70   .70   .70   .70   .70   .70   .70   .70   .70   .70   .70   .70
    T1Ach–Comp                          .50   .50   .50   .70   .70   .70   .50   .50   .50   .70   .70   .70
  Reliability
    T1Ach                               .50   .80   .93   .50   .80   .93   .50   .80   .93   .50   .80   .93
    Comp                                .80   .80   .80   .80   .80   .80   .93   .93   .93   .93   .93   .93
    T2Ach                               .92   .93   .93   .93   .93   .93   .93   .93   .93   .92   .93   .93

Manifest variable estimates (Comp–T2Ach = 0)
  Path coefficients
    T1Ach–T2Ach                         .42   .57   .64   .36   .53   .63   .41   .57   .64   .33   .52   .62
    Comp–T2Ach                          .16   .07   .02   .26   .12   .04   .18   .08   .03   .29   .15   .05
    Mult R-sq                           .25   .37   .43   .28   .37   .43   .25   .37   .43   .29   .38   .43
  Correlations
    T1Ach–Comp                          .31   .40   .43   .44   .56   .60   .34   .43   .47   .48   .60   .65
    T1Ach–T2Ach                         .47   .60   .65   .47   .60   .65   .47   .60   .65   .47   .60   .65
    Comp–T2Ach                          .30   .30   .30   .42   .42   .42   .32   .33   .33   .45   .46   .46

Manifest variable estimates (Comp–T2Ach = -.15)
  Path coefficients
    T1Ach–T2Ach                         .43   .56   .62   .34   .51   .59   .41   .56   .63   .33   .50   .60
    Comp–T2Ach                          .04   -.05  -.09  .14   .01   -.06  .05   -.05  -.11  .16   .01   -.07
    Mult R-sq                           .20   .29   .34   .18   .26   .31   .18   .29   .35   .18   .26   .31
  Correlations
    T1Ach–Comp                          .31   .39   .43   .44   .51   .60   .34   .43   .46   .47   .60   .65
    T1Ach–T2Ach                         .44   .54   .58   .41   .55   .55   .43   .54   .58   .41   .51   .55
    Comp–T2Ach                          .18   .17   .17   .29   .29   .29   .19   .19   .19   .32   .32   .32

Note. T1, T2 = Times 1, 2. Ach = achievement, Comp = achievement of comparison student. All data were simulated according to the population generating model shown in Figure 1 (Model 2); 24 conditions were considered that systematically varied levels of pretest achievement factor loadings (.5, .75, .90), comparison student achievement (.75, .90), the correlation between pretest achievement and comparison student achievement (.5, .7), and the effect of comparison student achievement on posttest achievement (0, -.15). All 24 data sets were initially fit with structural equation models (all 24 solutions were able to fit the data in that the chi-square test of statistical significance was not statistically significant and the parameter estimates did not differ significantly from the population values used to generate the data). Data from all 24 conditions were then analyzed with a typical manifest (multiple regression) approach in which the three multiple indicators of each construct were averaged to obtain a single (manifest) scale score to represent each variable. Presented are key parameter estimates from these manifest variable analyses.


Importantly, because these data are simulated and not "real," we are not claiming that BAEs in the real world are negative or even zero. Rather, we are asking the hypothetical questions as to what would be the effects of measurement error if these scenarios actually were true.

In this simulation there are three constructs: T1 (pretest) achievement, T2 (posttest) achievement, and T2 achievement level of the comparison target student. Unlike in Studies 1 and 2, each of the constructs is based on responses to three multiple indicators (Model 2 in Figure 1). Each of the nine measured variables was based on data randomly sampled from a normal distribution having mean = 0 and SD = 1. To explore more fully the implications of the Phantom BAE, we generated 24 separate data sets according to a 3 × 2 × 2 × 2 factorial design:

- three levels of factor loadings relating each indicator of T1 achievement to the latent T1 achievement construct: .5, .75, .9
- two levels of factor loadings relating each indicator of social comparison student achievement to the latent social comparison student achievement construct: .75, .9
- two levels of correlation between the latent T1 achievement and comparison target factors: .5, .7
- two levels of true social comparison effect: 0 (neither assimilation nor contrast) or -.15 (contrast)

For each of these 24 data sets, a single replication was generated consisting of 10,000 cases.
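The following sketch illustrates one cell of this design. It is not the authors' generating code; the chosen cell and all names are assumptions for illustration. It generates latent scores, builds three indicators per construct, averages them into manifest scale scores, and shows that the manifest regression produces a positive apparent BAE even though the generating BAE is zero.

```python
# Rough sketch of one simulation cell (assumed parameter names; not the authors'
# generating code). Latent T1 achievement and comparison-target achievement are
# correlated, latent T2 achievement depends only on T1 (true BAE = 0), and each
# construct is measured by three indicators with the stated loadings.
import numpy as np

rng = np.random.default_rng(123)
n = 10_000

def indicators(latent, loading, rng):
    """Three congeneric indicators with a common standardized loading."""
    resid_sd = np.sqrt(1.0 - loading ** 2)
    return np.column_stack(
        [loading * latent + resid_sd * rng.standard_normal(len(latent)) for _ in range(3)]
    )

# One illustrative cell: T1 loadings .50, Comp loadings .75, r(T1, Comp) = .70, BAE = 0
r, path_t1, bae = 0.70, 0.70, 0.0
t1 = rng.standard_normal(n)
comp = r * t1 + np.sqrt(1.0 - r ** 2) * rng.standard_normal(n)
# Residual SD chosen so Var(t2) = 1 when the true BAE is 0
t2 = path_t1 * t1 + bae * comp + np.sqrt(1.0 - path_t1 ** 2) * rng.standard_normal(n)

# Manifest approach: average each construct's three indicators, then run the regression
t1_scale = indicators(t1, 0.50, rng).mean(axis=1)
comp_scale = indicators(comp, 0.75, rng).mean(axis=1)
t2_scale = indicators(t2, 0.90, rng).mean(axis=1)

X = np.column_stack([np.ones(n), t1_scale, comp_scale])
b_t1, b_comp = np.linalg.lstsq(X, t2_scale, rcond=None)[0][1:]
# b_comp is the apparent BAE; the true value is 0, so any clearly positive value is the phantom bias
print("apparent (phantom) BAE:", round(b_comp, 3))
```

A full replication would loop over all 24 combinations and would also fit the corresponding latent-variable SEM (not sketched here), which is the model shown in Study 3 to recover the true population values.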

Now we briefly justify the rationale for the selection of values we chose in this simulation study. We began by choosing factor loadings for T1 (pretest) achievement (school grades early in the school year) to represent a wide range of reliabilities (i.e., higher factor loadings mean that the scores are more reliable). Reliability estimates for the three levels are .50 (corresponding to factor loadings of .5), .8 (for factor loadings of .75), and .93 (for factor loadings of .90). The value of .50 was designed to reflect a reasonable lower bound for teacher-constructed achievement tests, school grades, and ad hoc measures of achievement. The .8 value was selected to reflect carefully constructed teacher-made tests, school grades, and typical achievement tests that might be administered as part of a particular study. The .93 value was designed to reflect good commercial tests of standardized achievement. For T2 achievement (school grades later in the school year), factor loadings were fixed at a reasonably high level of reliability (factor loadings of .90, representing a reliability of .93). This assumption was based on the fact that school grades assigned early in the school year (before teachers have had a chance to really get to know the ability levels of each student and students are still acclimating to a new class) are likely to be less reliable than those based later in the school year (that are likely to be based at least in part on accumulated evidence), particularly for new students after the transition from primary to secondary school (as was the case in the original Blanton et al., 1999, and Huguet et al., 2001, studies; see Footnote 1). Hence, the reliability for T2 achievement (the outcome variable in this analysis) was fixed at the upper end of the range of reliabilities considered for T1 achievement so that T2 achievement was either more reliable than T1 achievement or equally reliable to T1 achievement in different conditions.
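The loading-to-reliability mapping quoted above can be checked with the standard formula for the reliability of a unit-weighted composite of k standardized indicators sharing a common loading λ (standard psychometrics, not a formula stated in the article):

```latex
\rho \;=\; \frac{(k\lambda)^2}{(k\lambda)^2 + k\,(1-\lambda^2)}, \qquad k = 3:\quad
\lambda = .50 \Rightarrow \rho = .50, \quad
\lambda = .75 \Rightarrow \rho \approx .80, \quad
\lambda = .90 \Rightarrow \rho \approx .93 .
```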

We did, however, choose to vary the reliability of the comparison student, allowing it to take on the upper two of the three values considered for T1 achievement. Hence, the reliability of the T2 comparison student achievement was either the same as the T2 achievement (.93) or a lower value (.8). The rationale here is that even though T2 achievement and T2 achievement of the comparison student were based on the same set of scores, there was likely to be an additional source of error in the comparison student achievement based on the unreliability of the choice behavior. Indeed, Huguet et al. (2001) specifically noted that choice of comparison students was only moderately consistent over time, although newly selected comparison targets also tended to have achievement levels that were slightly higher (slightly upward comparisons). Finally, correlations between student achievement and the achievement of the comparison target were about .5 in Studies 1 and 2, so that the true score correlation (disattenuated for unreliability) would be systematically higher. Hence, we selected values of .5 and .7.

Results and Discussion

For all 24 cells of the design, the SEMs were able to appropriately recover estimates based on the population generating model (see Table 3); all chi-square values were nonsignificant and all parameter estimates were approximately the same as in the population generating model. Technically, this result is hardly surprising in that the same model was used to model the data as was used to generate it. Substantively, the results are important in that they demonstrate that, under at least these ideal conditions, SEMs control for measurement error appropriately, thereby completely eliminating the Phantom BAE. We now evaluate results based on the traditional manifest approach, considering first results based on the 12 cells in which the true population BAE is zero and then the 12 cells in which the true population BAE is negative (i.e., contrast rather than assimilation). For purposes of these analyses, each of the constructs was represented by a single manifest variable: the mean score based on the three multiple indicators considered in each of the SEMs.

True Population BAE = 0

We begin by evaluating the 12 cells in which the true population BAE is zero (neither assimilation nor contrast). In the corresponding analyses based on manifest variables, there is a statistically significant Phantom BAE in all 12 conditions. This is the key finding of these analyses, because the (known) true population BAE is zero. Hence, each of these observed BAEs is due completely to the Phantom BAE—a bias due to the failure to control for measurement error. The size of this bias varied systematically with the variables manipulated in the construction of the data. The bias was largest when T1 achievement was least reliable, particularly when the correlation between T1Ach and the comparison target was also large (path coefficients of .26 and .29). Under these conditions, the size of the Phantom BAE is still moderate when the reliability of the T1 achievement was substantial (.80), resulting in path coefficients of .12 and .15. It is only when the T1 achievement scores are extremely reliable (.93) that the size of the bias—the Phantom BAE—is small, path coefficients varying between .02 and .05 (and only statistically significant due to the large N). Thus, with highly reliable manifest indicators, there are almost no Phantom BAEs, although even these very small biases were evident in analyses of manifest variables and eliminated in the corresponding SEMs.

Variation in the reliability of the comparison target scores has only a small effect: Biases are slightly larger when these scores are more reliable. However, the size of the correlation between T1 achievement and comparison target scores has a larger effect on the size of the bias: The bias is systematically larger when the size of this correlation is larger (i.e., own achievement and that of the comparison target are highly correlated).

True Population Social Comparison Contrast Effect (-.15)

We now turn to the 12 cells in which the true (population) social comparison effect is negative—contrast rather than assimilation. Except for this social comparison effect (BAE = -.15), all other population parameters are the same as in the first 12 cells (with BAE = 0). Of particular relevance is the size and direction of the observed social comparison effect based on the manifest approach. Although the true population parameter is -.15 (a modest contrast effect), the biased estimates of this effect vary from -.11 (contrast) to +.14 (assimilation). Indeed, the biased estimates are in the direction of assimilation rather than contrast in 6 of the 12 cells. Again, the size of the bias is based substantially on the reliability of particularly the pretest measure of achievement, but also the size of the correlation between T1 achievement and comparison target scores. Again, the Phantom bias is systematically larger when pretest scores are less reliable and when the correlation between T1 and T2 achievement scores is larger.

These results demonstrate that the Phantom bias works systematically in the direction of assimilation, (a) producing an apparently positive BAE when the true BAE is zero and (b) in some cases producing an apparently positive BAE (assimilation) even when the true social comparison effect is negative (i.e., contrast rather than assimilation). Again, however, we are not claiming that the BAE is actually negative or even zero in the real world. Rather, we only demonstrate that if this really were the case, then the Phantom BAE could result in a misinterpretation of the direction of the effect, but that this bias is eliminated with appropriately constructed SEMs.

STUDIES 4 AND 5

In Studies 4 and 5, we return to the data from the original studies (Blanton et al., 1999; Huguet et al., 2001; Seaton et al., 2008). In each of these original studies (as well as in Studies 1 and 2 of the present investigation), separate analyses were conducted for each school subject. In Studies 4 and 5 we instead treat the three school subjects as multiple indicators of each latent construct. In this respect, we begin with latent-variable models (with multiple indicators) like those considered with simulated data in Study 3. However, as already noted, both studies are based on school grades that are idiosyncratic to each teacher. Individual teachers tend to grade on a curve and differ idiosyncratically in the materials used to determine grades and in the leniency with which grades are assigned. Following Seaton et al., we explore alternative ways of dealing with these problems in applying a latent-variable model to these data.

Methods

Study 4: Dutch (Blanton et al., 1999) Data (See Table 4)

The data for Study 4 are the same data used in Study 1. Dutch teachers in the Blanton et al. (1999) study were able to set their own tests and assign grades according to their own criteria. When teachers assign grades in this fashion, there is a typical grading-on-a-curve phenomenon such that the same student would receive a lower grade in a class where most of the students were brighter and a higher grade in a class where most of the students were less bright. Because of this potential idiosyncrasy in the absolute grade levels assigned by different teachers, Blanton et al. removed the between-class effects by creating a set of 32 dummy variables. Alternatively, Seaton et al. (2008), in their reanalysis of these same data, took advantage of information about the way students were assigned to different tracks so that grades were moderated to be on a common metric, a strategy that we used here as well (see Footnote 2, below; also see Seaton et al. for further discussion). For purposes of Study 4, we limited consideration to only those students who had complete data for all three subjects (n = 642). Results were based on nine measured variables (Model 3, Figure 1): T1 achievement, T2 achievement, and comparison student scores in each of three school subjects. When the same measured variables are collected on more than one occasion, there is typically a pattern of relations such that matching scores (e.g., T1 math grades and T2 math grades) are more highly correlated than nonmatching scores (e.g., T1 math grades and T2 history grades), particularly when grades on both occasions are assigned by the same teacher using idiosyncratic criteria. As recommended by Jöreskog (1979), an appropriate model should therefore include correlated uniquenesses relating these matching terms. As emphasized by Marsh and Hau (1996), failure to include correlated uniquenesses positively biases the size of test-retest correlations, which in turn biases other path coefficients. A similar argument can be made for relations between students' own grades and the grades of comparison students in matching and nonmatching school subjects. Hence, our a priori model includes three sets of three correlated uniquenesses (nine in all) relating the three scores associated with each school subject.

Footnote 2. In the Blanton et al. (1999) study, each student had one of three ability levels (labeled low, medium, and high) based on previous academic achievement. Individual classes represented various combinations of these levels, such that there were four class types: low, low/medium, low/medium/high, and medium/high. Thus, for example, "low" class types contained only students who were low in terms of prior achievement, whereas "medium/high" class types contained a mixture of students who were medium and high in terms of prior achievement. For purposes of the present study, these categorizations were used to rescale the average ability levels on a 0-3 scale according to class type, such that low = 0, medium = 1.5, and high = 3. Thus the four class types were assigned values of low = 0, low/medium = .75 (i.e., [0 + 1.5]/2), low/medium/high = 1.5 (i.e., [0 + 1.5 + 3]/3), and medium/high = 2.25 (i.e., [1.5 + 3]/2). We began by standardizing all scores within each class so that these preliminary scores had M = 0, SD = 1 for each class. The constant value based on the (known a priori) quality of students in each class was then added to the (within-class) standardized grades of each student, depending on the class type the student attended. Thus, scores for students in the low-track classes had a value of 0 added and did not change, whereas scores for students in the medium/high-track classes had a constant of +2.25 added. After these adjustments, achievement scores for all students were standardized so that the grand mean across all students was zero. Hence, the rank ordering of students within each class was not altered by our transformation, but the adjusted scores were systematically higher in classes where students were initially brighter. (For a more detailed rationale for this adjustment procedure and support for its appropriateness, see Seaton et al., 2008.) We also note that these transformations were not necessary in Study 1 of the present investigation (or in the original Blanton et al. study) because analyses were conducted separately for each subject (where the class and teacher were held constant) rather than across different subjects, where the teachers (and their idiosyncrasies in assigning grades) differed.
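To make the rescaling described in Footnote 2 concrete, the following is a minimal sketch of the transformation, assuming a pandas DataFrame with hypothetical columns grade, class_id, and class_type; it illustrates the procedure as described rather than reproducing the authors' code.

import pandas as pd

# Track constants from Footnote 2 (averages of low = 0, medium = 1.5, high = 3)
TRACK_CONSTANT = {
    "low": 0.0,
    "low/medium": 0.75,
    "low/medium/high": 1.5,
    "medium/high": 2.25,
}

def rescale_grades(df: pd.DataFrame) -> pd.Series:
    """Standardize grades within each class, add the class-type constant,
    then re-standardize across all students (grand mean of zero)."""
    # Step 1: z-score within each class (M = 0, SD = 1 per class)
    z = df.groupby("class_id")["grade"].transform(lambda g: (g - g.mean()) / g.std())
    # Step 2: shift each class by the constant reflecting its track composition
    shifted = z + df["class_type"].map(TRACK_CONSTANT)
    # Step 3: standardize across the whole sample
    return (shifted - shifted.mean()) / shifted.std()

# Example usage with made-up data
df = pd.DataFrame({
    "class_id":   [1, 1, 1, 2, 2, 2],
    "class_type": ["low", "low", "low", "medium/high", "medium/high", "medium/high"],
    "grade":      [5.0, 6.5, 7.0, 6.0, 7.5, 9.0],
})
df["grade_adjusted"] = rescale_grades(df)
print(df)

As in the footnote, the within-class rank order of students is unchanged, but the adjusted scores are systematically higher in classes whose track composition indicates brighter students.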

Although not a focus of the original Blanton et al. (1999) study or of the reanalyses of these data in Study 1 of the present investigation, these data are nested: Students are clustered within classes. Depending on the number of classes, the number of students in each class, and the size of the cluster effects, single-level analyses that ignore this clustering yield biased estimates of standard errors: Standard errors tend to be too small, inflating Type 1 error rates. To take these clustering effects into account and control for this bias, we used the "complex design" option in Mplus (Muthén & Muthén, 2008).
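As a rough illustration of the kind of a priori model just described, the sketch below writes the nine-indicator latent-variable model, including the nine correlated uniquenesses among matching school subjects, in lavaan-style syntax and fits it with the semopy package. The column names (t1_dutch, comp_math, etc.) and the choice of semopy are assumptions for illustration; this is not the authors' Mplus setup, which also used the "complex design" option to handle clustering. Estimates from the actual models appear in Table 4.

import pandas as pd
import semopy  # assumed: any SEM tool that accepts lavaan-style syntax would do

MODEL = """
# Measurement model: three school subjects as indicators of each latent construct
T1Ach  =~ t1_dutch + t1_math + t1_hist
T2Comp =~ comp_dutch + comp_math + comp_hist
T2Ach  =~ t2_dutch + t2_math + t2_hist

# Structural model: the T2Comp -> T2Ach path, controlling for T1Ach, is the BAE
T2Comp ~ T1Ach
T2Ach  ~ T1Ach + T2Comp

# Correlated uniquenesses among matching school subjects (three per subject, nine in all)
t1_dutch ~~ comp_dutch
t1_dutch ~~ t2_dutch
comp_dutch ~~ t2_dutch
t1_math ~~ comp_math
t1_math ~~ t2_math
comp_math ~~ t2_math
t1_hist ~~ comp_hist
t1_hist ~~ t2_hist
comp_hist ~~ t2_hist
"""

def fit_bae_model(df: pd.DataFrame) -> pd.DataFrame:
    """Fit the latent-variable BAE model and return parameter estimates."""
    model = semopy.Model(MODEL)
    model.fit(df)           # df holds the nine (rescaled) grade variables
    return model.inspect()  # includes the T2Ach ~ T2Comp path, i.e., the BAE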

Table 4
Latent Variable (Multiple Indicator) Models of the Phantom Behavioral Assimilation Effect

Study 4: Dutch responses

                         Factor    Residual variance (own number) and correlated
Measured variable        loading   uniquenesses (other numbers)
T1 Achievement
  1 Dutch                  .85     1: .27
  2 Math                   .81     2: .34
  3 History                .82     3: .32
T2 Comparison student
  4 Dutch                  .83     1: .04   4: .31
  5 Math                   .81     2: .02   5: .35
  6 History                .80     3: .06   6: .37
T2 Achievement
  7 Dutch                  .88     1: .09   4: .03   7: .23
  8 Math                   .85     2: .14   5: .02   8: .28
  9 History                .82     3: .12   6: .09   9: .33

Path coefficients (Study 4): T2Comp on T1Ach = .73; T2Ach on T1Ach = .95; T2Ach on T2Comp (BAE) = .06

Study 5: French responses

                         Factor    Residual variance (own number) and correlated
Measured variable        loading   uniquenesses (other numbers)
T1 Achievement
  1 French                 .78     1: .39
  2 Math                   .75     2: .43
  3 History                .85     3: .27
T2 Comparison student
  4 French                 .63     1: .08   4: .60
  5 Math                   .63     2: .08   5: .61
  6 History                .65     3: .02   6: .58
T2 Achievement
  7 French                 .81     1: .01   4: .17   7: .35
  8 Math                   .80     2: .06   5: .22   8: .36
  9 History                .86     3: .08   6: .04   9: .25

Path coefficients (Study 5): T2Comp on T1Ach = .51; T2Ach on T1Ach = .95; T2Ach on T2Comp (BAE) = .09

Note. Completely standardized parameter estimates. T1 and T2 = Times 1 and 2; Comp = comparison person's score. Each factor loading is on the latent construct heading its block. Residual entries give each measured variable's residual variance (listed under its own number, 1-9) and its correlated uniquenesses with the matching school subject in the other constructs (listed under those variables' numbers). Three school subjects were considered in each study: math, history, and native language (Dutch or French). The a priori model provides a very good fit to both data sets: chi-square(15) = 33.18 (Dutch) and 5.26 (French); Comparative Fit Index = .99 (Dutch) and 1.00 (French); Root Mean Square Error of Approximation = .03 (Dutch) and .00 (French). However, the solution for the Dutch data was technically not fully proper in that the number of classrooms was less than the number of estimated parameters, dictating that results should be interpreted cautiously.


Study 5: French (Huguet et al., 2001) Study (See Table 4)

French students in the Huguet et al. (2001) study were given grades on a 20-point scale that were taken from school reports and were used to determine performance and comparison direction. Importantly, school grades in the French system are specifically designed to be comparable across school subjects, classes, and schools (i.e., to counteract the typical grading-on-a-curve effect). Furthermore, schools were selected so that grades were comparable across schools, based on information provided by the local director of education (P. Huguet, personal communication, February 2, 2005). For this reason, and because there was no other basis for scaling the class-average achievement values of the different classes, the same transformation was used for each class. Hence, for purposes of the present investigation, grades were simply standardized (M = 0, SD = 1). Except for this transformation of grades, the statistical analyses and latent-variable (multiple indicator) models considered in Study 5 were the same as in Study 4. As in Study 4, we limited consideration to only those students who had complete data for all three subjects (n = 946).

Results and Discussion

For both studies, all latent constructs are well defined (all factor loadings are statistically significant and substantial), and the a priori model provides an excellent fit to the data. Results based on both the Dutch and the French data (Table 4) are similar in providing support for the a priori predictions about the application of latent-variable (multiple indicator) models. Of particular relevance are the path coefficients relating comparison student scores to T2 achievement scores, controlling for the effects of T1 achievement. Consistent with the results of the original studies, the effects of comparison student scores are positive for both Study 4 (.06, p < .05) and Study 5 (.09, p < .05), a true BAE. However, consistent with predictions based on the Phantom BAE, both of these observed BAEs are substantially smaller than the estimates based on the multiple regression analyses of manifest variables in Studies 1 and 2. These results provide clear support for the need to evaluate social comparison effects with appropriate latent-variable models that control for unreliability in SCT choice studies. As shown in the earlier studies of the present investigation, failure to do so results in positively biased estimates of the BAE: a Phantom BAE. Importantly, however, even after controlling for the bias due to unreliability, both studies continued to demonstrate a small but statistically significant assimilation effect.

Although Studies 4 and 5 each support the need for appropriate latent-variable models, there are important limitations in these studies that require further consideration and, perhaps, more research. In particular, the reliance on school grades (and the ad hoc ways in which we dealt with them in each study) is a matter of concern. Alternative solutions to this problem might use standardized test scores instead of school grades, or use a combination of standardized test scores and school grades such that potentially idiosyncratic school grades from different classes are equated in relation to standardized test scores (see, e.g., Marsh, Kong, & Hau, 2000). A second complication is that school grades for the same student in different school subjects are likely to differ by more than just random error. In this respect, estimates of unreliability are likely to be inflated (i.e., systematic variation is likely to be treated as random error, contributing to unreliability). To some extent, the inclusion of correlated uniquenesses in the SEM provides a control for this problem. Nevertheless, a stronger design would have multiple indicators for each construct based on the same academic subject (as was the case for the simulated data used in Study 3). Finally, although not given much attention in the present investigation, failure to control for cluster effects can result in substantially biased estimates of standard errors. Particularly in studies in which the sample sizes are not large (in terms of numbers of classes and students), this can result in inflated Type 1 error rates. Whereas there are alternative analytical approaches to address this problem (e.g., multilevel models or complex design effects, as used here), they typically require large sample sizes (in terms of students and classes) when the number of estimated parameters is large (as will typically be the case when constructs are represented by multiple indicators).

SUMMARY AND IMPLICATIONS

In Studies 1 and 2 we demonstrated small but statistically significant (positive) social comparison effects, BAEs, based on reanalyses of data from two published studies (Blanton et al., 1999; Huguet et al., 2001; Seaton et al., 2008). These results were consistent with the findings reported in the original publications. However, when we manipulated the T1 (pretest) achievement scores by adding measurement error (i.e., decreasing their reliability), the apparent size of the BAEs increased substantially. The results conclusively answered several important questions but left others unanswered. Conclusively, we showed that increasing levels of measurement error (unreliability) in T1 achievement scores produce a systematic, positive bias in estimated BAEs: the Phantom BAE. Furthermore, failure to control for this measurement error will always result in biased estimates in SCT free-choice studies like those considered here. However, because there was no basis for estimating the actual reliability of T1 achievement scores in either of these studies, we could not evaluate how much of the observed BAE was a bias due to the Phantom BAE. Most importantly, because we had only a single estimate of each construct and no reasonable estimate of reliability in these separate analyses of each subject, we were unable to control for the Phantom BAE in order to obtain a more accurate estimate of the true BAE. In summary, we could confidently say that the BAE estimates in the original studies were positively biased and demonstrate that this bias might be substantial, but we could not estimate the actual size of the bias in either Study 1 or Study 2.
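A minimal sketch of this error-injection manipulation follows, assuming a DataFrame with hypothetical columns t1, comp, and t2 holding standardized pretest, comparison-target, and posttest scores; it is an illustration of the general procedure, not the original analysis code.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def bae_with_degraded_pretest(df: pd.DataFrame, added_error_sd: float,
                              rng: np.random.Generator) -> float:
    """Add noise to the observed pretest and return the re-estimated BAE
    (the coefficient on the comparison-target score)."""
    noisy_t1 = df["t1"] + rng.normal(0.0, added_error_sd, size=len(df))
    X = sm.add_constant(pd.DataFrame({"t1": noisy_t1, "comp": df["comp"]}))
    fit = sm.OLS(df["t2"], X).fit()
    return fit.params["comp"]

# With real data, the estimated BAE should creep upward as added_error_sd grows:
# rng = np.random.default_rng(1)
# for sd in (0.0, 0.25, 0.5, 1.0):
#     print(sd, bae_with_degraded_pretest(df, sd, rng))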

To pursue these limitations, in Study 3 we constructed simulated data sets that experimentally manipulated critical parameters of the data, particularly the reliability of T1 achievement, while fixing the true BAE to be either zero (neither assimilation nor contrast) or negative (contrast rather than assimilation). Based on manifest-variable models using multiple regression analyses, as in Studies 1 and 2, we demonstrated that there were statistically significant Phantom BAEs under all 24 conditions. In the conditions in which the true population social comparison effect was zero (BAE = 0), the estimated (biased) Phantom BAEs were all positive and, under some conditions (e.g., low T1 achievement reliability), moderately large. In the conditions in which the true population social comparison effect was negative (contrast rather than assimilation), half of the conditions resulted in significant assimilation rather than contrast. Taken together, the results of Study 3 demonstrate that the Phantom BAE is a systematic bias in the direction of assimilation; it can produce an illusory BAE when none exists and can even produce an illusory BAE when the true social comparison effect is contrast rather than assimilation. Importantly, because each of the latent constructs had multiple indicators, we were also able to demonstrate that the application of traditional SEMs completely eliminated the Phantom BAE in all 24 conditions. Again, however, we are not claiming that BAEs actually are negative or even zero in the real world. Rather, we only demonstrate that if this really were the case, then the Phantom BAE could result in misinterpretation of the direction of the effect, but that this bias is eliminated with appropriately constructed SEMs.

Finally, in Studies 4 and 5, we applied latent-variable (multiple indicator) models to the data considered in Studies 1 and 2. Despite complications in applying the latent-variable model from Study 3 to these data, the results showed that controlling for unreliability in the pretest measures substantially reduced the size of the BAE. Importantly, however, even after controlling for unreliability, the BAEs were still significantly positive (in the direction of assimilation). Hence, these results suggest that the observed positive BAEs in both of these studies represent a combination of a Phantom BAE (a positive bias) and a true positive BAE. Particularly relevant to the present investigation is that these studies also demonstrate a new approach to the evaluation of BAEs, based on latent-variable models that control for Phantom BAEs.

What are the implications of this study for SCT free-choice studies? On the basis of our results, we recommend that researchers always strive to maximize reliability and collect multiple indicators of their key constructs, particularly the pretest measures that are especially critical to the biases associated with the Phantom BAE. However, this is not sufficient, as there will always be some measurement error that will produce some bias in observed BAEs. The use of multiple indicators allows SCT researchers to apply SEMs that, under appropriate conditions, eliminate Phantom BAEs. In evaluating SCT choice studies, it is important to reassess studies based on manifest-variable approaches. All such studies using data collection and methodology like those in Studies 1 and 2 are likely to be positively biased, at least to some extent. Importantly, the present investigation demonstrates how the effect of unreliability can result in a systematic positive bias (overestimating positive effects and underestimating negative effects) that is very different from the typical attenuation effect of unreliability (i.e., underestimating the size of both positive and negative relations). Whereas it is clearly unwarranted to claim that all BAEs in SCT choice studies are Phantom BAEs (e.g., the true BAEs were still positive in Studies 4 and 5 after controlling for measurement error), the methodological approaches demonstrated here provide a much stronger basis for pursuing these issues.

It is also important to identify potential limitations and directions for further research. Simulation studies are limited by the assumptions underlying the construction and analysis of the simulated data and by the appropriateness of the values chosen for the conditions that were experimentally manipulated (and those that were not). In particular, the validity of the conclusions depends at least in part on the generalizability of the values used in the simulation study. However, it is easy to show mathematically that the direction of the bias must be positive so long as the correlations among the T1 achievement of the chooser, the T2 achievement of the chooser, and the T1 achievement of the chosen target are all positive (i.e., test-retest correlations are positive and choosers choose comparison targets similar to themselves). Because these conditions appear to be very general in SCT research, at least the direction of the bias should have broad generalizability. However, even when the direction of the bias is positive, the Phantom BAE resulting from this bias will tend to disappear, even with manifest-variable approaches, under conditions of extremely reliable measurement.
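The sign of the bias can be made explicit with a simple errors-in-variables argument. The derivation below is a hedged illustration under a stylized setup (standardized latent pretest, comparison target, and a single outcome); it is not the exact Study 3 generating model, but it captures why the bias must be positive under the stated conditions.

% Stylized setup: latent pretest X and comparison target C are standardized with
% corr(X, C) = r > 0; the true model is Y = \beta X + \gamma C + u with \beta > 0.
% The observed pretest is X_{obs} = X + e, with reliability
% \rho = \mathrm{Var}(X)/\mathrm{Var}(X_{obs}), so \mathrm{Var}(e) = (1-\rho)/\rho.
%
% Solving the population normal equations for the regression of Y on (X_{obs}, C)
% gives the limiting value of the estimated comparison-target effect:
\[
  b_C \;=\; \gamma \;+\; \underbrace{\frac{\beta\, r\,(1-\rho)}{1-\rho\, r^{2}}}_{\text{Phantom BAE}} .
\]
% The bias term is strictly positive whenever \beta > 0, r > 0, and \rho < 1; it
% vanishes as \rho \to 1 and grows as reliability falls or as r rises. With
% \gamma = 0 the observed effect is spuriously positive; with a modest negative
% \gamma (contrast), a sufficiently large bias term can flip the sign toward
% assimilation. The completely standardized coefficient differs from b_C only by
% the positive factor SD(C)/SD(Y), so the sign argument is unaffected.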

We also note that many SCT studies, particularly those in educational settings, are likely to have a multilevel structure (e.g., students nested within classes, classes nested within schools). Hence, the Seaton et al. (2008) reanalysis of the data considered here used a multilevel statistical analysis because these data did have a multilevel structure. If data have a strong multilevel structure, the standard errors are likely to be substantially underestimated, leading to inflated Type 1 error rates. Whereas our simulation (Study 3) was constructed not to have a multilevel structure, SCT choice studies in educational settings typically will. SCT studies have typically applied neither multilevel modeling nor SEMs (for notable exceptions regarding multilevel modeling, see Dumas et al., 2005; Huguet et al., 2009; Seaton et al., 2008), and we know of no previous SCT choice studies that have applied currently available procedures that incorporate both SEMs based on multiple indicators (to control for unreliability) and multilevel models or complex design effects (like those we used in Studies 4 and 5) that take into account the cluster effects typical of educational research and of social science research more generally. Clearly, the incorporation of these evolving statistical procedures into SCT choice studies is an important direction for future research, and their absence is a potentially important limitation of existing research.

We also note that, although the focus of the present investigation is on SCT and the effects of choosing more or less able comparison targets, the general principles have much broader application to a wide variety of quasi-experimental (and even experimental) research in which researchers rely on analysis-of-covariance strategies to correct for preexisting differences. The results of the present investigation clearly demonstrate that failure to adequately control measurement error in pretest covariates can lead to systematic biases in the estimated effects of other predictors of the outcome variables, but that under appropriate conditions this bias can be eliminated. The nature, direction, and size of this bias will vary with the pattern of relations among the variables involved, which is idiosyncratic to particular applications but can be derived mathematically under appropriate circumstances. Although this observation is not new, its implications are often not fully recognized in applied research in personality, psychology, and the social sciences more generally. Thus, for example, whenever personality researchers relate a personality construct to an outcome measure after controlling for pretest scores or background/demographic variables that are not appropriately corrected for measurement error, the estimated effect of the personality variable is likely to be biased, possibly substantially. Although the size of this bias will vary with the particular circumstances (and with characteristics like those considered here), it can be substantial. In extreme circumstances, as demonstrated here, the bias can even change the direction of the effect (i.e., result in a significantly positive effect when the true effect is actually negative). Importantly, the SEM latent-variable models demonstrated here provide a way to control for these biases. In this sense, the methodological aspects of the present investigation have broad applicability to personality and psychological research more generally.

REFERENCES

Blanton, H., Buunk, B. P., Gibbons, F. X., & Kuyper, H. (1999). When better-than-others compare upward: Choice of comparison and comparative evaluation as independent predictors of academic performance. Journal of Personality and Social Psychology, 76, 420-430.

Brickman, P., & Bulman, R. J. (1977). Pleasure and pain in social comparison. In J. M. Suls & R. L. Miller (Eds.), Social comparison processes: Theoretical and empirical perspectives (pp. 149-186). Washington, DC: Hemisphere.

Brown, J. (2007). Quitters never win: The (adverse) incentive effects of competing with superstars. Berkeley: Department of Agricultural and Resource Economics, University of California, Berkeley.

Buckingham, J. T., & Alicke, M. D. (2002). The influence of individual versus aggregate social comparisons and the presence of others on self-evaluations. Journal of Personality and Social Psychology, 83, 1117-1130.

Collins, R. (1996). For better or worse: The impact of upward social comparison on self-evaluations. Psychological Bulletin, 119, 51-69.

Collins, R. (2000). Among the better ones: Upward assimilation in social comparison. In J. Suls & L. Wheeler (Eds.), Handbook of social comparison: Theory and research (pp. 159-171). New York: Kluwer Academic/Plenum.

Diener, E., & Fujita, F. (1997). Social comparisons and subjective well-being. In B. P. Buunk & F. X. Gibbons (Eds.), Health, coping and well-being: Perspectives from social comparison theory (pp. 329-358). Mahwah, NJ: Erlbaum.

Dijkstra, P., Kuyper, H., van der Werf, G., Buunk, A. P., & van der Zee, Y. G. (2008). Social comparison in the classroom: A review. Review of Educational Research, 78, 828-879.

Dumas, F., Huguet, P., Monteil, J.-M., & Ayme, E. (2005). Context effects in the Stroop task: Knowledge of one's relative standing makes a difference. Current Psychology Letters: Cognition, Brain, and Behavior, 16, 1-12.

Dumas, F., Huguet, P., Monteil, J.-M., Rastoul, C., & Nezlek, J. B. (2005). Social comparison in the classroom: Is there a tendency to compare upward in elementary school? Current Research in Social Psychology, 10, 166-187.

Festinger, L. (1954). A theory of social comparison processes. Human Relations, 7, 117-140.

Hakmiller, K. L. (1966). Threat as a determinant of downward comparison. Journal of Experimental Social Psychology, 2, 32-39.

Harker, R., & Tymms, P. (2004). The effects of student composition on school outcomes. School Effectiveness and School Improvement, 15, 177-199.

Huguet, P., Dumas, F., Marsh, H. W., Régner, I., Wheeler, L., Suls, J., et al. (2009). Clarifying the role of social comparison in the Big-Fish-Little-Pond Effect (BFLPE): An integrative study. Journal of Personality and Social Psychology, 97, 156-170.

Huguet, P., Dumas, F., Monteil, J.-M., & Genestoux, N. (2001). Social comparison choices in the classroom: Further evidence for students' upward comparison tendency and its beneficial impact on performance. European Journal of Social Psychology, 31, 557-578.

Huguet, P., Galvaing, M.-P., Monteil, J.-M., & Dumas, F. (1999). Social presence effects in the Stroop task: Further evidence for an attentional view of social facilitation. Journal of Personality and Social Psychology, 77, 1011-1025.

Johnson, C. S., & Stapel, D. A. (2007). No pain, no gain: The conditions under which upward comparisons lead to better performance. Journal of Personality and Social Psychology, 92, 1051-1067.

Jöreskog, K. G. (1979). Statistical estimation of structural models in longitudinal investigations. In J. R. Nesselroade & B. Baltes (Eds.), Longitudinal research in the study of behavior and development (pp. 303-351). New York: Academic Press.

Major, B., Testa, M., & Bylsma, W. H. (1991). Responses to upward and downward social comparisons: The impact of esteem-relevance and perceived control. In J. Suls & T. A. Wills (Eds.), Social comparison: Contemporary theory and research (pp. 237-260). Mahwah, NJ: Erlbaum.

Marsh, H. W. (1987). The big-fish-little-pond effect on academic self-concept. Journal of Educational Psychology, 79, 280-295.

Marsh, H. W. (1991). Failure of high-ability schools to deliver academic benefits commensurate with their students' ability levels. American Educational Research Journal, 28, 445-480.

Marsh, H. W. (2007). Self-concept theory, measurement and research into practice: The role of self-concept in educational psychology. London: British Psychological Society.

Marsh, H. W., & Craven, R. (2002). The pivotal role of frames of reference in academic self-concept formation: The Big Fish Little Pond effect. In F. Pajares & T. Urdan (Eds.), Adolescence and education (pp. 83-123). Greenwich, CT: Information Age.

Marsh, H. W., & Hau, K.-T. (1996). Assessing goodness of fit: Is parsimony always desirable? Journal of Experimental Education, 64, 364-390.

Marsh, H. W., & Hau, K.-T. (2003). Big-fish-little-pond effect on academic self-concept: A cross-cultural (26-country) test of the negative effects of academically selective schools. American Psychologist, 58, 364-376.

Marsh, H. W., & Hau, K.-T. (2007). Applications of latent-variable models in educational psychology: The need for methodological-substantive synergies. Contemporary Educational Psychology, 32, 151-171.

Marsh, H. W., Kong, C. K., & Hau, K.-T. (2000). Longitudinal multilevel models of the Big-Fish-Little-Pond Effect on academic self-concept: Counterbalancing contrast and reflected glory effects in Hong Kong schools. Journal of Personality and Social Psychology, 78, 337-349.

Marsh, H. W., Seaton, M., Trautwein, U., Lüdtke, O., Hau, K.-T., O'Mara, A. J., et al. (2008). The big-fish-little-pond effect stands up to critical scrutiny: Implications for theory, methodology, and future research. Educational Psychology Review, 20, 319-350.

Marsh, H. W., Trautwein, U., Lüdtke, O., Baumert, J., & Köller, O. (2007). The big fish little pond effect: Persistent negative effects of selective high schools on self-concept after graduation. American Educational Research Journal, 44, 631-669.

Marsh, H. W., Trautwein, U., Lüdtke, O., & Köller, O. (2008). Social comparison and big-fish-little-pond effects on self-concept and other self-belief constructs: Role of generalized and specific others. Journal of Educational Psychology, 100, 510-524.

Muthén, L., & Muthén, B. (2008). Mplus user's guide (5th ed.). Los Angeles, CA: Author.

Nosanchuk, T. A., & Erickson, B. H. (1985). How high is up? Calibrating social comparison in the real world. Journal of Personality and Social Psychology, 48, 624-634.

Radloff, R. (1966). Social comparison and ability evaluation. Journal of Experimental Social Psychology, 2, 6-26.

Schachter, S. (1959). The psychology of affiliation. Stanford, CA: Stanford University Press.

Seaton, M. (2007). The big-fish-little-pond effect under the grill: Tests of its universality, a search for moderators, and the role of social comparison. Unpublished doctoral dissertation, University of Western Sydney.

Seaton, M., Marsh, H. W., Dumas, F., Huguet, P., Monteil, J.-M., Régner, I., et al. (2008). In search of the big fish: Investigating the coexistence of the Big-Fish-Little-Pond effect with the positive effects of upward comparison. British Journal of Social Psychology, 47, 73-103.

Suls, J. (1977). Social comparison theory and research: An overview from 1954. In J. Suls & R. L. Miller (Eds.), Social comparison processes: Theoretical and empirical perspectives (pp. 1-19). Washington, DC: Wiley.

Suls, J., & Wheeler, L. (2000). A selective history of classic and neo-social comparison theory. In J. Suls & L. Wheeler (Eds.), Handbook of social comparison: Theory and research (pp. 3-19). New York: Kluwer Academic/Plenum.

Wheeler, L. (1966). Motivation as a determinant of upward comparison. Journal of Experimental Social Psychology, 2, 27-31.

Wills, T. A. (1981). Downward comparison principles in social psychology. Psychological Bulletin, 90, 245-271.

Wood, J. V. (1989). Theory and research concerning social comparisons of personal attributes. Psychological Bulletin, 106, 231-248.