
Measurement equivalence and latent mean differences of personality scores across different media and proctoring administration conditions

http://dx.doi.org/10.1016/j.chb.2014.04.010
0747-5632/© 2014 Elsevier Ltd. All rights reserved.

Gargi Sawhney ⇑, Konstantin P. Cigularov
Old Dominion University, 5115 Hampton Blvd., Norfolk, VA 23529, USA

⇑ Corresponding author. Address: Old Dominion University, Department of Psychology, Mills Godwin Building, Room 250, Norfolk, VA 23529, USA. Tel.: +1 773 615 0128.
E-mail addresses: [email protected] (G. Sawhney), [email protected] (K.P. Cigularov).


Keywords: Measurement equivalence; Latent means; Personality; Paper-and-pencil test; Computer-based test; Proctoring

Despite substantial interest and research in measuring personality, little is known about the measurement equivalence and mean differences in scores on personality measures across different administration conditions. The aim of the present study was to assess measurement equivalence and latent and observed mean differences of scores on the Big Five factor markers from the International Personality Item Pool across three conditions: paper-and-pencil proctored, computer-based proctored, and computer-based non-proctored. Undergraduate students (N = 401) from a Midwestern university responded to the personality questionnaire in one of the three conditions. Results indicated configural, metric, scalar, and invariant uniqueness equivalence for four of the five scales across the three conditions; Conscientiousness scores showed partial metric equivalence across computer-based proctored and computer-based non-proctored conditions. Apart from latent and observed mean differences for Emotional Stability scores in paper-and-pencil proctored vs. computer-based non-proctored conditions, no significant differences were found for the other four personality scales. These findings justify both collection and comparison of personality data using the Big Five factor markers and similar personality assessments across the three conditions. Future research should attempt to replicate the findings of the current study in high-stakes environments.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Since the popularization of computers and the emergence of the Internet, more and more organizations have begun to switch from traditional paper-and-pencil (PP) formats of their psychological measures to computer-based (CB) formats due to the many advantages of CB testing (Stanton, 1998). CB testing, in turn, has led to the emergence of non-proctored testing, in which a respondent completes a measure in the absence of a proctor (Tippins et al., 2006). In fact, many organizations have developed CB versions of their measures for human resource practices (Mead & Blitz, 2003; Salgado & Moscoso, 2003) and educational purposes (Wise & Plake, 1990). This interest in CB testing hinges on the advantages that CB administration offers relative to the traditional PP format. For example, using computers and the Internet for collecting psychological data usually requires fewer resources for test administration, maintenance, and updates, and at the same time can ensure more efficient and accurate data entry and scoring (Booth-Kewley, Larson, & Miyoshi, 2007). The Internet can also provide easier access to large test-taker pools, which is difficult to achieve with PP methodology (Buchanan & Smith, 1999).

However, the widespread use of technology for psychological assessment purposes has raised concerns among test developers, administrators, and users about the comparability of scores and their interpretation between different formats (Skitka & Sargis, 2006). Thus, there has been a growing interest in establishing the measurement equivalence (ME) of psychological measures administered in CB and PP formats (Donovan, Drasgow, & Probst, 2000). In fact, current best practices in testing recommend that ME be demonstrated before directly comparing scores of CB and PP psychological measures (see Standards for Educational and Psychological Testing, 1999).

While most of the research on ME between different media of administration has focused on cognitive ability measures (Mead & Blitz, 2003), researchers have also begun to examine the ME of non-cognitive psychological measures, such as personality questionnaires. Initial findings suggest that administration format has little effect on scores of personality measures (e.g., Chuah, Drasgow, & Roberts, 2006; Vecchione, Alessandri, & Barbaranelli, 2012). However, only a handful of studies used random assignment of participants to administration conditions (e.g., Meade, Michels, & Lautenschlager, 2007), or ensured complete anonymity of participants (e.g., Buchanan & Smith, 1999; Davidov & Depner, 2011). Furthermore, because PP administrations are usually proctored and CB administrations usually use the Internet and are not proctored, only a few published studies have attempted to disentangle media effects from proctoring effects (e.g., Chuah et al., 2006). Finally, we were able to locate only one study that performed both strict tests of ME and tests of latent mean differences (Meade et al., 2007). Therefore, more research utilizing ME and latent mean difference analyses is needed to establish the interchangeability of scores on CB and PP personality measures, while controlling for proctoring effects.

To address the above issues, the current study assessed the ME and latent mean differences of scores on the Big Five factor markers from the International Personality Item Pool (Goldberg, 1992) across three administration conditions: paper-and-pencil proctored (PPP), computer-based proctored (CBP), and computer-based non-proctored (CBN). For the purposes of the present study, PPP administration was defined as using a paper copy of the questionnaire to respond to measures in a controlled setting (i.e., a research lab) in the presence of a researcher; CBP administration was defined as using a computer to respond to measures in a controlled setting (i.e., a research lab) in the presence of a researcher; and CBN administration involved completing measures on a computer in a non-controlled setting (i.e., outside the research lab) without a researcher present.

According to Drasgow and Kanfer (1985), ME is inferred "when the relations between observed scores and latent constructs are identical across relevant groups" (p. 662). Stated differently, assessing the ME of personality scores across different administration conditions allowed us to examine the extent to which the items of the evaluated personality measures were interpreted and answered in the same way by respondents across conditions (Cheung & Rensvold, 2002; Vandenberg & Lance, 2000). ME ensures that scores on personality measures administered under different conditions or using different media can be interpreted and compared in a meaningful way. If there is no evidence of ME, "the basis for drawing scientific inference is severely lacking: findings of differences between individuals and groups cannot be unambiguously interpreted" (Horn & McArdle, 1992, p. 117).

More specifically, we aimed to determine whether participants in the three administration conditions: (a) adopted the same frame-of-reference (i.e., configural equivalence), (b) used similar metrics (i.e., metric equivalence), (c) exhibited similar response bias (i.e., scalar equivalence), (d) were similarly consistent in their responses (i.e., invariant uniqueness), and (e) showed similar mean levels of the measured personality constructs (i.e., latent and observed mean differences) while completing the personality measures (Vandenberg & Lance, 2000). Finding support for the above-mentioned equivalences would suggest interchangeability of scores across conditions and justify comparisons of observed score means (Raju, Laffitte, & Byrne, 2002) and collapsing of data (Cole, Bedeian, & Feild, 2006).
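In confirmatory factor analytic terms, these successively stricter forms of equivalence can be written as equality constraints on the parameters of a one-factor measurement model. The summary below is our own formalization in conventional notation, not an equation reproduced from the article.

```latex
% One-factor measurement model for condition (group) g
\begin{aligned}
\mathbf{x}^{(g)} &= \boldsymbol{\tau}^{(g)} + \boldsymbol{\lambda}^{(g)}\,\xi^{(g)} + \boldsymbol{\delta}^{(g)},
  \qquad \operatorname{Cov}\!\left(\boldsymbol{\delta}^{(g)}\right) = \boldsymbol{\Theta}^{(g)} \\[4pt]
\text{configural:}\quad & \text{same pattern of fixed and free loadings in every } g \\
\text{metric:}\quad & \boldsymbol{\lambda}^{(1)} = \boldsymbol{\lambda}^{(2)} \\
\text{scalar:}\quad & \boldsymbol{\tau}^{(1)} = \boldsymbol{\tau}^{(2)} \quad (\text{given metric}) \\
\text{invariant uniqueness:}\quad & \boldsymbol{\Theta}^{(1)} = \boldsymbol{\Theta}^{(2)} \quad (\text{given scalar}) \\
\text{equal latent means:}\quad & \kappa^{(1)} = \kappa^{(2)} \quad (\text{testable once scalar equivalence holds})
\end{aligned}
```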

1.1. Medium effects on measurement equivalence

While some scholars have argued that medium of administration has little effect on responses to psychological measures (Booth-Kewley, Edwards, & Rosenfeld, 1992; Ford, Vitelli, & Stuckless, 1996), others have insisted that responses collected via computers are not equivalent to those collected via paper copies (Booth-Kewley et al., 2007; Lautenschlager & Flaherty, 1990). As early as 1969, Evan and Miller suggested that socially desirable responding may be reduced when measures containing sensitive and personal items are administered via a computer because of the greater anonymity perceived by respondents. More recently, Lautenschlager and Flaherty (1990) found that CB administration of personality measures increased the likelihood of impression management. McKenna and Bargh (2000) pointed out that anonymity is less of a concern with the Internet; hence, individuals are more likely to open up about aspects of their lives that they are usually secretive about, leading to more honest responding. On the other hand, individuals completing PP measures may have a higher motivation to respond in a desirable manner (i.e., acquiescence response bias; Robert, Lee, & Chan, 2006), which may compromise the equivalence of responses between PP and CB conditions.

Mead and Blitz (2003) cautioned that even though CB non-cognitive measures may provide more privacy, respondents may not always trust them, especially if they are required to provide some form of identification to gain access. Other researchers have suggested that individuals' responses to personality measures may be affected by unfamiliarity with computers (Chuah et al., 2006). Mead and Blitz (2003) also noted that respondents can complete CB measures at non-standardized hours and locations. Further, the manner in which items are presented in CB testing may lead to differences in responses between the PP and CB conditions (Hofer & Green, 1985). Items on a PP test can be viewed all at once, which may lead participants to respond in haste and possibly overlook some items. By contrast, respondents may pay more attention to the items in CB testing if only a few items are presented on the computer screen at a time, thus affecting the interpretation of items.

Despite speculation about plausible reasons why medium of administration may affect the ME between PP and CB versions of personality measures, there is a scarcity of research evidence to support these concerns. In fact, we were unable to locate any studies that found non-equivalence of personality measures across PP and CB forms of administration. However, we found two studies that reported non-equivalence across non-cognitive measures (i.e., Fang, Wen, & Prybutok, 2013; Hirai, Vernon, Clum, & Skidmore, 2011). Interestingly, the majority of the research suggests that medium of administration does not affect the comparability of responses to personality measures (Chuah et al., 2006; De Beuckelaer & Lievens, 2009; Vecchione et al., 2012). Mead and Blitz's (2003) meta-analytic review of six studies, which used a within-subjects design, revealed support for ME of non-cognitive measures across CB and PP forms of administration. Similar findings were reported in a meta-analysis of 65 medical studies, which revealed ME of measures of health status across PP and CB formats in patient samples (Gwaltney, Shields, & Shiffman, 2008).

Given the preponderance of research supporting the comparability of scores on non-cognitive, especially personality, measures between PP and CB administration conditions, we expected that the Big Five factor markers would perform similarly across the two conditions in terms of psychometric properties (i.e., measurement equivalence), as well as mean levels (i.e., latent and observed means). Thus, we hypothesized the following:

Hypothesis 1. Personality scores will demonstrate measurement (i.e., configural, metric, scalar, and invariant uniqueness) equivalence across paper-and-pencil proctored and computer-based proctored administration conditions.

Hypothesis 2. Personality scores will demonstrate equal latent and observed means across paper-and-pencil proctored and computer-based proctored administration conditions.


1.2. Proctoring effects on measurement equivalence

Psychological measures can be administered in proctored as well as non-proctored settings. Proctored testing takes place when the test-taker completes a measure in the presence of a proctor, usually at a specified location and time (Ployhart, Weekley, Holtz, & Kemp, 2003). On the other hand, non-proctored testing occurs in the absence of a proctor or test administrator, and the test-taker has the freedom to choose the location and often the time for taking the test.

A number of reasons have been offered as to why personality measures may produce non-equivalent responses across proctored and non-proctored administration settings. For example, non-proctored (as compared to proctored) testing environments may increase respondents' perceptions of anonymity and control and decrease demand characteristics, which can motivate respondents to disclose sensitive information and provide more valid responses (Davis, 1999). On the other hand, respondents in proctored settings can ask questions for clarification, and thus interpret the questions as the researcher intended, resulting in more accurate responses (Mead & Blitz, 2003). Further, it may be difficult in non-proctored conditions to ascertain who completed the measure, whether they were assisted by other people, and how many times the person completing the measure reviewed or edited it before submitting (Tippins et al., 2006). Despite the flexibility and practicality allowed by CB non-proctored measures in terms of non-standardized completion hours and locations, Mead and Blitz (2003) caution that such flexibility may come with a price. As Buchanan and Smith (1999, p. 130) put it, "someone doing the test, for example, in a noisy computer lab will experience a different set of environmental stimuli or distractions to somebody completing the test on a computer in their home or workplace," which can affect the interpretation of and responding style to personality measures, thus impacting their ME.

Interestingly, the limited research that exists regarding the effects of proctoring (i.e., proctored vs. non-proctored administration) on responses to personality measures largely fails to support the above speculations. For instance, using a within-subjects design, Templer and Lange (2008) showed high correlations between proctored and non-proctored Internet testing on responses to five personality scales. In their study, participation was not anonymous, and breaks were allowed but not recorded during the proctored condition. Bartram and Brown (2004) tested ME of a Big Five personality measure across proctored PP and non-proctored CB administrations by comparing covariance structures, and found comparable results. In a study of job applicants who completed the Hogan Personality Inventory under proctored and non-proctored conditions, Davies and Wadlington (2006) revealed adequate metric equivalence; no ME tests beyond metric equivalence were performed.

Overall, previous research on ME of personality measures across proctored and non-proctored administration conditions, while very limited, suggests that the psychometric performance of personality measures may not be affected by different proctoring conditions. Consequently, we proposed the following hypotheses:

Hypothesis 3. Personality scores will demonstrate measurement (i.e., configural, metric, scalar, and invariant uniqueness) equivalence across computer-based proctored and computer-based non-proctored administration conditions.

Hypothesis 4. Personality scores will demonstrate equal latent and observed means across computer-based proctored and computer-based non-proctored administration conditions.

2. Method

2.1. Participants

Participants were undergraduate students from psychology classes at a Midwestern university (n = 415), who completed the study in one of three conditions: (a) paper-and-pencil proctored (PPP; n = 112, response rate = 50.9%), (b) computer-based proctored (CBP; n = 118, response rate = 51.7%), and (c) computer-based non-proctored (CBN; n = 185, response rate = 83.3%). Fourteen cases were excluded from the CBN condition due to missing data, reducing the final sample size to 401.

The mean age of all participants was 21.2 years (SD = 4.24), and 58.6% were male. Approximately 4.3% of participants did not indicate their age, while 4.1% did not report their gender. The racial composition was 50.0% Caucasian, 25.3% Asian/Pacific Islander, 5.8% African American, 5.2% other, and 4.0% multiracial. About 0.7% of the participants were unsure of their race, and 0.2% chose 'Not applicable'. Participants who did not indicate their race made up 4.1% of the sample. In our sample, 28.9% were sophomores, 25.6% were seniors, 21.5% were juniors, and 19.1% were freshmen; the remaining 5.2% of the students did not indicate their university status. The majority of participants were full-time students (89.0%), while only a few reported part-time student status (6.9%). In terms of work status, 49.6% of participants were working at the time of the survey administration, 46.0% were not working, and 4.3% did not indicate their work status. Participants represented more than 25 different majors, among which the most represented were engineering (41.2%), architecture (16.1%), and psychology (10.4%).

2.2. Procedure

Participants were recruited through in-class announcements and flyers in the Psychology department of a Midwestern university. Interested students were asked to sign up by providing their name and institutional email address. At this point, students were randomly assigned to one of three conditions: (a) taking the survey in a paper-and-pencil format in a research lab with the researcher present, i.e., the paper-and-pencil proctored condition (PPP); (b) taking the survey on a computer in a research lab with the researcher present, i.e., the computer-based proctored condition (CBP); and (c) taking the survey on a computer at a location of their choice without the researcher present, i.e., the computer-based non-proctored condition (CBN).

The formatting and layout of the survey were identical across the three above-mentioned conditions. CB surveys were hosted by Survey Monkey. Participants in the CBP and CBN conditions were required to respond to all items before proceeding to the next page. Although they could go back and edit their responses to items, they could not skip questions. Participants in the PPP condition were instructed not to skip questions and, at the end of the session, were asked to double-check for missing responses.

Students who were assigned to the two proctored conditions were sent an email with instructions and a link to a calendar site to sign up for one of the time slots provided to participate in the study. After they scheduled a session using the web link, they received an email confirming the date and time of the appointment. Participants in both PPP and CBP conditions checked in by providing their name, e-mail address, the class to which they wanted to apply extra credit, and the name of the instructor for that class. Participants in both of these proctored conditions completed the survey in a lab using either paper-and-pencil or a computer. A researcher was present at all times in the two proctored conditions. Prior to completing the survey, participants in all three conditions (i.e., PPP, CBP, and CBN) were required to read, initial, and date a cover letter, which informed them about the study, procedures, risks, and benefits.

Students assigned to the CBN condition were sent an email with instructions and a link to the survey, which they were asked to complete within 7 days. Participants in the CBN condition were sent two reminder emails to ensure higher response rates. At the end of the survey, participants were asked to provide personal information (name, e-mail address, the class to which they wanted to apply extra credit, and the name of the instructor for that class). However, participants were informed that their personal information would not be linked to their survey responses. Participation in the study was voluntary. In all three conditions, participants were asked to read a cover letter, which informed them about the study, procedures, risks, and benefits. All participants were offered course credit for participating in the study. Approval from the Institutional Review Board was obtained prior to conducting the study.

2.3. Measures

The 50-item set of Big Five factor markers from the International Personality Item Pool (Goldberg, 1992) was used to assess Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability. Each scale included 10 items. All items were measured on a 5-point Likert scale (1 = "Very inaccurate" to 5 = "Very accurate"). Sample items include "I spend time reflecting on things" (Openness to Experience), "I pay attention to details" (Conscientiousness), "I talk to a lot of different people at parties" (Extraversion), "I take time out for others" (Agreeableness), and "I get stressed out easily" (Emotional Stability). Cronbach's alpha coefficients for Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability in the combined sample were .79, .81, .90, .83, and .88, respectively.
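For readers who wish to reproduce the internal-consistency estimates, a minimal Python sketch of Cronbach's alpha for one 10-item scale is shown below. The DataFrame and item column names are hypothetical stand-ins, and reverse-keyed items are assumed to have been recoded beforehand.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of item columns (rows = respondents)."""
    items = items.dropna()
    k = items.shape[1]                          # number of items (10 per scale here)
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the scale total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical usage: 'df' holds one row per respondent, columns E1..E10 for the Extraversion items.
# alpha_extraversion = cronbach_alpha(df[[f"E{i}" for i in range(1, 11)]])
```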

2.4. Statistical analyses

To test our hypotheses, we conducted statistical analyses in three steps using EQS for Windows, version 6.1 (Bentler, 2004): (a) within-group tests of measurement models; (b) between-group tests of ME; and (c) tests of latent and observed mean differences (Byrne, 2006; Vandenberg & Lance, 2000). A more detailed description of the analyses is presented below.

2.4.1. Within-group tests of measurement model

Prior to testing ME, the fit of a hypothesized one-factor measurement model for each of the Big Five personality scales was tested using confirmatory factor analysis (CFA) within each condition. Model fit for the CFAs was evaluated with two commonly used goodness-of-fit indices, which have been found to be robust in analyses of small samples: the comparative fit index (CFI) and the root mean square error of approximation (RMSEA; Byrne, 2006; Vandenberg & Lance, 2000). Ranging from 0 to 1, a CFI value of .90 or higher was considered acceptable, and a value greater than .95 was desired (Hu & Bentler, 1999). RMSEA values between .06 and .08 suggested acceptable fit, while values of .05 and below indicated good fit (Hu & Bentler). The chi-square test statistic, which has traditionally been used as an indicator of model fit, "provides a highly sensitive statistical test, but not a practical test, of model fit" (Cheung & Rensvold, 2002, p. 234). Thus, chi-square statistics, although reported for the CFAs, were not interpreted in relation to model fit. In cases when the indices indicated poor fit of the model to the data, misfitting parameters were identified through the multivariate Lagrange Multiplier (LM) test (Byrne, 2006).

Table 1
Measurement equivalence and latent mean difference tests.

Model number  Model type                    Model constraints   Model interpretation
1             Configural equivalence        None                Pattern of fixed and estimated parameters across conditions is equal
2             Metric equivalence            Factor loadings     Factor-item relationships across conditions are equal
3             Error covariance equivalence  Error covariances   Error covariances across conditions are equal
4             Scalar equivalence            Item intercepts     Response bias across conditions is equal
5             Invariant uniqueness          Error variances     Indicators are measured with the same error
6             Latent mean differences       Latent means        Underlying means of constructs are equal across conditions
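The article's CFAs were run in EQS 6.1 with the Satorra-Bentler correction. Purely as an illustration, and not a reproduction of the authors' software or settings, the sketch below fits an analogous one-factor model for a single scale in Python with the semopy package, using default maximum likelihood (no robust correction) and hypothetical item column names.

```python
import pandas as pd
import semopy

# Hypothetical data: one row per respondent in a single condition,
# columns C1..C10 holding the ten Conscientiousness items.
df = pd.read_csv("conscientiousness_cbp.csv")  # placeholder file name

# One-factor measurement model in lavaan-style syntax.
model_desc = "Conscientiousness =~ " + " + ".join(f"C{i}" for i in range(1, 11))

model = semopy.Model(model_desc)
model.fit(df)

# Fit indices (CFI and RMSEA are among the reported statistics;
# exact column labels may vary across semopy versions).
print(semopy.calc_stats(model).T)
```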

2.4.2. Between-group tests of measurement equivalence

Once adequate fit was found for the measurement models within each condition, a sequence of increasingly restrictive tests of ME, depicted in Table 1, was performed on three paired conditions (i.e., PPP vs. CBP, PPP vs. CBN, and CBP vs. CBN), following recommendations from the literature (Ployhart & Oswald, 2004; Vandenberg & Lance, 2000). In these analyses, each model added additional equality constraints and was tested against a less constrained model.

Configural equivalence, the first ME test, ascertained that the number of factors and the pattern of factor loadings were the same by constraining them across the paired conditions. This test examined whether respondents in different administration conditions associated the same items with the same constructs or latent variables. Support for configural equivalence would imply that participants across the pairs used the same frame-of-reference when completing the personality measure.

Finding support for configural equivalence justified a test of metric equivalence, conducted by constraining the factor loadings of like items to be equal across the paired conditions. As suggested by Cheung and Rensvold (2002), metric non-equivalence was indicated if the constrained model fit deteriorated substantially compared to the non-constrained (i.e., configural) model, i.e., when the change in CFI (ΔCFI) was greater than .01. Finding evidence of metric equivalence would suggest that participants across administration conditions were adopting the same metric to respond to the items. While tests of configural and metric equivalence are considered weak forms of equivalence, tests of scalar equivalence and invariant uniqueness are often considered overly stringent (Vandenberg & Lance, 2000).
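The ΔCFI decision rule applied at every step of these comparisons is simple enough to state in a few lines of code. The helper below is purely illustrative (the actual comparisons were carried out in EQS) and flags non-equivalence when CFI drops by more than .01 relative to the less constrained model.

```python
def delta_cfi(cfi_constrained: float, cfi_less_constrained: float) -> float:
    """Signed change in CFI when equality constraints are added."""
    return cfi_constrained - cfi_less_constrained

def is_non_equivalent(cfi_constrained: float, cfi_less_constrained: float,
                      criterion: float = .01) -> bool:
    """Flag non-equivalence when fit deteriorates by more than the criterion
    (Cheung & Rensvold, 2002)."""
    return delta_cfi(cfi_constrained, cfi_less_constrained) < -criterion

# Values from Table 3, CBP vs. CBN, metric vs. configural models:
print(is_non_equivalent(.929, .946))   # True  (Conscientiousness, ΔCFI = -.017)
print(is_non_equivalent(.968, .969))   # False (Openness to Experience, ΔCFI = -.001)
```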

Support for metric equivalence allowed us to proceed with examining error covariance equivalence, if needed. This test was conducted if commonly specified covariances were estimated while testing the measurement model across paired conditions (Byrne, 2006). Here, in addition to constraining loadings, similar error covariances that were common to scales across the paired conditions were constrained to be equal. As with metric equivalence, a ΔCFI greater than .01 between the fit of the newly constrained model (i.e., error covariance) and the configural model indicated non-equivalence. Support for this test would allow us to continue with testing for scalar equivalence.

Tests of scalar equivalence (or the equivalence of item intercepts) were conducted by constraining the intercepts of like items to be equal across the paired conditions, in addition to equal factor loadings and error covariances. A ΔCFI larger than .01 between the fit of the scalar model and the error covariance model would indicate non-equivalence. Establishing scalar equivalence ensured that there were no systematic differences in response bias across conditions.

Lastly, invariant uniqueness was assessed if evidence was found for scalar equivalence, by constraining error variances to be equal for like items across the paired conditions, in addition to factor loadings, error covariances, and item intercepts (Vandenberg & Lance, 2000). This test assessed whether the items were measuring the underlying constructs with the same degree of error across conditions (Cheung & Rensvold, 2002). The invariant uniqueness model was again compared to the error covariance model, with a ΔCFI value greater than .01 indicating lack of invariant uniqueness. Establishing invariant uniqueness would imply that the personality scales produced equally reliable scores across administration conditions and could be compared meaningfully at the observed score level (Vandenberg & Lance, 2000). Finding support for configural, metric, scalar, and invariant uniqueness ME across conditions would represent a strict form of ME (Meredith & Teresi, 2006).

Table 2
Within-group tests of 1-factor model fit for PPP, CBP, and CBN conditions.

Factor                     χ²        df    CFI    RMSEA   90% CI of RMSEA
PPP (n = 112)
  Openness (a)             45.12*    31    .953   .064    [.005, .102]
  Conscientiousness (b)    48.24*    30    .924   .075    [.031, .112]
  Extraversion (c)         37.48     32    .987   .040    [.000, .084]
  Agreeableness (d)        37.66     34    .980   .031    [.000, .078]
  Emotional Stability (e)  46.46*    32    .960   .064    [.007, .101]
CBP (n = 118)
  Openness (f)             31.56     30    .995   .021    [.000, .074]
  Conscientiousness (g)    44.33     32    .960   .057    [.000, .095]
  Extraversion (h)         46.66     33    .978   .059    [.000, .096]
  Agreeableness (i)        41.69     33    .971   .047    [.000, .087]
  Emotional Stability (j)  43.46     32    .981   .055    [.000, .093]
CBN (n = 171)
  Openness (k)             52.15*    31    .955   .063    [.031, .092]
  Conscientiousness (l)    57.11**   34    .932   .063    [.033, .091]
  Extraversion (m)         62.57**   33    .956   .073    [.044, .100]
  Agreeableness (n)        48.81*    30    .943   .061    [.026, .091]
  Emotional Stability (o)  60.47**   31    .962   .075    [.046, .102]

Note. PPP = paper-and-pencil proctored; CBP = computer-based proctored; CBN = computer-based non-proctored; CFI = comparative fit index; RMSEA = root mean square error of approximation; CI = confidence interval. Statistics are based on the Satorra-Bentler correction with robust standard errors.
(a) Covariances for E8E1, E6E3, E7E2, and E4E2 freely estimated. (b) Covariances for E10E3, E6E2, E4E2, E5E2, and E9E5 freely estimated. (c) Covariances for E9E8, E7E1, and E6E2 freely estimated. (d) Covariance for E7E2 freely estimated. (e) Covariances for E8E7, E3E1, and E10E4 freely estimated. (f) Covariances for E8E1, E6E3, E4E2, E2E1, and E7E2 freely estimated. (g) Covariances for E6E2, E7E2, and E10E7 freely estimated. (h) Covariances for E6E2 and E7E1 freely estimated. (i) Covariances for E7E2 and E7E5 freely estimated. (j) Covariances for E8E7, E3E1, and E10E8 freely estimated. (k) Covariances for E8E1, E6E3, E4E2, and E10E3 freely estimated. (l) Covariance for E6E2 freely estimated. (m) Covariances for E7E1 and E6E2 freely estimated. (n) Covariances for E7E2, E7E5, E8E3, E7E1, and E6E5 freely estimated. (o) Covariances for E8E7, E9E6, E10E4, and E3E1 freely estimated.
* p < .05. ** p < .01.

2.4.3. Latent and observed mean differences

Once scalar equivalence was established across the three paired conditions, latent mean differences were assessed by fixing the factor intercept for one of the two paired administration conditions to zero and using z-tests to determine whether the other intercept was different from zero. In addition, we assessed between-condition differences in observed mean scores on the five personality scales using independent-samples t-tests.
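In conventional notation (our own summary, not taken from the article), the latent and observed mean comparisons for a pair of conditions can be written as follows, where condition 2 serves as the reference group and the effect size is assumed to use the pooled standard deviation.

```latex
\begin{aligned}
\text{Latent means:}\quad & \kappa^{(2)} = 0 \text{ (reference)},\qquad
  z = \frac{\hat{\kappa}^{(1)}}{SE\!\left(\hat{\kappa}^{(1)}\right)} \\[6pt]
\text{Observed means:}\quad & t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}}},\qquad
  s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}},\qquad
  d = \frac{\left|\bar{X}_1 - \bar{X}_2\right|}{s_p}
\end{aligned}
```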


3. Results

3.1. Preliminary results

In order to check the success of the randomization procedure, participants in the PPP, CBP, and CBN conditions were compared with respect to demographic and background characteristics by means of ANOVA and chi-square tests. No significant differences across the three samples were found for age (F(2, 394) = 1.14, p = .32), gender (χ²(2) = 3.62, p = .16), ethnicity (i.e., Caucasian vs. Non-Caucasian; χ²(2) = 2.55, p = .28), student status (i.e., freshman vs. sophomore vs. junior vs. senior; χ²(6) = 2.89, p = .82), or work status (i.e., working vs. non-working; χ²(2) = 1.64, p = .44).
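A minimal sketch of such a randomization check in Python with scipy is shown below; the DataFrame and column names are hypothetical stand-ins for the study's demographic variables.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: one row per participant, with 'condition' in {"PPP", "CBP", "CBN"}.
df = pd.read_csv("participants.csv")  # placeholder file name

# One-way ANOVA on age across the three conditions.
age_groups = [g["age"].dropna() for _, g in df.groupby("condition")]
f_stat, p_age = stats.f_oneway(*age_groups)

# Chi-square test of independence for gender by condition.
gender_table = pd.crosstab(df["condition"], df["gender"])
chi2, p_gender, dof, _ = stats.chi2_contingency(gender_table)

print(f"Age: F = {f_stat:.2f}, p = {p_age:.2f}")
print(f"Gender: chi2({dof}) = {chi2:.2f}, p = {p_gender:.2f}")
```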

Examination of the Cronbach's alpha coefficients within each condition indicated that all scales were above .70, as recommended by Nunnally and Bernstein (1994). The average interscale correlation coefficients (estimated via Fisher's z transformation) for the overall sample and the PPP, CBP, and CBN subsamples were .19, .13, .25, and .20, respectively.
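Averaging correlations via Fisher's z transformation, as done for the interscale correlations above, amounts to transforming each coefficient, averaging, and back-transforming. A minimal sketch follows; the correlation values in the usage line are made up purely for illustration.

```python
import numpy as np

def average_correlation(rs) -> float:
    """Average a set of correlation coefficients via Fisher's z transformation."""
    z = np.arctanh(np.asarray(rs, dtype=float))  # r -> z
    return float(np.tanh(z.mean()))              # mean z -> back-transform to r

# Illustrative only: the ten pairwise correlations among the five scales in one subsample.
print(round(average_correlation([.12, .08, .21, .15, .10, .18, .09, .14, .22, .11]), 2))
```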

3.2. Results for within-group tests of measurement model

First, a one-factor measurement model was tested for each of the five scales within each condition (i.e., PPP, CBP, and CBN) in order to establish baseline models. Due to significant multivariate non-normality of the data (i.e., Mardia's coefficient values greater than 5), the Satorra-Bentler correction was applied to all analyses in order to provide less biased estimates (Byrne, 2006). Additionally, three cases from the CBN condition that were determined to be multivariate outliers were deleted from the subsequent analyses. Initial CFAs for all five personality scales showed inadequate fit of the hypothesized one-factor models.

Next, based on the LM test and similarity in item content, error covariances were freely estimated for items with similar content to improve model fit. For the Openness to Experience scale, four covariances were estimated for the PPP condition, five for the CBP condition, and four for the CBN condition. Similarly, the covariances estimated for the PPP, CBP, and CBN conditions for the Conscientiousness scale were five, three, and one, respectively. Respecified covariances for the Extraversion scale were three, two, and two, respectively, for the PPP, CBP, and CBN conditions. For the Agreeableness scale, one, two, and five covariances were estimated for the PPP, CBP, and CBN conditions, respectively. Lastly, three covariances were estimated for the PPP condition, three for the CBP condition, and four for the CBN condition for the Emotional Stability scale.

Model fit statistics for the respecified CFA models are presented in Table 2. The respecified one-factor models provided acceptable fit to the data for the five scales across the three conditions. Specifically, CFI values for the PPP condition ranged from .924 to .987 and RMSEA values ranged from .031 to .075. For the CBP condition, CFI values ranged from .960 to .995, while RMSEA values ranged from .021 to .059. Lastly, the ranges of CFI and RMSEA values for the CBN condition were from .932 to .962 and from .061 to .075, respectively. Given these findings, we proceeded to examine the ME of the respecified one-factor models across the three groups.

3.3. Results for between-group tests of measurement equivalence

3.3.1. Configural equivalence results

The respecified one-factor models served as baseline models for the subsequent ME tests, the results of which are presented in Table 3. Results for the configural equivalence model for the PPP vs. CBP comparison, based on CFI and RMSEA values, indicated adequate fit for all five scales: Openness to Experience (CFI = .974, RMSEA = .047), Conscientiousness (CFI = .945, RMSEA = .066), Extraversion (CFI = .982, RMSEA = .051), Agreeableness (CFI = .974, RMSEA = .040), and Emotional Stability (CFI = .973, RMSEA = .060).

Table 3
Between-group tests of measurement equivalence for personality scales across PPP, CBP, and CBN conditions.

Each comparison block reports CFI, ΔCFI, RMSEA, and the 90% CI for RMSEA.

Model  ME test               Model comparison   PPP (a) vs. CBP (b)               PPP vs. CBN (c)                   CBP vs. CBN

Openness to Experience
1      Configural            –                  .974   –      .047 [.000, .077]   .953   –      .064 [.038, .087]   .969   –      .052 [.019, .077]
2      Metric                2-1                .980   .006   .039 [.000, .069]   .943   -.010  .066 [.042, .087]   .968   -.001  .049 [.016, .073]
3      Error covariance      3-1                .986   .012   .032 [.000, .064]   .946   -.007  .063 [.039, .084]   .970   .001   .047 [.013, .071]
4      Scalar                4-3                .985   -.001  .032 [.000, .065]   .945   -.001  .064 [.040, .085]   .968   -.002  .049 [.017, .072]
5      Invariant uniqueness  5-3                1.000  .014   .003 [.000, .052]   .953   .007   .055 [.030, .076]   .975   .005   .040 [.000, .064]

Conscientiousness
1      Configural            –                  .945   –      .066 [.034, .092]   .929   –      .068 [.044, .090]   .946   –      .061 [.035, .083]
2      Metric                2-1                .947   .002   .060 [.029, .086]   .923   -.006  .067 [.044, .088]   .929   -.017  .066 [.043, .086]
3      Partial metric (d)    3-1                –      –      –                   –      –      –                   .937   -.009  .063 [.039, .084]
4      Error covariance      4-1                .948   .003   .059 [.027, .085]   .924   -.005  .066 [.042, .087]   .938   -.008  .062 [.038, .083]
5      Scalar                5-4                .946   -.002  .061 [.029, .086]   .923   -.001  .067 [.043, .087]   .936   -.002  .063 [.039, .084]
6      Invariant uniqueness  6-4                .942   -.006  .060 [.030, .083]   .923   -.001  .063 [.041, .083]   .945   .007   .055 [.030, .075]

Extraversion
1      Configural            –                  .982   –      .051 [.000, .079]   .969   –      .062 [.035, .085]   .967   –      .068 [.044, .089]
2      Metric                2-1                .983   .001   .047 [.000, .075]   .966   -.003  .061 [.036, .082]   .966   -.001  .064 [.041, .085]
3      Error covariance      3-1                .984   .002   .044 [.000, .072]   .967   -.002  .059 [.034, .081]   .966   -.001  .063 [.040, .083]
4      Scalar                4-3                .982   -.002  .047 [.000, .075]   .964   -.003  .061 [.037, .082]   .964   -.002  .066 [.043, .086]
5      Invariant uniqueness  5-3                .982   -.002  .046 [.000, .072]   .963   -.004  .058 [.035, .079]   .970   .004   .056 [.032, .076]

Agreeableness
1      Configural            –                  .974   –      .040 [.000, .071]   .957   –      .049 [.011, .074]   .957   –      .055 [.025, .079]
2      Metric                2-1                .976   .002   .036 [.000, .067]   .959   .002   .044 [.000, .069]   .962   .005   .048 [.015, .072]
3      Error covariance      3-1                .979   .005   .034 [.000, .065]   .962   .005   .042 [.000, .067]   .965   .008   .045 [.007, .069]
4      Scalar                4-3                .973   -.006  .039 [.000, .068]   .958   -.004  .046 [.006, .070]   .963   -.002  .047 [.014, .071]
5      Invariant uniqueness  5-3                .971   -.008  .038 [.000, .066]   .958   -.004  .043 [.000, .066]   .966   .001   .042 [.000, .065]

Emotional Stability
1      Configural            –                  .973   –      .060 [.025, .086]   .960   –      .071 [.047, .093]   .970   –      .067 [.043, .089]
2      Metric                2-1                .978   .005   .051 [.006, .078]   .960   .000   .066 [.043, .087]   .969   -.001  .064 [.040, .085]
3      Error covariance      3-1                .979   .006   .049 [.000, .076]   .960   .000   .065 [.042, .086]   .970   .000   .062 [.039, .083]
4      Scalar                4-3                .977   -.002  .051 [.011, .078]   .960   .000   .067 [.044, .087]   .969   -.001  .064 [.041, .085]
5      Invariant uniqueness  5-3                .980   .001   .045 [.000, .072]   .963   .003   .060 [.037, .080]   .971   .001   .058 [.034, .078]

Note. ME test = measurement equivalence test; CFI = comparative fit index; RMSEA = root mean square error of approximation; CI = confidence interval. Statistics are based on the Satorra-Bentler correction with robust standard errors.
(a) n = 112. (b) n = 118. (c) n = 171. (d) Constraints on two non-equivalent loadings released for the CBP vs. CBN comparison.


For the PPP vs. CBN comparison, results similarly showed acceptable fit for all five scales: Openness to Experience (CFI = .953, RMSEA = .064), Conscientiousness (CFI = .929, RMSEA = .068), Extraversion (CFI = .969, RMSEA = .062), Agreeableness (CFI = .957, RMSEA = .049), and Emotional Stability (CFI = .960, RMSEA = .071). Finally, the CFI and RMSEA statistics were within the recommended range for the CBP vs. CBN comparison: Openness to Experience (CFI = .969, RMSEA = .052), Conscientiousness (CFI = .946, RMSEA = .061), Extraversion (CFI = .967, RMSEA = .068), Agreeableness (CFI = .957, RMSEA = .055), and Emotional Stability (CFI = .970, RMSEA = .067). In sum, our results indicated support for configural equivalence of the five personality scales across administration conditions.

3.3.2. Metric equivalence results

Given the above support for configural equivalence, we proceeded to examine metric equivalence. In the PPP vs. CBP comparison, ΔCFI values indicated no substantial decrease in fit for any of the five scales after constraining the factor loadings to be equal across the two administration conditions (see Table 3): Openness to Experience (ΔCFI = .006), Conscientiousness (ΔCFI = .002), Extraversion (ΔCFI = .001), Agreeableness (ΔCFI = .002), and Emotional Stability (ΔCFI = .005). Results for the PPP vs. CBN comparison also suggested that, compared to the configural model, the imposed factor loading constraints in the metric model did not result in a marked decline in fit for any of the scales: Conscientiousness (ΔCFI = -.006), Extraversion (ΔCFI = -.003), Agreeableness (ΔCFI = .002), and Emotional Stability (ΔCFI = .000). Although the deterioration in fit for Openness to Experience (ΔCFI = -.010) was at the value specified by Cheung and Rensvold (2002), it did not exceed the criterion. Therefore, all equality constraints were retained for further analyses for this scale.


Lastly, results for the CBP vs. CBN comparison showed that model fit did not decrease substantially in the metric model compared to the configural model for Openness to Experience (ΔCFI = -.001), Extraversion (ΔCFI = -.001), Agreeableness (ΔCFI = .005), and Emotional Stability (ΔCFI = -.001). However, a more pronounced decrease in fit was observed for the Conscientiousness scale (ΔCFI = -.017), prompting us to explore partial metric equivalence (Byrne, Shavelson, & Muthén, 1989). An examination of the modification indices suggested non-equivalent loadings for two items (i.e., "I leave my belongings around" and "I make a mess of things"). After relaxing the equivalence constraints on the two non-equivalent slope parameters, the partial metric equivalence model showed fit similar to the configural model (ΔCFI = -.009). Overall, with the exception of Conscientiousness in the CBP vs. CBN comparison, support for metric equivalence was established.

3.3.3. Error covariance equivalence results

Given that a number of error covariances were specified in the baseline models, tests for error covariance equivalence were performed once metric (or partial metric) equivalence was established (see Table 3). Results for the error covariance model for the PPP vs. CBP comparison indicated that model fit did not show a substantial decrease compared to the configural model, as evidenced by ΔCFI values smaller than .01: Conscientiousness (ΔCFI = .003), Extraversion (ΔCFI = .002), Agreeableness (ΔCFI = .005), and Emotional Stability (ΔCFI = .006). However, a more pronounced decrease in model fit was detected for Openness to Experience (ΔCFI = .012). Interestingly, a closer evaluation of the modification indices revealed no non-equivalent error covariances.

Model comparison statistics (i.e., ΔCFI) did not show a substantial decline in fit of the error covariance model in either the PPP vs. CBN or the CBP vs. CBN comparison: Openness to Experience (ΔCFI = -.007 and .001, respectively), Conscientiousness (ΔCFI = -.005 and -.008, respectively), Extraversion (ΔCFI = -.002 and -.001, respectively), Agreeableness (ΔCFI = .005 and .008, respectively), and Emotional Stability (ΔCFI = .000 and .000, respectively). This indicated strong support for the error covariance equivalence model.

3.3.4. Scalar equivalence results

Results from the scalar equivalence tests revealed that the fit of the scalar model did not deteriorate in comparison to the less constrained error covariance model for all three paired comparisons, as displayed in Table 3. In the PPP vs. CBP comparison, ΔCFI values for the five personality scales were as follows: Openness to Experience (ΔCFI = -.001), Conscientiousness (ΔCFI = -.002), Extraversion (ΔCFI = -.002), Agreeableness (ΔCFI = -.006), and Emotional Stability (ΔCFI = -.002). When comparing the PPP and CBN administration conditions, all ΔCFI values were smaller than .01, indicating lack of model fit deterioration: Openness to Experience (ΔCFI = -.001), Conscientiousness (ΔCFI = -.001), Extraversion (ΔCFI = -.003), Agreeableness (ΔCFI = -.004), and Emotional Stability (ΔCFI = .000). Finally, the scalar equivalence model fit for each of the five scales did not differ substantially from the less constrained model in the CBP vs. CBN comparison: Openness to Experience (ΔCFI = -.002), Conscientiousness (ΔCFI = -.002), Extraversion (ΔCFI = -.002), Agreeableness (ΔCFI = -.002), and Emotional Stability (ΔCFI = -.001). Overall, the above results supported scalar equivalence of the examined personality scores across the three administration conditions.

Table 4
Descriptive statistics, latent, and observed mean comparisons for the Big Five scales across PPP, CBP, and CBN conditions.

                      PPP (n = 112) (a)                  CBP (n = 118) (b)                  CBN (n = 171) (c)
Scale                 M      SD     z      t      d      M      SD     z      t      d      M      SD     z       t       d
Openness              38.08  6.29   .21    0.36   0.05   37.79  5.86   .74    0.75   0.09   37.27  5.68   .81     1.12    0.14
Conscientiousness     35.60  6.21   .22    0.50   0.07   35.16  7.02   .07    -0.01  0.00   35.17  6.46   .36     0.55    0.07
Extraversion          31.43  8.65   -.01   -0.22  0.03   31.69  9.39   .78    1.00   0.12   30.63  7.83   .83     0.80    0.10
Agreeableness         39.61  6.24   .08    0.45   0.06   39.22  6.91   1.06   1.14   0.13   38.36  5.85   1.44    1.71    0.21
Emotional Stability   30.84  7.86   -1.14  -1.29  0.17   32.29  9.07   -.79   -0.65  0.08   32.94  8.01   -2.22*  -2.18*  0.26

Note. M = mean; SD = standard deviation; z = z-test for latent mean differences; t = t-test for observed mean differences; d = difference between observed means in standard deviation units (effect size).
(a) Statistical comparisons between PPP and CBP conditions. (b) Statistical comparisons between CBP and CBN conditions. (c) Statistical comparisons between PPP and CBN conditions.
* p < .05.


3.3.5. Invariant uniqueness results

Since the scalar equivalence tests indicated equivalent intercepts across the three administration conditions, we proceeded with testing the ME of error variances, or invariant uniqueness. Our results indicated that the drop in model fit after constraining the error variances was negligible in all three comparisons, thus providing evidence for invariant uniqueness (see Table 3). More specifically, ΔCFI values for each of the five scales were within range (i.e., less than .01) in the PPP vs. CBP, PPP vs. CBN, and CBP vs. CBN comparisons: Openness to Experience (ΔCFI = .014, .007, and .005, respectively), Conscientiousness (ΔCFI = -.006, -.001, and .007, respectively), Extraversion (ΔCFI = -.002, -.004, and .004, respectively), Agreeableness (ΔCFI = -.008, -.004, and .001, respectively), and Emotional Stability (ΔCFI = .001, .003, and .001, respectively).

3.4. Latent and observed mean differences

Finding support for scalar equivalence allowed us to test latent mean differences by constraining the latent mean values of the reference conditions in each comparison to zero (see Table 4). The reference conditions were as follows: the CBP condition in the PPP vs. CBP comparison, the CBN condition in the PPP vs. CBN comparison, and the CBN condition in the CBP vs. CBN comparison.

Results indicated no significant latent mean differences for the five personality scales between the PPP and CBP conditions: Openness to Experience (z = .21, p = n.s.), Conscientiousness (z = .22, p = n.s.), Extraversion (z = -.01, p = n.s.), Agreeableness (z = .08, p = n.s.), and Emotional Stability (z = -1.14, p = n.s.). Similarly, no significant latent mean differences were found in the PPP vs. CBN comparison for four of the five scales: Openness to Experience (z = .81, p = n.s.), Conscientiousness (z = .36, p = n.s.), Extraversion (z = .83, p = n.s.), and Agreeableness (z = 1.44, p = n.s.). However, the latent means for Emotional Stability differed significantly (z = -2.22, p < .05), with participants in the PPP condition exhibiting lower levels of Emotional Stability compared to participants in the CBN condition. Lastly, latent means in the CBP vs. CBN comparison were equal for Openness to Experience (z = .74, p = n.s.), Conscientiousness (z = .07, p = n.s.), Extraversion (z = .78, p = n.s.), Agreeableness (z = 1.06, p = n.s.), and Emotional Stability (z = -.79, p = n.s.).


Observed mean difference results from the t-tests, which are also displayed in Table 4, were consistent with the results from the latent mean difference tests. No significant observed mean differences were found for the personality scales across the three paired comparisons, except for Emotional Stability in the PPP vs. CBN comparison, t(281) = -2.18, p = .03. Participants in the PPP condition reported lower mean scores on Emotional Stability compared to their counterparts in the CBN condition.
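As a quick check on this comparison, the observed-score t statistic and effect size can be recovered from the summary statistics in Table 4. The sketch below assumes a pooled-variance (Student's) t-test and a pooled-SD standardized mean difference, and it reproduces the reported values for the PPP vs. CBN Emotional Stability comparison to within rounding.

```python
from math import sqrt

def t_and_d_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Pooled-variance t statistic and standardized mean difference (Cohen's d)
    computed from group means, SDs, and sample sizes."""
    sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))  # pooled SD
    t = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
    d = abs(m1 - m2) / sp
    return t, d

# Emotional Stability: PPP (M = 30.84, SD = 7.86, n = 112) vs. CBN (M = 32.94, SD = 8.01, n = 171).
t, d = t_and_d_from_summary(30.84, 7.86, 112, 32.94, 8.01, 171)
print(round(t, 2), round(d, 2))   # approximately -2.17 and 0.26 (Table 4 reports -2.18 and 0.26)
```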

4. Discussion

Due to the increased use of technology in psychological test administration, there has been growing interest in establishing the measurement equivalence (ME) of psychological measures administered in computer-based (CB) and paper-and-pencil (PP) formats (Donovan et al., 2000). However, most of the research examining ME across different administration media has focused on cognitive measures (Mead & Drasgow, 1993). Furthermore, little is known about the effects of proctoring on the ME of psychological measures. The scant research that has attempted to investigate the ME of personality measures has mostly failed to use random assignment or to ensure complete anonymity of participants. Additionally, most research has examined weaker (i.e., configural and metric), rather than stricter (i.e., scalar and invariant uniqueness), forms of ME across different conditions, with mixed results (e.g., Davies & Wadlington, 2006).

To close these gaps, this study investigated the effects of media and proctoring on the ME and latent and observed mean levels of scores on five personality scales measuring Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Emotional Stability. Specifically, administration media effects were tested by comparing personality responses of participants randomly assigned to two administration conditions: paper-and-pencil proctored (PPP) and computer-based proctored (CBP). Proctoring effects were examined by comparing randomized responses across the CBP and computer-based non-proctored (CBN) administration conditions.

Overall, the results of this study indicated that, despite differences in administration media and proctoring, participants responded to the items on the personality scales in a similar manner. In support of Hypothesis 1, the results provided evidence for strict ME of personality scores across the PPP and CBP administration conditions, suggesting that, when controlling for proctoring effects, type of media (i.e., computer vs. paper) had no detectable effects on the frame-of-reference with which participants responded to the personality items (i.e., configural equivalence), on how the scale metrics were interpreted and used (i.e., metric equivalence), or on the bias and reliability of responses (i.e., scalar equivalence and invariant uniqueness, respectively). Additionally, tests of latent and observed means found no significant differences between the personality scale scores across the PPP and CBP conditions, lending support to Hypothesis 2.

In sum, the five personality scales examined in this study produced scores that were equivalent in terms of psychometric properties and mean levels regardless of the administration media. These findings are consistent with previous studies that have examined ME across PP and CB formats. More precisely, our study corroborates the majority of previous studies reporting ME of personality measures across formats of test administration (i.e., PP vs. CB; Chuah et al., 2006; Vecchione et al., 2012). At the same time, the findings of our study are discrepant with those of Hirai et al. (2011), who reported non-equivalence across PP and CB conditions. However, the inconsistent results may be due to differences in the measures used by the two studies. While the current study utilized a personality measure, Hirai et al. employed scales of social interaction anxiety and social phobia in which participants are asked to divulge sensitive information. It is possible that, due to the sensitive nature of the scales used by Hirai et al., participants in the PP condition were less willing to disclose information, thus leading to non-equivalence across the two conditions.

With respect to proctoring effects, our findings revealed that scores on four of the five personality scales (Conscientiousness was the exception) demonstrated strict ME when responses were compared across proctored (i.e., CBP) and non-proctored (i.e., CBN) administration conditions, while controlling for media effects, thus partially supporting Hypothesis 3. This means that participants in proctored and non-proctored conditions used a similar frame-of-reference and similar metrics when responding to personality items, and provided responses that were similarly unbiased and reliable.

The Conscientiousness scale demonstrated partial metric equivalence, as evidenced by two non-equivalent loadings, suggesting that participants used the response metric of two items ("I leave my belongings around" and "I make a mess of things") differently in the proctored vs. non-proctored administration conditions. In other words, these two items appeared to function differently in terms of the degree to which they were considered indicative of the Conscientiousness construct across the CBP and CBN conditions. Specifically, the two non-equivalent items loaded more highly on the Conscientiousness factor in the CBP condition than in the CBN condition. Given the large number of comparisons and the fact that only two out of the 50 tested loadings emerged as non-equivalent across proctoring conditions, future research should replicate these results to determine whether the non-equivalent loadings were specific to our samples or represent a valid departure from metric equivalence deserving further attention. With regard to latent and observed mean difference tests across the CBP and CBN conditions, results indicated no proctoring effect on the mean levels of the five personality scale scores, thereby providing evidence for Hypothesis 4. These results are consistent with previous research that has demonstrated equivalence across proctored and non-proctored conditions (Davies & Wadlington, 2006).

Additional results also indicated strict ME between responses in the PPP and CBN conditions, further strengthening the evidence supporting the comparability of responses to personality measures across different administration media and proctoring conditions. In addition, with the exception of Emotional Stability scores, no significant latent or observed mean differences were found in personality scores across these two conditions. Interestingly, participants reported significantly lower latent and observed score means on the Emotional Stability scale when they completed the measure in a PP format in a proctored setting than in a CB, non-proctored format. This finding is consistent with results reported by Vecchione et al. (2012), who found a lower latent mean for Emotional Stability in their PP, non-proctored condition than in their CB, non-proctored condition. It is possible that participants in the PPP condition perceived a lower level of anonymity than those in the CBN condition (Robert et al., 2006), thus evoking a stress response. In fact, it has been suggested that individuals prefer CB assessment over PP assessment because the former alleviates stress (Engelbrecht & Harding, 2004).
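
In mean-structure (MACS) terms (a generic sketch consistent with Little, 1997, and Ployhart & Oswald, 2004, rather than a reproduction of our model syntax), once scalar equivalence holds, the model-implied item means in condition $g$ are

$$E[x_g] = \tau + \Lambda\,\kappa_g,$$

so fixing the latent mean of a reference condition to zero (e.g., $\kappa_{PPP} = 0$) allows the freely estimated $\kappa_{CBN}$ to be interpreted directly as a latent mean difference; this is the sense in which the Emotional Stability difference reported above can be read.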

4.1. Implications for research and practice

The results obtained in this study have important implications for both researchers and practitioners. Support for strict ME suggests that the personality measures under study were mostly interpreted similarly by respondents across the three administration conditions and were not affected by response biases (Vandenberg & Lance, 2000). Given that group comparisons are meaningless unless ME is established (Little, 1997), the current results add to the growing body of research providing empirical justification for researchers to compare or combine data collected with personality measures in general, and with the Big Five factor markers (Goldberg, 1992) in particular, using traditional PP and increasingly popular CB formats. Additionally, for research purposes, it appears that data can be collected both in the presence and in the absence of proctors without substantially affecting the psychometric properties of the personality scales.

The findings of this study also have implications for organizational assessment practices. Rather than relying solely on PP personality measures for recruitment and selection purposes, organizations can administer CB tests in non-proctored settings, thus enlarging their applicant pool and reducing costs (Tippins et al., 2006). Because proctoring showed a significant effect on the metric equivalence of scores on the Conscientiousness scale, which is often used for selection purposes (Roberts, Chernyshenko, Stark, & Goldberg, 2005), practitioners should be wary of comparing responses on this or similar scales across proctored and non-proctored conditions without first establishing ME. Furthermore, results for the Emotional Stability scale indicated true mean differences, with lower levels exhibited in the more structured PPP condition than in the CBN condition. This finding may be pertinent to organizations that consider and assess Emotional Stability when making staffing decisions.

4.2. Strengths, limitations, and future directions

The present study makes several important contributions to the personality measurement literature. It represents a first attempt to simultaneously investigate media and proctoring effects on the ME and mean levels of scores on five personality scales using strict tests of ME (Vandenberg & Lance, 2000). Although the test of scalar equivalence has been considered sufficient to provide evidence of strong ME (Meredith & Teresi, 2006), the present study went a step further to establish invariant uniqueness of scores, also considered strict equivalence (Meredith, 1993). While evidence of scalar equivalence allows researchers to perform tests of latent mean differences, it is uncommon for practitioners (and even researchers) to compare personality mean scores at the latent level. Instead, it is common practice to test for observed mean differences (Gregorich, 2006), which is justified only after invariant uniqueness has been established, that is, after showing similar unreliability across conditions or groups (Vandenberg & Lance, 2000). In addition, our study employed random assignment to administration conditions and an anonymous protocol, which ensured equivalence in demographic and background characteristics across the three conditions, whereas the majority of past research has relied on convenience sampling to collect data (e.g., Davidov & Depner, 2011; Vecchione et al., 2012).

As with most research endeavors, the present study has a number of limitations that need to be considered when interpreting the results. First, the samples employed in the study comprised university students who participated in research in a university setting. It is possible that our findings may not generalize to working adult populations and/or to high-stakes testing environments. With respect to population generalizability, Randall and Gibson (1990) suggested that even though the use of student samples in research has received much criticism, such samples are appropriate as long as they are similar to the population of interest. Student participants in the present study represented different majors, ethnicities, and age groups, and almost 50% of the participants were employed. Further, our results are largely consistent with previous studies of working adults that found ME across different media of administration (e.g., De Beuckelaer & Lievens, 2009; Mueller, Liebig, & Hattrup, 2007).

Given that the study was carried out for research purposes, the generalizability of the results may not extend to high-stakes testing environments (e.g., selection and promotion; Berkowitz & Donnerstein, 1982). Because the study was conducted in a low-stakes setting where scores on the completed measures were not used to make important decisions about respondents, similar results may not emerge in high-stakes settings where individuals may respond in a more socially desirable manner (Chan, 2009). However, a recent study examined ME of a personality measure in a high-stakes selection setting and found scalar equivalence across PP and CB conditions (Vecchione et al., 2012). Interestingly, latent mean differences were found on four of the five scales across the two conditions. Future research should attempt to replicate the present results in high-stakes environments, such as employment or educational settings.

Second, the sample sizes for the three conditions (i.e., n = 112 for PPP, n = 118 for CBP, and n = 171 for CBN) were modest and insufficient to obtain adequate and stable solutions (Streiner, 1994), which precluded us from testing the five scales in a single five-factor model. Instead, we examined the scales individually as one-factor models. To mitigate the limitation of small sample size, a robust estimation method was used, following the recommendation of Hu, Bentler, and Kano (1992), who found that test statistics based on maximum likelihood estimation can be misleading with small samples. We encourage future efforts to replicate these results by employing larger sample sizes in each condition and testing the five-factor model simultaneously.

Lastly, the location of survey completion was not controlled in the CBN condition, and participants in this condition likely completed the survey in different locations. Buchanan and Smith (1999) have voiced concerns about the validity of non-proctored CB conditions. For instance, they asserted that extraneous factors, such as distraction and environmental stimuli, as well as non-stable attributes, such as fatigue and intoxication, may influence responses. However, despite the possible presence of these factors, the findings of the present study showed equivalence across conditions. Future studies should confirm our findings while controlling for location in the CBN condition or by assessing the effects of different CBN settings.

4.3. Conclusion

The present study sought to examine the measurement equivalence and latent and observed mean differences of the Big Five factor markers across three administration conditions: paper-and-pencil proctored, computer-based proctored, and computer-based non-proctored. Participants were university students who were randomly assigned to one of the three conditions. Overall, findings provided support for strict measurement equivalence, indicating that media and proctoring had little effect on the psychometric qualities and functioning of the measures under study. In other words, respondents tended to use the same frame-of-reference, calibrated the items to the factors similarly, and did not differ in terms of response bias and reliability when responding to the measures across the three conditions. Additionally, with the exception of one comparison, no significant latent or observed mean differences were found across the three paired conditions. These results are consistent with previous research supporting the ME of personality measures across different administration conditions. Future research with larger samples and employee populations should attempt to extend our findings to high-stakes contexts, such as employment and educational settings.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Bartram, D., & Brown, A. (2004). Online testing: Mode of administration and the stability of OPQ 32i scores. International Journal of Selection & Assessment, 12(3), 278–284.

Bentler, P. M. (2004). EQS 6 structural equations program manual. Encino, CA: Multivariate Software. (www.mvsoft.com).

Berkowitz, L., & Donnerstein, E. (1982). External validity is more than skin deep: Some answers to criticisms of laboratory experiments. American Psychologist, 37(3), 245–257.

Booth-Kewley, S., Edwards, J. E., & Rosenfeld, P. (1992). Impression management, social desirability, and computer administration of attitude questionnaires: Does the computer make a difference? Journal of Applied Psychology, 77(4), 562–566.

Booth-Kewley, S., Larson, G. E., & Miyoshi, D. K. (2007). Social desirability effects on computerized and paper-and-pencil questionnaires. Computers in Human Behavior, 23(1), 463–477.

Buchanan, T., & Smith, J. L. (1999). Using the Internet for psychological research: Personality testing on the World Wide Web. British Journal of Psychology, 90(1), 125–144.

Byrne, B. M. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456–466.

Chan, D. (2009). So why ask me? Are self-report data really that bad? In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 309–336). New York, NY: Routledge.

Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9(2), 233–255.

Chuah, S. C., Drasgow, F., & Roberts, B. W. (2006). Personality assessment: Does the medium matter? No. Journal of Research in Personality, 40(4), 359–376.

Cole, M. S., Bedeian, A. G., & Feild, H. S. (2006). The measurement equivalence of web-based and paper-and-pencil measures of transformational leadership: A multinational test. Organizational Research Methods, 9(3), 339–368.

Davidov, E., & Depner, F. (2011). Testing for measurement equivalence of human values across online and paper-and-pencil surveys. Quality & Quantity: International Journal of Methodology, 45(2), 375–390.

Davies, S. A., & Wadlington, P. L. (2006). Factor & parameter invariance of a five-factor personality test across proctored/unproctored computerized administration. Paper presented at the 21st annual conference of the Society for Industrial and Organizational Psychology, Dallas, Texas.

Davis, R. N. (1999). Web-based administration of a personality questionnaire: Comparison with traditional methods. Behavior Research Methods, Instruments & Computers, 31(4), 572–577.

De Beuckelaer, A., & Lievens, F. (2009). Measurement equivalence of paper-and-pencil and Internet organisational surveys: A large scale examination in 16 countries. Applied Psychology: An International Review, 58(2), 336–361.

Donovan, M. A., Drasgow, F., & Probst, T. M. (2000). Does computerizing paper-and-pencil job attitude scales make a difference? New IRT analyses offer insight. Journal of Applied Psychology, 85(2), 305–313.

Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous populations. Journal of Applied Psychology, 70(4), 662–680.

Engelbrecht, J., & Harding, A. (2004). Combining online and paper assessment in a web-based course in undergraduate mathematics. Journal of Computers in Mathematics and Science Teaching, 23(3), 217–231.

Evan, W. M., & Miller, J. R. (1969). Differential effects on response bias of computer vs. conventional administration of a social science questionnaire: An exploratory methodological experiment. Behavioral Science, 14(3), 216–227.

Fang, J., Wen, C., & Prybutok, V. (2013). The equivalence of Internet versus paper-based surveys in IT/IS adoption research in collectivistic cultures: The impact of satisficing. Behaviour & Information Technology, 32(5), 480–490.

Ford, B. D., Vitelli, R., & Stuckless, N. (1996). The effects of computer versus paper-and-pencil administration on measures of anger and revenge with an inmate population. Computers in Human Behavior, 12(1), 159–166.

Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4(1), 26–42.

Gregorich, S. E. (2006). Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. Medical Care, 44(11 Suppl. 3), S78–S94.

Gwaltney, C. J., Shields, A. L., & Shiffman, S. (2008). Equivalence of electronic and paper-and-pencil administration of patient-reported outcome measures: A meta-analytic review. Value in Health, 11(2), 322–333.

Hirai, M., Vernon, L. L., Clum, G. A., & Skidmore, S. T. (2011). Psychometric properties and administration measurement invariance of social phobia symptom measures: Paper-pencil vs. Internet administrations. Journal of Psychopathology and Behavioral Assessment, 33(4), 470–479.

Hofer, P. J., & Green, B. F. (1985). The challenge of competence and creativity in computerized psychological testing. Journal of Consulting and Clinical Psychology, 53(6), 826–838.

Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3), 117–144.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55.

Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological Bulletin, 112(2), 351–362.

Lautenschlager, G. J., & Flaherty, V. L. (1990). Computer administration of questions: More desirable or more social desirability? Journal of Applied Psychology, 75(3), 310–314.

Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32(1), 53–76.

McKenna, K. Y. A., & Bargh, J. A. (2000). Plan 9 from cyberspace: The implications of the Internet for personality and social psychology. Personality and Social Psychology Review, 4(1), 57–75.

Mead, A. D., & Blitz, D. L. (2003). Comparability of paper and computerized non-cognitive measures: A review and integration. Paper presented at the 18th annual conference of the Society for Industrial and Organizational Psychology, Orlando, Florida.

Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin, 114(3), 449–458.

Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are Internet and paper-and-pencil personality tests truly comparable? An experimental design measurement invariance study. Organizational Research Methods, 10(2), 322–345.

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543.

Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44(11), S69–S77.

Mueller, K., Liebig, C., & Hattrup, K. (2007). Computerizing organizational attitude surveys: An investigation of the measurement equivalence of a multifaceted job satisfaction measure. Educational and Psychological Measurement, 67(4), 658–678.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Ployhart, R. E., & Oswald, F. L. (2004). Applications of mean and covariance structure analysis: Integrating correlational and experimental approaches. Organizational Research Methods, 7(1), 27–65.

Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata, and situational judgment tests comparable? Personnel Psychology, 56(3), 733–752.

Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87(3), 517–529.

Randall, D. M., & Gibson, A. M. (1990). Methodology in business ethics research: A review and critical assessment. Journal of Business Ethics, 9(6), 457–471.

Robert, C., Lee, W. C., & Chan, K. (2006). An empirical analysis of measurement equivalence with the INDCOL measure of individualism and collectivism: Implications for valid cross-cultural inference. Personnel Psychology, 59(1), 65–99.

Roberts, B. W., Chernyshenko, O. S., Stark, S., & Goldberg, L. R. (2005). The structure of conscientiousness: An empirical investigation based on seven major personality questionnaires. Personnel Psychology, 58(1), 103–139.

Salgado, J. F., & Moscoso, S. (2003). Internet-based personality testing: Equivalence of measures and assessees' perceptions and reactions. International Journal of Selection and Assessment, 11(2–3), 194–205.

Skitka, L. J., & Sargis, E. G. (2006). The Internet as a psychological laboratory. Annual Review of Psychology, 57, 529–555.

Stanton, J. M. (1998). An empirical assessment of data collection using the Internet. Personnel Psychology, 51(3), 709–725.

Streiner, D. L. (1994). Figuring out factors: The use and misuse of factor analysis. The Canadian Journal of Psychiatry, 39(3), 135–140.

Templer, K. J., & Lange, S. R. (2008). Internet testing: Equivalence between proctored lab and unproctored field conditions. Computers in Human Behavior, 24(3), 1216–1228.

Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., et al. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59(1), 189–225.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.

Vecchione, M., Alessandri, G., & Barbaranelli, C. (2012). Paper-and-pencil and web-based testing: The measurement invariance of the Big Five personality tests in applied settings. Assessment, 19(2), 243–246.

Wise, S. L., & Plake, B. S. (1990). Computer-based testing in higher education. Measurement and Evaluation in Counseling and Development, 23(1), 3–10.