The Philosophy of Psychometrics | IOPS

The Philosophy of Psychometrics

Denny Borsboom

University of Amsterdam

Overview

• What is psychometrics?

• Measurement theory

• Psychometric models

• Conceptual issues in latent variable models

I. What is psychometrics?

Psychometrics is not…

• …Item Response Theory

• …the science of how to analyze questionnaire data

• …the study of individual differences

• …a subdiscipline of statistics

• The core business of psychometrics is the formalized treatment of … in psychology (where … may be: “data”, “theories”, “psychological processes”, “constructs”, “sampling”, “statistical inference”)

• However, … usually has something to do with making the connection between theory and data

What is psychometrics?

Central question: How to connect theory to observations?

Scope and task of psychometrics

Psychometrics is a scientific discipline concerned with the question of how psychological constructs (e.g., intelligence, neuroticism, or depression) can be optimally related to observables (e.g., outcomes of psychological tests, genetic profiles, neuroscientific information).

- Borsboom & Molenaar, Encyclopedia of Social and Behavioral Sciences, forthcoming

II. What is measurement?

Measurement theory

Campbell, N. R. (1920). Physics: The Elements. Cambridge University Press.

Forming a standard sequence

The standard sequence

• Numerical symbols are assigned such that relations between numbers mirror relations between objects

Numerical relations mirror empirical relations

The measurement wars

• In the 1930s, Norman Campbell claimed that (fundamental) measurement always required concatenation

• This meant measurement was impossible in psychology, which lacked concatenation operations

• A committee from the British Association for the Advancement of Science failed to reach agreement on whether psychological measurement is possible

• For more details, see Michell (1999), Measurement in psychology: A critical history of a methodological concept.

The operationalist solution

• Measurement is “the assignment of numerals according to rule”

• Any rule may do!

• The rule followed (“determine equality”, “determine order”, etc.) determines scale levels

• So one is always measuring; the only question is at which level

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

The representational solution

• Measurement involves an act of representation

• Measurement attempts to capture the structure of an attribute in symbolic form

• Numeric relations are isomorphic to empirical relations

• Scale levels are defined by the set of transformations that leave the isomorphism intact
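
The link between admissible transformations and scale levels can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the three objects and their numerical values are hypothetical, and the three transformations were chosen only to show which empirical relations (order, order of differences, concatenation) survive each of them.

```python
from itertools import product

# Hypothetical numerical assignment: rod c is as long as rods a and b
# laid end to end (the concatenation a∘b ~ c).
scale = {"a": 1.0, "b": 3.0, "c": 4.0}
names = list(scale)

def order_preserved(f):
    """True if f keeps every 'at least as long as' relation intact."""
    return all((scale[x] >= scale[y]) == (f(scale[x]) >= f(scale[y]))
               for x, y in product(names, repeat=2))

def difference_order_preserved(f):
    """True if f keeps the ordering of all differences intact."""
    return all(
        ((scale[w] - scale[x]) >= (scale[y] - scale[z]))
        == ((f(scale[w]) - f(scale[x])) >= (f(scale[y]) - f(scale[z])))
        for w, x, y, z in product(names, repeat=4))

def additivity_preserved(f):
    """True if f(a) + f(b) still equals f(c)."""
    return abs(f(scale["a"]) + f(scale["b"]) - f(scale["c"])) < 1e-9

transformations = {
    "ratio (x -> 2.54x)": lambda x: 2.54 * x,
    "interval (x -> 2x + 10)": lambda x: 2 * x + 10,
    "ordinal (x -> x**3)": lambda x: x ** 3,
}

results = {
    name: (order_preserved(f), difference_order_preserved(f), additivity_preserved(f))
    for name, f in transformations.items()
}
for name, flags in results.items():
    print(name, dict(zip(("order", "difference order", "additivity"), flags)))
```

In this toy case only the similarity transformation keeps all three relations, the linear transformation loses additivity, and the monotone transformation keeps nothing but order, which is the sense in which the set of structure-preserving transformations defines the scale level.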

Patrick Suppes (1922-2014)

Measurement, transformations, and scale levels

Numerical relations no longer mirror empirical relations

Axiomatic measurement theory

Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. I).

Duncan Luce (1925-2012)

Group photo of psychologists who apply axiomatic measurement theory

The “classical” solution

• Measurement is the determination of the ratio between a magnitude and another magnitude of the same kind (the unit)

• Only quantitative attributes allow for measurement

• Testing the hypothesis of quantitative structure is central

• Joel Michell claims that psychometrics ignores this issue

Joel Michell

The psychometric solution

[Excerpt shown on slide; running head: “Turtles all the way down?”, p. 94]

et al., 2004; Woolrich, 2008). Statistics were thresholded using cluster-based correction at z = 2.3 and a corrected cluster significance threshold of 0.05 (Worsley, 2001).

Results

Behavioural results

Previous studies examining changes in task demands have often divided stimuli into ‘easy’ and ‘hard’ (e.g. Kalbfleish, van de Meter & Zeffiro, 2007; Perfetti et al., 2009). Here, we model difficulty continuously, so as to better capture the complete parametric space of difficulty offered by the stimuli. To decompose the differential contributions of difficulty and ability to the neural response, we fit a Rasch model to the response patterns. The Rasch model belongs to the family of Item Response Theory models (IRT; Hambleton, Swaminathan & Rogers, 1991). In the Rasch model, the difficulty of items is related to the ability of participants through a logistic function that gives the probability of answering an item correctly. Variants of the Rasch model are widely used both for general abilities (educational testing, e.g. Bond & Fox, 2013) and for specific skills (modelling chess ability, e.g. van der Maas & Wagenmakers, 2005).

In the Rasch model we model i dichotomously scored items (1 = correct, 0 = incorrect) for j persons. Each item has a difficulty parameter β, and each person has an ability parameter θ. The probability that person j with ability θ answers item i with difficulty β correctly is described by the logistic function shown in Figure 2. We fit a Rasch model in R (R Core Team, 2013) using the packages ltm (Rizopoulos, 2006) and eRm (Mair & Hatzinger, 2007). We scored both null responses (no response within the 30-second time limit) and incorrect responses as incorrect, giving each participant a potential range of 0 to 72 correct.

The 34 participants answered an average of 39.6 items correctly (range: min = 19, max = 53, SD = 8.8), and took an average of 16143 ms to respond to items (range: min = 1199 ms, max = 29990 ms, SD = 4240). To best estimate the ability parameter (θ) of each participant given the sample size, we constrained the difficulty parameters (β) of the 72 items based on the Raven's standardization sample (Raven, Court & Raven, 1996). The difficulty parameters of the items ranged from -3.59 to 4.8, capturing a wide range of difficulties.

Figure 2. The 72 Raven's matrices items represented as ranging from easy (green/left) to hard (red/right). A subject's ability is modelled such that their score, theta, most closely corresponds to the probability of answering each of the items correctly, based on their response pattern. The difficulty of an item (beta) can be read off by looking up the position on the X-axis that corresponds to a probability of .5 of answering that item correctly (example shown in blue).
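
The logistic function described in the excerpt, and the idea of estimating ability while item difficulties are held fixed, can be sketched as follows. This is a minimal illustration under stated assumptions: the five difficulties and the response pattern are hypothetical, and the bisection routine is a stand-in for illustration, not the procedure implemented in ltm or eRm.

```python
import math

def rasch_probability(theta, beta):
    """P(correct) for a person with ability theta on an item with difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def estimate_theta(responses, betas, lo=-10.0, hi=10.0, tol=1e-8):
    """Maximum-likelihood ability estimate with item difficulties held fixed.

    The likelihood equation is sum(x_i) = sum(P_i(theta)). The right-hand
    side increases in theta, so bisection finds the root. The raw score
    must lie strictly between 0 and the number of items.
    """
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        raise ValueError("ML estimate is not finite for zero or perfect scores")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(rasch_probability(mid, b) for b in betas)
        if expected < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

betas = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical fixed item difficulties
responses = [1, 1, 1, 0, 0]           # hypothetical response pattern
theta_hat = estimate_theta(responses, betas)
expected_score = sum(rasch_probability(theta_hat, b) for b in betas)
print(round(theta_hat, 3), round(expected_score, 3))
```

When θ equals β the probability is exactly .5, which is why an item's difficulty can be read off the X-axis at the .5 point in Figure 2.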

Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Paedagogiske Institut.

Charles Spearman: The positive manifold

Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.

Charles Spearman: The concept of a latent variable

Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.

[Figure: path diagram of a latent variable model with three observed variables X1, X2, X3]

The workhorse of psychometrics

[Figure: test items such as “2+2=...” and “Planet : Mars = Season : ...?” are correlated; the latent variable model explains these correlations through general intelligence, with a separate error term for each item]

Reichenbach’s common cause

• C is a common cause of A and B if and only if

1. P(A|C) > P(A|~C) and P(B|C) > P(B|~C)

2. P(A&B) > P(A)P(B)

3. P(A&B|C) = P(A|C)P(B|C)

[Figure series: the number of firemen, the number of paramedics, and the number of spectators at a fire are correlated; the size of the fire is their common cause; conditional on the size of the fire, no correlation remains]

Local Independence
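
The fire example can be simulated to show conditions 2 and 3 at work: a marginal correlation that disappears once the common cause is held fixed. This is a minimal sketch with made-up numbers (three fire sizes, arbitrary slopes and noise levels); none of these values come from the slides.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def correlation(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

n_fires = 20000
fire_size = [rng.choice([1, 2, 3]) for _ in range(n_fires)]   # the common cause
firemen = [10 * s + rng.gauss(0, 2) for s in fire_size]       # depends only on size
spectators = [20 * s + rng.gauss(0, 5) for s in fire_size]    # depends only on size

# Marginally, firemen and spectators are strongly correlated.
marginal_r = correlation(firemen, spectators)

# Conditional on the size of the fire, the correlation vanishes.
conditional_r = {}
for size in (1, 2, 3):
    idx = [i for i, s in enumerate(fire_size) if s == size]
    conditional_r[size] = correlation([firemen[i] for i in idx],
                                      [spectators[i] for i in idx])

print(round(marginal_r, 2), {k: round(v, 2) for k, v in conditional_r.items()})
```

In psychometric terms the fire size plays the role of the latent variable and the three counts play the role of item responses; the vanishing conditional correlations are what local independence asserts.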

Reichenbach’s common cause

• Translating to psychometrics: θ is a latent variable for item response vector X if and only if

• These assumptions are known as monotonicity, positive association, and local independence, and are typically taken as axioms for the (monotone homogeneous) latent variable model
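
The formulas for these three conditions did not survive in this transcript. One common way to write them, offered here as an assumption that mirrors Reichenbach's three conditions rather than as a reconstruction of the slide, uses θ for the latent variable and X = (X_1, ..., X_J) for binary item responses:

```latex
% theta: latent variable; X = (X_1, ..., X_J): binary item responses
\begin{align*}
\text{1. Monotonicity:} \quad
  & P(X_j = 1 \mid \theta) \text{ is nondecreasing in } \theta \text{ for every item } j \\
\text{2. Positive association:} \quad
  & P(X_j = 1, X_k = 1) \ge P(X_j = 1)\, P(X_k = 1) \text{ for all items } j \ne k \\
\text{3. Local independence:} \quad
  & P(X = x \mid \theta) = \prod_{j=1}^{J} P(X_j = x_j \mid \theta)
\end{align*}
```

The slide itself may have used a different but related formulation.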

Relation with measurement theory

Campbell, N. R. (1920). Physics: The Elements. Cambridge University Press.

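[Figure: a balance scale with outcomes “tilt left”, “balance”, and “tilt right”; the response probabilities P(tilt left) and P(tilt right), ranging from 0 to 1, are plotted against the mass difference mL − mR]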

From measurement theory to psychometrics

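[Figure: response probabilities P(tilt left), P(balance), and P(tilt right), ranging from 0 to 1, plotted against mL − mR. The axis divides into three regions: one where P(tilt right) > P(balance) > P(tilt left), one where P(balance) exceeds both P(tilt right) and P(tilt left), and one where P(tilt right) < P(balance) < P(tilt left)]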

From measurement theory to psychometrics

Markus, K. A., & Borsboom, D. (2013). Frontiers of validity theory: Measurement, causation, and meaning. New York: Routledge.

Successes

Three cheers for latent variables

• Latent variable models are a major success story

• Highly practical

• Testable: can be refuted with the right data

• Models handle measurement error elegantly

• Incorporated in user-friendly computer programs

• Well studied, and work on many models can be considered finished

However…

• It does not seem that psychometric models offer results comparable to those of the natural sciences in terms of prediction and control

• The problem of construct validity doesn’t want to go away

• Most psychometricians are wary of accepting a strong causal interpretation of latent variable models

• As Lykken (1993) said: Something seems to be wrong somewhere

III. The philosophy of psychometrics

The philosophy of psychometrics

• The philosophy of psychometrics can involve

– The interpretation of psychometrics’ central terms (latent variables, reliability, validity)

– The status of psychometric models per se

– The function of psychometrics in society

• Characteristic focus is on conceptual issues that are not primarily empirical or mathematical in nature

Zooming in on the latent variable model

• Where does the probabilistic structure come from?

• What is the relation between latent and observed variables?

• What are latent variables anyway?

Borsboom, D. (2005). Measuring the mind. Cambridge: Cambridge University Press.

[Figure, built up over several slides: an item response curve with P(X=1), from 0.0 to 1.0, on the vertical axis and the latent variable θ (Extraversion) on the horizontal axis. The item is a binary response (1: ‘yes’, 0: ‘no’) to the question ‘I am the life of the party’. John's position on θ corresponds to P = .10. The curve is described by P(X_ij = 1) = f(θ_i − β_j).]

Where does the randomness come from?

• The model relates expected values of observables to the latent variable

• Expected values apply to random variables; Gunter Maris (personal communication) has stated that this is the only real axiom in psychometrics

• Why is a person’s response to the item ‘I am the life of the party’ a random variable?

The interpretation of E(X|q)

• Two consistent interpretations are known (Holland, 1990):

• The stochastic subject interpretation views the expected value as a characteristic of an individual

• The repeated sampling interpretation views the expected value as a characteristic of a population

Paul Holland

Lord & Novick: The stochastic subject

Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

The spectacular case of Mr. Brown

‘Suppose we ask an individual, Mr. Brown, repeatedly whether he is in favour of the United Nations; suppose further that after each question we ‘wash his brains’ and ask him the same question again. Because Mr. Brown is not certain as to how he feels about the United Nations, he will sometimes give a favorable and sometimes an unfavorable answer. Having gone through this procedure many times, we then compute the proportion of times Mr. Brown was in favor of the United Nations.’

- Lazarsfeld (1959); in Lord & Novick, 1968, pp. 29-30

See also: Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2002). Functional thought experiments. Synthese, 130, 379-387.

Repeated sampling

• The repeated sampling interpretation (Holland, 1990) takes the expected value to apply to a population

• This means that the expected value P(X=1|θ)=.10 is interpreted as

‘In the population of people who have position θ on the latent variable, 10% endorse the item ‘I am the life of the party’; conditional on that value of θ, the probability of randomly drawing a person who endorses the item is therefore .10’

• This requires a less bizarre thought experiment, but does raise the question of what such models have to do with individuals and, for that matter, psychology...
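
The difference between the two interpretations can be made concrete with a small simulation. This is a sketch under invented assumptions: 10,000 hypothetical people share the same position on the latent variable, and the 20 "brainwashed" replications per person exist only in the thought experiment. In one world every person is a stochastic subject; in the other, 10% of people always say ‘yes’ and 90% always say ‘no’.

```python
import random

rng = random.Random(42)   # fixed seed so the sketch is reproducible
N_PEOPLE = 10_000         # hypothetical people sharing one latent position
P_ENDORSE = 0.10          # P(X=1 | theta), as on the slide
N_OCCASIONS = 20          # hypothetical replications per person

def stochastic_subjects():
    """Every person endorses with probability .10 on every occasion."""
    return [[int(rng.random() < P_ENDORSE) for _ in range(N_OCCASIONS)]
            for _ in range(N_PEOPLE)]

def fixed_subjects():
    """About 10% of people always endorse; the others never do."""
    return [[int(rng.random() < P_ENDORSE)] * N_OCCASIONS
            for _ in range(N_PEOPLE)]

def first_occasion_rate(data):
    """What a single cross-sectional administration would show."""
    return sum(person[0] for person in data) / len(data)

def share_who_change(data):
    """Share of people whose answer varies over replications."""
    return sum(1 for person in data if len(set(person)) > 1) / len(data)

world_a = stochastic_subjects()
world_b = fixed_subjects()
rate_a, rate_b = first_occasion_rate(world_a), first_occasion_rate(world_b)
change_a, change_b = share_who_change(world_a), share_who_change(world_b)
print(rate_a, rate_b, change_a, change_b)
```

A single administration yields roughly 10% endorsement in both worlds, so cross-sectional data cannot tell them apart; only the (practically unavailable) replications would.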

Alternatives?

• Bayesian interpretation of probability as degree of belief seems out of the question

• Propensity interpretation (as a “tendency”) is open (but propensities are problematic too)

• I know of no unproblematic interpretations

• If it’s any comfort, however, probabilities are conceptually problematic in almost any context where they occur

[Figure, repeated: the item response curve P(X_ij = 1) = f(θ_i − β_j), with P(X=1) from 0.0 to 1.0 plotted against the latent variable θ (Extraversion), for the binary item ‘I am the life of the party’]

Causality

• Consider two kinds of causal statements:

– Between subjects: ‘The differences between populations of subjects in their position on the latent variable cause population differences in item response distributions’

– Within subjects: ‘The position of a particular subject on the latent variable causes his or her item responses’

• The first of these formulations may be defended; the second, however, is quite problematic

Between-subjects causality

• Between-subjects causal account can be defended, e.g. via the conditions of Mill (1843):

– There is covariation between differences in latent variable position and differences in item responses

– Difference in latent variable values is prior to difference in item responses (if realism is assumed)

– If there is no difference in latent variable position, there is no systematic difference in item responses

• This is consistent with most accounts of causality, and with the repeated sampling interpretation of expected values

Within-subjects causality

• Has three major problems:

1) The model does not formulate actual covariation at the individual level

2) Invoking counterfactual covariation requires assuming an uncomfortable amount of structure

3) Generally, the intra-individual causal account is implausible from a substantive viewpoint

1) There is no covariation

• The latent variable varies over, but not within, people

• The individual’s position is considered a constant

• No covariation within subjects is formulated or used

• Conclusion: the model cannot be said to formulate or test causal hypotheses at the level of the individual unless external assumptions are added

2) Counterfactuals assume a lot

• Introduce counterfactual covariation: ‘If John had been more extraverted, he would have had a higher expected item response’

• This projects the between-subjects model to the intra-individual space

• It assumes local homogeneity: the within-subjects model is the same as the between-subjects model

2) Counterfactuals assume a lot

• Molenaar (1999) shows the factor model is insensitive to violations of local homogeneity

• A factor model can fit the population, but not fit any single element from that population

• The counterfactual “if John had been more extraverted”… may thus be meaningless at the individual level
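
A toy simulation can show how a between-subjects structure may say nothing about the structure within any one person. This is an invented data-generating process, not Molenaar's analysis: each person has a constant trait, while a momentary state pushes two items in opposite directions. All variances and sample sizes are hypothetical.

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is reproducible

def correlation(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def respond(trait):
    """One occasion: the trait is constant, the state differs per occasion."""
    state = rng.gauss(0, 0.5)
    item1 = trait + state + rng.gauss(0, 0.3)
    item2 = trait - state + rng.gauss(0, 0.3)
    return item1, item2

# Between subjects: one occasion for each of 2000 hypothetical people.
traits = [rng.gauss(0, 1) for _ in range(2000)]
cross_section = [respond(t) for t in traits]
between_r = correlation([a for a, _ in cross_section],
                        [b for _, b in cross_section])

# Within one subject ("John"): 500 occasions, trait held constant.
john = [respond(traits[0]) for _ in range(500)]
within_r = correlation([a for a, _ in john], [b for _, b in john])

print(round(between_r, 2), round(within_r, 2))
```

Across people the two items correlate positively, as a common factor would imply, yet within John they correlate negatively, so the population model does not describe him. Local homogeneity fails in this toy world even though the cross-sectional data look unremarkable.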

Peter Molenaar

3) Substantive plausibility?

So what is the relation between latent and observed variables?

• Perhaps we should think of this relation as causal, but in a different way

• Perhaps between-subjects causality is ok (?)

• Perhaps the relation between observed and latent variable should not be construed as causal at all

• Perhaps the latent variable doesn’t exist except as a mathematical trick or emergent phenomenon, and maybe that’s no problem

The labyrinth of philosophy

[Figure, repeated: the item response curve P(X_ij = 1) = f(θ_i − β_j), with P(X=1) from 0.0 to 1.0 plotted against the latent variable θ (Extraversion), for the binary item ‘I am the life of the party’]

Alternative I: Behavior domains

• Item responses are not measures, but samples from a behavior domain (McDonald, 2003)

• The object of measurement in psychometric tests is the person’s domain score

• This domain score is a tail-measure (Ellis & Junker, 1997) on the domain

• Inductive move from test scores to domain scores is not causal inference, but generalization

Roderick McDonald (1928-2011)

Alternative I: Behavior domains

[Figure: a behavior domain, depicted as a large collection of individual behaviors]

Here the construct is a score on a domain, from which the test items are a sample

Alternative I: Behavior domains

• Theory applies naturally to (unidimensional) item domains

• Especially useful if the domain can be substantively delineated (e.g., all addition items “x + y = z”)

• Low key metaphysics (well, relatively speaking)

• But: how credible is the idea of tail measures on an infinite domain?

• What if the domain doesn’t conform to a unidimensional model?

• What if we cannot construct more than a few items?

Alternative II: Networks

• Van der Maas et al. (2006) showed that most of the known facts about general intelligence can be reproduced by a mutualism model, in which cognitive processes positively influence one another

• Recently, Epskamp et al. showed that any network structure generates a set of equivalent multidimensional IRT (mIRT) models

• Each clique (fully connected subnetwork) generates one latent variable
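
The core intuition behind mutualism, that positive correlations can arise without any common cause, can be sketched with a deliberately simplified model. This is a static, linear stand-in chosen for brevity and is not the dynamic model of Van der Maas et al. (2006): each hypothetical person has three independent capacities, and each ability settles at its own capacity plus a fixed share of the other abilities. The coupling strength and sample size are assumptions.

```python
import random

rng = random.Random(7)   # fixed seed so the sketch is reproducible
N_PEOPLE = 3000          # hypothetical sample size
N_ABILITIES = 3
COUPLING = 0.2           # hypothetical benefit each ability gets from the others

def correlation(xs, ys):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def equilibrium(capacities, n_iter=200):
    """Iterate x_i = capacity_i + COUPLING * (sum of the other abilities)."""
    x = list(capacities)
    for _ in range(n_iter):
        x = [capacities[i] + COUPLING * (sum(x) - x[i])
             for i in range(N_ABILITIES)]
    return x

# Capacities are drawn independently: there is no common cause in this world.
capacities = [[rng.gauss(10, 1) for _ in range(N_ABILITIES)]
              for _ in range(N_PEOPLE)]
abilities = [equilibrium(c) for c in capacities]

def column(rows, j):
    return [row[j] for row in rows]

capacity_r = correlation(column(capacities, 0), column(capacities, 1))
ability_r = {(i, j): correlation(column(abilities, i), column(abilities, j))
             for i in range(N_ABILITIES) for j in range(i + 1, N_ABILITIES)}

print(round(capacity_r, 2), {k: round(v, 2) for k, v in ability_r.items()})
```

The capacities are uncorrelated by construction, yet all pairs of resulting abilities correlate positively, which is the kind of positive manifold a one-factor model would also fit.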

Han van der Maas

Alternative II: Networks

• Theory is attractive to substantive psychologists

• New ways of thinking about the relation between items and constructs

• Estimation procedures and model fitting software are still in development

• Complexity is… complex

• Is the network model really an alternative, or just a different way of thinking about the latent variable model?

Alternative III: Process models

Van der Maas, H. L. J., Molenaar, D., Maris, G., Kievit, R. A., & Borsboom, D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339-356.

Alternative IV: Forget about (reflective) latent variables

[Figure: socio-economic status as a composite of quality of neighborhood, educational level, and salary]

Alternative IV: Forget about (reflective) latent variables

• “Composition models”:

– Principal components

– Discriminant functions

– Clustering (e.g., K-means)

• Generally, any model in which a composite is formed that can be exhaustively defined in terms of observables (i.e., behaves as a common effect), is amenable to a formative interpretation

• Some think the latent variable model may have such an interpretation too…

Alternative IV: Forget about (reflective) latent variables

• Pro: Formative models always work

• Con: Formative models always work

• Can be used in the presence of heterogeneous items, even with a Cronbach’s alpha of zero

• Are typically not falsifiable without external (reflective!) measures

• It is not easy to think of formative models as measurement models
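
The point that formative models "always work" can be illustrated with a composite built from completely unrelated indicators. This is a minimal sketch with simulated data: the three indicators are independent by construction, and the unit-weighted mean is just one arbitrary choice of composite.

```python
import random

rng = random.Random(3)   # fixed seed so the sketch is reproducible
N = 5000                 # hypothetical sample size
K = 3                    # number of indicators

# Three mutually independent indicators (e.g., standardized scores).
items = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(K)]

def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(item_columns):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = len(item_columns)
    totals = [sum(values) for values in zip(*item_columns)]
    return k / (k - 1) * (1 - sum(variance(c) for c in item_columns) / variance(totals))

alpha = cronbach_alpha(items)

# The composite is defined for every person regardless of alpha.
composite = [sum(values) / K for values in zip(*items)]

print(round(alpha, 3), len(composite))
```

Alpha is close to zero, so a reflective model would be rejected, but the composite is still perfectly well defined. Nothing in the data can refute it, which is both the pro and the con listed above.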

Conclusion

• Conceptual questions about psychometric models matter!

• New approaches can arise from critical investigations of existing theory

• The philosophy of psychometrics is still not very well charted and offers many research possibilities

Discussion!

“Latent variables are just a mathematical trick”

“Latent variables like general intelligence are real, causally efficient variables”

[Figure: path diagram of a latent variable model with three observed variables X1, X2, X3]

“Psychometrics and psychology are disconnected”

“Psychometrics is a subdiscipline of statistics”

“Models should be as independent of substantive theories as possible”

“Psychometricians should aim for more influence in psychology”