MASARYK UNIVERSITY
Faculty of Arts
Psychology
Diploma thesis
Comparison of ipsative and normative measures from the
perspective of their psychometric properties
Supervisor: PhDr. Martin Jelínek, Ph.D.
Brno 2015 Author: Dávid Rédli
Declaration
I declare that I worked on this thesis on my own and used only the sources listed in the References. I agree that my work may be stored in the library of the Faculty of Arts of Masaryk University in Brno and made available for educational purposes.
Prehlásenie (declaration in Slovak)
I declare that I prepared the diploma thesis independently and used only the sources listed in the references. I agree that the thesis be stored at Masaryk University in Brno in the library of the Faculty of Arts and made available for study purposes.
Brno, 30th April 2015 ………………..
Acknowledgement
I would hereby like to thank my parents, who supported me throughout my studies, helped me, encouraged me whenever I hesitated, and were always there for me. My heartfelt thanks go to my beloved Veronika, who supported me whenever I needed it and stood by me in good times and bad. Lastly, I am thankful to PhDr. Martin Jelínek, Ph.D. for his useful advice and comments, his helpful attitude and his tolerance. Thank you all.
Table of contents
Table of contents ..................................................................................................................................... 4
Foreword ................................................................................................................................................. 7
1. Introduction ..................................................................................................................................... 8
1.1. A brief history of psychological assessment ........................................................................... 8
1.2. Classical Test Theory .............................................................................................................. 9
1.3. Psychological measurement and types of variables .............................................................. 10
1.4. Normative measurement ........................................................................................................ 11
1.4.1. Disadvantages of normative measures .......................................................................... 12
1.4.2. Response bias ................................................................................................................ 13
1.5. Ipsative measurement ............................................................................................................ 14
1.5.1. Types of ipsative measures ............................................................................................ 15
1.5.2. Problematic properties of ipsative measures ................................................................. 16
1.5.3. Differences between normative measures and ipsative measures ................................. 18
1.6. Psychometric properties of ipsative measures ....................................................................... 19
1.6.1. Statistical methods applicable with ipsative data .......................................................... 19
1.6.2. Untestable reliability of ipsative measures .................................................................... 21
1.6.2.1. Problems with comparing of measures in order to estimate reliability ................. 22
1.6.2.2. Problems with estimating internal consistency...................................................... 23
1.6.3. Factor analysis ............................................................................................................... 25
1.6.4. Cluster analysis .............................................................................................................. 28
1.7. Advantages of ipsative measures ........................................................................................... 29
1.7.1. Reduction of Response Bias .......................................................................................... 29
1.7.2. Moderate responding ..................................................................................................... 30
1.7.3. Decision making in responding to normative vs. ipsative questionnaires ..................... 30
1.7.4. Summary of advantages and disadvantages of ipsative measures ................................. 31
1.7.5. Applicability and use of ipsative measures ................................................................... 32
1.8. Summary of ipsative measurements ...................................................................................... 33
1.9. NEO personality inventory .................................................................................................... 33
1.9.1. History of Big Five Model and NEO inventory ............................................................ 33
1.9.2. Description of the Big Five personality traits ................................................................ 35
1.9.3. Psychometric properties of NEO-FFI ............................................................................ 36
2. Hypothesis ..................................................................................................................................... 37
3. Method........................................................................................................................................... 38
3.1. Administration ....................................................................................................................... 38
3.2. Creating an ipsative version of normative NEO-FFI ............................................................. 39
3.2.1. Grouping the items ........................................................................................................ 40
3.2.2. Determining the maximum points to be distributed in groups in form B ...................... 41
3.2.3. Transformation of negative questions ........................................................................... 42
3.3. Experimental design .............................................................................................................. 43
3.4. Respondents........................................................................................................................... 43
4. Results ........................................................................................................................................... 44
4.1. Total data ............................................................................................................................... 44
4.2. Normative data ...................................................................................................................... 44
4.2.1. Description of normative data ....................................................................................... 44
4.2.2. NEO-FFI Results ........................................................................................................... 45
4.2.3. Internal consistency ....................................................................................................... 46
4.2.4. Factor analysis ............................................................................................................... 48
4.2.5. Reliability - test-retest group ......................................................................................... 48
4.3. Ipsative data ........................................................................................................................... 50
4.3.1. Description of ipsative data ........................................................................................... 50
4.3.2. NEO FFI Results – Ipsative ........................................................................................... 50
4.3.3. Internal consistency ....................................................................................................... 53
4.3.4. Factor analysis ............................................................................................................... 55
4.3.5. Cluster analysis .............................................................................................................. 55
4.3.6. Reliability - test-retest group ......................................................................................... 56
4.4. Comparing ipsative and normative data ................................................................................ 57
4.4.1. Graphical representation of relations between Ipsative and normative data ................. 58
4.4.2. Correlation coefficient ................................................................................................... 60
4.4.3. Comparison of correlations in test and re-test in various groups .................................. 61
4.4.4. Analysis of items – reliability of separate items ............................................................ 62
4.4.5. Comparison of final rank results ................................................................................... 63
4.5. Variability of total data .......................................................................................................... 64
4.6. Social desirability .................................................................................................................. 66
5. Discussion ..................................................................................................................................... 67
5.1. Limitations of study ............................................................................................................... 67
5.1.1. Respondents ................................................................................................................... 67
5.1.2. Administration through internet .................................................................................... 68
5.1.3. Qualitative analysis – some comments from respondents ............................................. 68
5.1.4. Distribution of points ..................................................................................................... 69
5.2. Properties of the semi-ipsative and normative measure and applicable statistics ................. 70
5.2.1. Ipsativity of the hybrid measure .................................................................................... 70
5.2.2. The similarity of the two forms ..................................................................................... 70
5.2.3. Applicability of methods for statistical analysis ............................................................ 71
5.2.4. Reliability of the semi-ipsative vs. normative scale ...................................................... 72
5.2.5. Advantages of the semi-ipsative measure ..................................................................... 73
5.3. Improvements ........................................................................................................................ 74
5.3.1. Testing the validity of two forms .................................................................................. 74
5.3.2. Adjustment of design in order to reveal response bias .................................................. 74
5.3.3. Use of same scale for ipsative and normative data ........................................................ 75
6. Conclusion ..................................................................................................................................... 76
References ............................................................................................................................................. 77
List of tables .......................................................................................................................................... 81
Attachments ........................................................................................................................................... 82
1. Factor Analysis Normative data - Rotated Component Matrix ............................................. 82
2. Factor Analysis Ipsative data - Rotated Component Matrix ................................................. 83
3. Example of statistics used for Item analysis .......................................................................... 84
4. Inter-item correlation table .................................................................................................... 87
Foreword
“Data will not object. That is why they are misused so often”
prof. PhDr. Tomáš Urbánek, Ph.D.
The above quote from my teacher of Methodology in Psychology precisely describes the current misuse of psychological measures in practice. Based on my own experience with dubious measures, I decided to dedicate my diploma thesis to discussing the properties of such measures. When I first came across a personality inventory that used the ipsative format, I was pleased to see that there is an alternative to normative measures. However, the more I learned about ipsative measures, the more questions about their appropriateness arose. Since the literature on the topic was not conclusive, I wanted to test it myself. Hence the following study was conducted.
1. Introduction
1.1. A brief history of psychological assessment
The establishment of psychometrics is attributed to Sir Francis Galton and its beginning is dated to the second half of the 19th century (Rust, 2008).1 Historians of psychology consider Galton the “father of mental testing” because of his attempts to create the first tests measuring psychological attributes (Boring, 1950). Another important researcher was James McKeen Cattell, who studied individual differences and introduced the term “mental test”2 (Gregory, 1991). Galton’s and Cattell’s attempts can be considered the first “intelligence tests”. Cattell in particular developed a set of measurements that were supposed to predict intelligence by measuring responses to simple stimuli (Urbánek, Denglerová, & Širuček, 2011). However, this approach was shown to be unsuccessful by Wissler in 1901.
Later, in 1905, Binet and Simon created the Binet-Simon scale for measuring “mental age”, and in 1912 William Stern created the index known as the “classical intelligence quotient”. An era of intelligence testing followed, which gave rise to the first psychometric instruments. The Binet-Simon scale was revised and adapted many times. It was first used to test children in schools, but when WWI began, special adaptations of it were created for the army as a selection tool for recruits.3 The measurement of IQ as we know it today began with the construction of the Stanford-Binet IQ test in 1916 (Gregory, 1991).
With the increasing use of intelligence tests, attention turned also to personality traits, and personality questionnaires started to emerge. The first of its kind was the Woodworth Personal Data Sheet published in 1919, followed by Thurstone’s Personality Schedule in 1930, the Allport-Vernon Study of Values in 1931, the Minnesota Multiphasic Personality Inventory (MMPI) in 1943, Cattell’s 16 Personality Factor Questionnaire in 1949, the Myers-Briggs Type Indicator in 1944, and many others (Gregory, 1991).
Hand in hand with the boom of testing at the beginning of the 20th century, the need for statistics to describe tests mathematically rose steadily. That is why many researchers, especially in the 1940s and 1950s, focused on constructing psychological measures, searched for new methods to analyse these measures and the results obtained, and tried to predict the properties of such measures (Urbánek, Denglerová, & Širuček, 2011). This gave rise to modern psychometrics, which concerns the procedures used to “estimate and evaluate the attributes of tests” (Furr, 2014).
1 The birth of psychometrics is by many authors connected to the publication of the book “Inquiries into Human Faculty and Its Development” by Sir Francis Galton in 1883 (Urbánek, Denglerová, & Širuček, 2011).
2 The term was introduced in Cattell’s paper “Mental Tests and Measurements”, published in the journal Mind in 1890 (Gregory, 1991).
3 For this purpose the adaptations of the intelligence tests Army Alpha and Army Beta were created by R. Yerkes.
1.2. Classical Test Theory
Since the emergence of intelligence tests, the main approach to evaluating tests has been Classical Test Theory (CTT)4. This approach emerged from three important achievements: the recognition of measurement error as a random variable and the conception of correlation by Charles Spearman in 1904; the publication of the Kuder-Richardson formulas for estimating reliability and the idea of lower bounds to reliability; and lastly the systematic treatment of CTT by Melvin Novick in 1966 (Traub, 1997).
CTT is based on the proposition that the observed score, a random variable, consists of a latent trait component (the true score) and a measurement error. The measurement error is itself a random variable and is assumed to have zero covariance with the true score, i.e. the two are uncorrelated. Furthermore, the error component of a measure is independent of the error components of other measures (Novick, 1965).
The results of CTT, or rather the means by which tests are evaluated, are coefficients of reliability and standard errors of measurement (Traub, 1997). Reliability is the consistency of measurement, i.e. the stability of measurement over a variety of conditions under which the same results should be obtained. It is estimated using test-retest, split-half or parallel forms of a test, correlating the results of one administration with the results obtained from another administration of the same test. The correlation is estimated using Pearson’s product-moment correlation coefficient (Pearson’s r). Another estimate of reliability is internal consistency, measured by Cronbach’s alpha, a statistic introduced by Lee Cronbach in 1951 that represents the expected correlation of two tests that measure the same construct (Drost, 2012). All the statistical instruments used in estimating reliability stem from the basic propositions of CTT.
4 Even though currently other approaches are gaining on popularity. One of the most influential recent approach that could in the near future replace CTT is Item-Response Theory.
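The internal-consistency estimate described above can be made concrete. The following minimal Python sketch computes Cronbach’s alpha from a respondents-by-items score matrix using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of the total score); the score matrix is invented toy data, not taken from this study.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: four respondents answering three Likert-type items.
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
])
print(round(cronbach_alpha(scores), 3))  # 0.946
```

Because alpha rises both with the number of items and with the average inter-item correlation, it is conventionally reported per scale rather than for a whole inventory.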
1.3. Psychological measurement and types of variables
In order to understand the mathematical background of psychological measures, the concept of measurement itself must be described. Measuring is the process of assigning numbers to attributes according to rules connected to a specific theory (Hendl, 2004). In psychology, indirect measurement is used: numbers are not assigned by direct comparison to a scale but by using observable third variables, which help estimate the correct value.
This is done using several types of scales of measurement. The scales differ in how well they convey the measured information into real numbers and in which operations can be performed on the resulting values. According to Stevens’s typology (Stevens, 1946), four types of scales are currently used in the social sciences: nominal, ordinal, interval and ratio.
A nominal scale differentiates items based only on their categories: it assigns each item to one category or another. With nominal data only category membership and frequencies can be determined; the data may be dichotomous (male/female) or polytomous (e.g., language).
Ordinal scales have the additional feature of ordering. They also assign categories to objects, but the categories can furthermore be ordered according to some criterion. Ordinal data can be ranked, but the distances between ranks cannot be compared, since we do not know the differences between them. As for statistics, central tendency indicators such as the median and mode can be used, but the arithmetic mean would provide uninterpretable results. Some authors claim that all psychological questionnaires produce this type of data, since they measure opinions on people’s cognitive or other attributes, and at present there is little evidence to suggest that such attributes are more than ordinal (Michell, 2008).
An interval scale has all the attributes of the previous scales and in addition captures the degree of difference between items. However, interval data have no absolute zero, so they cannot be used to form ratios (20 degrees is not twice as warm as 10 degrees). As for their mathematical description, central tendency can be estimated using the mode, median and arithmetic mean. Measures of statistical dispersion include the range and standard deviation, but measures that require ratios (such as the coefficient of variation) cannot be used. Furthermore, it is possible to define standardized moments, since ratios of differences are meaningful.
The ratio scale, finally, includes the characteristics of all the previous types, and in addition ratios can be formed. This is possible because ratio scales include an absolute (non-arbitrary) zero, so the unit magnitude is given in a continuous quantity. With ratio-scale data all statistical measures are allowed, because all the necessary mathematical operations are defined for the ratio scale.
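The distinction between permissible statistics can be shown in a few lines of Python; the Likert responses below are invented for illustration.

```python
import statistics

# Ordinal data: Likert responses, 1 = "strongly disagree" ... 5 = "strongly agree".
responses = [2, 4, 4, 5, 3, 4, 1, 4]

# Order-based summaries are defensible for ordinal data.
print(statistics.median(responses))  # 4.0
print(statistics.mode(responses))    # 4

# The arithmetic mean implicitly assumes interval-scale data:
# it treats the distance between 1 and 2 as equal to that between 4 and 5.
print(statistics.mean(responses))    # 3.375
```

Reporting the mean of Likert items is common practice, but strictly speaking it only becomes interpretable once the interval assumption is accepted.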
1.4. Normative measurement
Normative measurement is measurement in which “subjects are placed in order relative to one another and assigned a standard score in terms of the population distribution” (Cattell, 1944).
The creation of intelligence tests at the beginning of the 20th century also marked the beginning of the use of normative data and standardization in psychological assessment. In psychological assessment the concept of inter-individual differences is of crucial importance: in social-science research, scientists try to understand differences among people or among groups. Variability is likewise fundamental to psychological measurement, since measurement is based on the assumption that psychological differences exist. According to Furr (2014), all research in psychology depends on the ability to measure inter-individual differences; he also states that “psychometric concepts as reliability and validity are entirely dependent on the ability to quantify differences among people”. By inspecting the differences between individuals and their variability, norms and models can be obtained that represent the whole population. Normative measures, then, are those that compare the raw scores of an individual to a theoretical score distribution of the population and thus assign individuals relative positions in the population. In other words, normative means relating to an ideal standard or model.
Normativity rests on a function called the normal distribution, defined by Pierre-Simon Laplace and refined by Carl Friedrich Gauss (Howell, 2013). In psychology it is assumed that the values of personality traits in the population can be described by this bell-shaped curve: every trait is distributed in the population such that most people have an average value of the trait, and the frequency of a value decreases as it moves further above or below the average. In other words, there is a certain population mean of each trait, the “norm”, simply because most people are close to it. Deviations in one direction or the other from this norm mean a higher or lower value of the trait than the normal value. Since the normal value occurs in most people, there will naturally be fewer people with values below or above the average, and the further a value lies from the average, the fewer people will have it. These characteristics are essential for inter-individual comparison.
Given how extensively the normal distribution is used, the past century can be considered a “normative paradigm” in psychological measurement. The advancement of psychological measures at the beginning of the 20th century facilitated the application of CTT, which operates almost exclusively on normative data. However, this is not the only reason why normative measures are still used. As the name indicates, the results of individuals can be compared to a norm. This is an extremely useful characteristic: thanks to the normalization of data, researchers can compare individuals with respect to the levels of their latent personality traits. Such comparisons are used, for example, in occupational psychology for the selection of suitable candidates based on their scores relative to others, or in clinical psychology to distinguish “normal” from “pathological” individuals.
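Comparing an individual to a norm reduces to a simple transformation of the raw score. In the sketch below, the norm-group mean and standard deviation are assumed values chosen for illustration, not taken from any real norms.

```python
# Hypothetical norm-group statistics for some trait scale (assumed values).
NORM_MEAN = 24.0
NORM_SD = 6.0

def z_score(raw):
    """Position of a raw score relative to the norm group, in SD units."""
    return (raw - NORM_MEAN) / NORM_SD

def t_score(raw):
    """Common reporting convention: mean 50, SD 10."""
    return 50 + 10 * z_score(raw)

print(z_score(33.0))  # 1.5 -> one and a half SDs above the norm
print(t_score(33.0))  # 65.0
```

The z-score (and derived scores such as the T-score) is only as trustworthy as the norm-group estimates of the mean and standard deviation, which motivates the sampling concerns discussed below.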
1.4.1. Disadvantages of normative measures
On the other hand, normative tests also have a number of drawbacks. Firstly, in order to apply normative measures, the “norms” must be established. After obtaining the raw scores of respondents on a test, it is necessary to transform these raw scores into normative scores (scores that can be placed on the normal distribution) and thus estimate the values of the measured traits relative to the values in the population (referred to as “parameters”) (Cattell, 1944). The problem is that estimating an individual’s position within the population requires knowledge of the values of the whole population. Obviously it is not possible to test every person in the world (or even in a country) to establish these norms; therefore, psychologists use “sampling” (Emmel, 2013). They select a representative sample from the population to estimate the population norms, and the results are then transformed onto the standard normal distribution (a process called standardization) (Geisinger, 2012).
There are several issues with sampling, the most difficult task being the selection of a truly representative sample. As Furr (2014) notes, entire books have been written on this issue, so it is beyond the scope of this study to cover it. It has to be pointed out, however, that samples are usually very limited, since there is neither enough time nor money to test huge numbers of people. For example, one of the most widely used personality questionnaires (NEO-FFI) was standardized on approximately 2000 people in the Czech Republic (Hřebíčková, 2011). This seems a considerably high number of respondents, but expressed as a proportion it represents data from only 0.05 % of the population of the Czech Republic. Obviously, it is very ambitious to describe a population based on data from 0.05 % of its members.
The selection of the sample matters because the norms can be distorted by the choice of respondents for the standardization. The norms must also be updated to remain accurate, since the true scores in the population may change over time. In short, the precision of normative measures is limited by the appropriateness of the sample.
1.4.2. Response bias
Another problem, connected directly to personality inventories, is response bias. It is defined as a “systematic tendency to answer to test items in a certain way, which interferes with the exact picture of self” (Paulhus, 2002). Since response bias negatively influences the psychometric properties of a test (mainly validity and reliability), psychologists have shown increasing interest in methods to eliminate it. Furr (2014) extensively describes a number of different response biases and proposes ways to reduce their effects.
One of the most problematic response biases is “social desirability”: people tend to answer test questions in a way they think is desirable, regardless of the truth (Kubička & Csémy, 1999). This is especially problematic in applied psychology, for example in selection procedures, where applicants can make very specific predictions about which personality traits or behaviours are desirable for the position they are applying for. The results can thus be distorted and will describe not the personality of the respondent but merely his opinion of what is expected of him. This is what Paulhus (2002) calls “impression management”. The other process leading to socially desirable responding, according to Paulhus, is self-deception, i.e. holding unrealistic views of oneself. A meta-analysis by Ones et al. (1999) showed that respondents, when asked to do so, can raise their scores on a normative scale by 0.5 to 1 standard deviation, and lower them by even more.
Another important response bias is “moderate responding”. According to Furr (2014), it is the tendency of respondents to answer around the average values; respondents avoid extreme answers (Baron, 1996). As a result the data cluster around the mean and are difficult to interpret correctly. Moderate results are usually of little value to the examiner, since average results on a personality test do not say much about one’s personality.
To counter response bias, Furr (2014) summarizes several techniques to reduce its effects: reducing situational factors that can elicit socially desirable responding, using a balanced scale of positively and negatively keyed items, using special validity scales, or using forced-choice items. The last option leads us to a different area of psychological measurement, namely ipsative measurement.
1.5. Ipsative measurement
Cattell (1944) used the term “ipsative” for scales whose “scale units were designated relative to other measurements on the person himself” (Latin ipse = he, himself). A more eloquent definition was provided by Hicks (1970), who states that “ipsative measurement yields scores such that each score for an individual is dependent on his own scores on other variables, but is independent of, and not comparable with, the scores of other individuals”5. Hicks (1970) also noted that it is typically tests using forced choices between scales (preferential-choice, paired-comparison and similar formats) or ranking of scales that result in ipsative measurement6. This is understandable, since the nature of ipsative measurement is to estimate the preferences of the individual (or the ordering of certain traits).
The beginnings of ipsative measurement date back to the creation of the first psychometric tools for measuring values (e.g., the Allport-Vernon-Lindzey Study of Values published in 1931) (Kopelman & Rovenpor, 2006). Ipsative measures became more popular in the 1950s, when a fierce discussion of their applicability and psychometric properties also began. Even then, psychometricians were aware of the limitations of ipsative data, and as a consequence recent researchers still refer to the papers of Cattell (1944), Guilford (1954), Clemans (1966) and Hicks (1970) when further investigating these limitations.
Some researchers gave up on ipsative data, because from the beginning it was obvious that their validity was dubious and their reliability untestable. Others consider ipsative data more realistic than normative data, because of the decision processes involved in the choices, namely choosing preferences (Tamir & Lunetta, 1977).
5 Nowadays there is a certain terminological chaos about what exactly “ipsative” means. For example, according to Paul Vogt (Vogt, 2011), a test is ipsative if its goal is to rank orders in such a way that no rank can be used twice. Such a measure would by definition yield the same means, medians and standard deviations.
6 The ipsativity of a test can be reduced in various ways; a list of such ways was proposed by Hicks (1970).
As Johnson et al. put it in their study “Spuriouser and spuriouser: the use of ipsative personality tests” (1988), the problems of ipsative tests were well documented, yet most textbooks on psychometrics ignore the topic. This was true in 1988 and is unfortunately still true in 2015. In fact, as the title indicates, Johnson et al. attempted to warn researchers against misuse of the ipsative format, since, as they noted, more and more personality tests were being built on an ipsative basis without their authors realising the dangers and limitations of ipsative measures. On the other hand, they did not dismiss the ipsative format as such; they merely stressed that ipsative data must be evaluated carefully and warned that such data cannot be evaluated like normative data.
1.5.1. Types of ipsative measures
Over the years, social scientists have proposed a number of typologies of ipsative measures.
Cattell (1944) first differentiated between simple ipsative, ratio ipsative, fractional ipsative
and normative ipsative measures, and he also proposed an ipsative normative category, which he
classed as a normative measure with ipsative elements. Because of terminological
inconsistencies, this typology is not used nowadays.
The most coherent differentiation was offered by Hicks, who recognised purely ipsative and
partially ipsative measures. Under purely ipsative he understood measures in which the sum
of scores is a constant. However, items are not purely ipsative if respondents only partially
order the item alternatives rather than ordering them completely (Waters, 1964). Even though Hicks
did not explicitly name this category of not-purely-ipsative measures, we might consider them
“semi-ipsative” (since their ipsativity is only reduced). Under partially ipsative, Hicks (1970)
understood measures that fulfil the less strict criterion of ipsativity proposed by Guilford (1952),
which regards as ipsative any measure in which a score elevation on one attribute causes a
score depression on a different attribute or attributes.
1.5.2. Problematic properties of ipsative measures
One of the most notable characteristics of ipsative measures is that they are supposed to reflect
only the relative strengths of traits within an individual. This is best described by Cornwell and
Manfredo (1994), who state that “ipsative scores and ipsative profiles of attributes can convey
distinctiveness among individuals, but are not measurements of quantity or degree of
attributes”. The relativity of ipsative scales shows itself in several ways. Firstly, in ipsative
measures respondents have to order (or rank) certain statements from the one they agree
with most to the one they agree with least. They do not, however, indicate how much they agree with the
statements. Secondly, the result of such measurement is a profile of preferences. It only shows
what the respondents prefer in comparison to some other variable, but does not indicate how
much they like it.
Next, probably the strongest ipsative property is that the scale scores for an individual always add up
to the same total (Johnson, Wood, & Blinkhorn, 1988). According to some researchers this is
actually the defining element of ipsativity. It means that all subjects have the same total score
summed across the scales. Because the sum of the scales in a test is a constant, any one scale score is
predictable from the remaining scale scores. As a result, there must mathematically be negative
inter-correlations among the scores. This forced negative dependence means that ipsative scales
are not independent and thus cannot be evaluated using the same psychometric methods as
normative scales (Cornwell & Dunlap, 1991).
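To make the constant-sum property concrete, the following sketch simulates fully ipsative scores (the respondent counts, scale counts and ranking scheme are invented purely for illustration) and shows that each scale score is exactly recoverable from the remaining ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 500 respondents rank 4 alternatives (one per scale)
# in each of 10 item sets; a scale's score is the sum of the ranks it received.
n_respondents, n_scales, n_sets = 500, 4, 10
ranks = np.stack([
    np.stack([rng.permutation(n_scales) + 1 for _ in range(n_sets)]).sum(axis=0)
    for _ in range(n_respondents)
])

# Every respondent's scale scores add up to the same constant ...
totals = ranks.sum(axis=1)
print(totals[:5])

# ... so any one scale is perfectly predictable from the remaining ones,
# which is why the scales cannot be independent.
reconstructed = totals - ranks[:, 1:].sum(axis=1)
print(np.array_equal(reconstructed, ranks[:, 0]))  # True
```

With 10 sets of ranks 1 to 4, every total equals 100, so the first scale is simply the constant minus the other three.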
A different problem connected with scores that sum to a constant is that individuals with
extremely high or extremely low latent trait values will end up with the same results. The resulting profiles
would be the same even if one respondent's true scores were at the high end of the distribution
and the other's at the low end. Baron (1996) attempted to show that this is not
true for all ipsative measures. She suggested that when using measures with a high number of
scales (30 and more), the probability of encountering a person with extremely high true
scores is less than one in a hundred million. Another argument, presented by Anna Brown
(2010), is that most people score around the mean, so extreme values are rare. Both
these arguments appear evasive, because even though they might be true, their practical
usefulness is questionable.
Another property of ipsative measures is that they might lead to unwanted or even untrue results
by forcing a respondent to choose between items. As Mead (2004) points out, if respondents are
forced to choose between certain statements, it can happen that they have very low true
scores on each trait, yet they are still forced to choose one (or more) that they supposedly agree with.
For example, if a respondent has to choose whether “agreeable” or “hard-working” describes him better, but
neither describes him at all, he still has to indicate one. The resulting answer will be
indistinguishable from the situation in which one statement described him very
well and the other not at all, and also from the situation in which both properties described him very
well.
Next, Mead described a confounding variable known as the item threshold, which is similar
to the concept of item difficulty, except that item thresholds can distort results severely in ipsative
measures. Put simply, some items measuring Extraversion require more of the latent trait in order to be
chosen, whereas other items require less of it. Therefore, if two items placed in the same set
of items have different thresholds, they will distort the resulting ranks.
Finally, because of all the above-mentioned properties, the range of statistical tools available
for evaluating ipsative measures is constrained. Generally, ipsative measures do not fulfil the basic
assumptions of CTT, and therefore it is problematic to estimate the reliability and validity of
ipsative measures.
The following points summarize the problematic properties of ipsative data according to
Johnson et al. (1988):
1. They cannot be used for comparing individuals on a scale-by-scale basis;
2. Correlations amongst ipsative scales cannot legitimately be factor analysed in the usual
way;
3. Reliabilities of ipsative tests overestimate the actual reliability of the scales;
4. The whole concept of error is problematic, and thus reliabilities are troublesome;
5. Validities of ipsative tests overestimate their utility;
6. Means, standard deviations and correlations derived from ipsative test scales are not
independent and cannot be interpreted and further utilized in the usual way.
1.5.3. Differences between normative measures and ipsative measures
The biggest difference between ipsative and normative measures is that normative measures
rank individuals within a group on a specific personality trait, whereas ipsative
measures rank specific abilities within an individual in terms of strengths and
weaknesses (Cornwell & Dunlap, 1991). While normative scores can be referenced against
population parameters, this is not true for ipsative scores (which can be referenced only against
other data from the same individual).
Next, Cornwell states that ipsative scales cannot substitute for normative scales. The reason is
that ipsative measures involve ranking the individual's abilities, but creating a list of
preferences carries no information about that individual's actual strengths and
weaknesses on the abilities measured. Normative data, on the other hand, do not compare the
abilities as such, but gather information on the absolute values of these abilities (Cornwell &
Dunlap, 1991). Furthermore, Cattell (1944) pointed out that ipsative scores and normative scores
are not interchangeable. Purely ipsative results cannot be transformed into normative scores and,
similarly, purely normative scores cannot be transformed into ipsative scores, since they exist in
different “universes”.7
Next, Closs (1996) stated that ipsative data cannot validly be used for inter-individual
comparisons. In this argument he assumes that, for inter-individual comparison, an individual's
raw score must be converted to percentiles, stanines or other standardised values. His study
showed, however, that ipsative results differed greatly from normative results after
standardisation. In this study he used the JIIG-CAL Occupational Interests Guide, which is both
an ipsative and a normative test widely used in the UK. The normative part consists of assigning
one of the values “agree, neutral, disagree” to the statements presented. The ipsative part consists
of indicating which of a pair of statements the respondent prefers. This test design allowed
Closs to compare ipsative and normative data directly. He found that the percentiles
obtained from the ipsative form were entirely different from those obtained from the normative form (these
results were also confirmed by Cornwell and Dunlap (1991) in their study). Closs also showed
that ipsative measures created negative correlations between scales, even though the normative
7 Beyond the scope of this work, it must be mentioned that recent researchers have attempted to estimate normative results from ipsative scores using Item Response Theory, extending Coombs' idea of unidimensional unfolding to a multidimensional model (McCloy, Heggestad, & Reeve, 2005)
scores were clearly positively correlated. Therefore he concluded that normative interpretation
should never be used with ipsative data.
Lastly, Hicks (1970) summed up the properties of ipsative data described statistically by Clemans
in his extensive paper (Clemans, 1966). Clemans' paper was later cited by many researchers, and his
findings have all been confirmed. Obviously, none of these properties apply to normative data:
1. The sums of the columns or rows of an ipsative covariance matrix must equal zero;
2. The sums of the columns and rows of an ipsative inter-correlation matrix will equal zero
if the ipsative variances are equal;
3. The average inter-correlations of ipsative variables have -1/(m – 1) as a limiting value
where m is the number of variables;
4. The sum of the covariances obtained between a criterion and a set of ipsative scores equals
zero;
5. The sum of ipsative validity coefficients will equal zero if the ipsative variances are
equal.
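Properties 1 and 3 can be checked numerically on simulated fully ipsative data. The sketch below is hypothetical (the sample size, number of scales and summed-rank scoring are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated fully ipsative data: 1000 respondents rank m = 5 alternatives
# in each of 12 item sets; scale scores are summed ranks, so every row of X
# adds up to the same constant.
n, m, sets = 1000, 5, 12
X = np.stack([
    np.stack([rng.permutation(m) + 1 for _ in range(sets)]).sum(axis=0)
    for _ in range(n)
]).astype(float)

cov = np.cov(X, rowvar=False)

# Property 1: rows (and columns) of the ipsative covariance matrix sum to zero.
print(np.round(cov.sum(axis=0), 10))

# Property 3: the average inter-correlation approaches -1/(m - 1).
corr = np.corrcoef(X, rowvar=False)
off_diag = corr[~np.eye(m, dtype=bool)]
print(off_diag.mean(), -1 / (m - 1))  # both close to -0.25
```

Because each row of X is constant, every scale covaries negatively with the rest by construction, which is exactly the artefact Clemans described.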
1.6. Psychometric properties of ipsative measures
1.6.1. Statistical methods applicable with ipsative data
Identifying statistical methods that can be used to assess ipsative measures is a very
difficult task. The first problem is that the term “ipsative” is defined very broadly and covers
several types of questionnaires or tests that collect data in very different ways. This also
causes a problem when different researchers conduct studies using different types of measures
(tests) and yet try to address the problems of ipsativity in general. It must be noted that, just as
with normative measures, some tests and methods are better than others, and negative attributes of a
test should not be attributed to the ipsativity of a measure without further consideration.
To begin with, the type of variable that can be obtained using ipsative measures cannot be
higher than interval. Most researchers, however, claim that ipsative measures can yield only
ordinal data. The properties of ordinal data do not allow central tendency estimates
such as means, and even the use of medians or modes is questionable. For example, Baron
(1996) states that ipsative data constitute only an ordinal level of measurement. She therefore
came to the conclusion that such data do not meet the criteria for standard parametric
analyses. It must be noted that Baron also claims that normative data are not true interval-level
scales either, since the difference between “agree” and “strongly agree” is not the same as between
“disagree” and “neither agree nor disagree”.
Even though Baron's argument is generally true, by summing the item results into scale scores,
data of more than an ordinal level are obtained. The total scores can be ordered and, what is more, they
also quantify distances between the scale averages. Therefore the total scores can be
averaged and their variance computed. On the other hand, it is true that the nature of these
total scores is questionable, since we are summing relative (not absolute) scores, and it is not clear whether
absolute scores can be obtained by summing relative ones. This particular issue was addressed
by Vries (2008), who also claims that summing scores in ipsative measures produces
uninterpretable test scores. He therefore proposed two alternative scoring methods, a weak and a
strict rank-preserving scoring method, both of which allow an ordinal interpretation of test scores.
Next, because ipsative data are relative, it is difficult to compare individuals’ scores. According
to Cornwell and Manfredo (1994), the only between-subjects comparison that can be used with
ipsative scored variables is to consider them as categorical. Therefore they proposed that for
example contingency table analysis can be used.
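As an illustration of this categorical approach, the sketch below runs a chi-square test of independence on hypothetical counts of respondents' most-preferred scale in two groups (all numbers are invented; a real analysis would derive the categories from actual ipsative profiles):

```python
import numpy as np

# Hypothetical counts of each respondent's most-preferred scale ("top rank")
# in two groups; treating ipsative profiles as categorical permits an
# ordinary chi-square test of independence.
#                     Scale A  Scale B  Scale C  Scale D
observed = np.array([[30,      10,      25,      15],    # group 1
                     [12,      28,      20,      20]])   # group 2

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()          # expected counts under independence

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi2 = {chi2:.2f}, df = {dof}")
```

The test only asks whether the distribution of preferred categories differs between groups, so none of the problematic constant-sum properties are involved.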
Furthermore, after considering the interdependencies of the scales of ipsative measures, one must
arrive at the same conclusion as Johnson et al.: “correlations of any sort between ipsative scales
are uninterpretable, because scales are mathematically interdependent”. Therefore, any
method that relies on correlations or on the analysis of correlation matrices is unacceptable and
unusable with ipsative data. In this way, Johnson et al. ruled out partial correlations, multiple
correlations, multiple regression, correlation-based reliability coefficients, discriminant
analysis, cluster analysis and factor analysis.
On the other hand, as for more complex methods such as factor analysis, according to Guilford
(1952), and supported also by Johnson et al. (1988), it should be possible to apply Q factor
analysis to ipsative data and obtain relevant results.
Next, Cornwell and Manfredo (1994) proposed that ipsative data can be analysed using
multinomial statistical techniques; more specifically, they used multinomial logistic regression
to regress four learning style categories from Kolb's Learning Style Inventory onto intelligence.
Barrett and Hammond (1996) used principal component decomposition as an alternative to
factor analysis. For analysing the correlations between normative and ipsative measures they
used multitrait-multimethod analysis, in a version developed especially for ipsative data.
It used nonmetric multidimensional scaling procedures, which tried to reconstruct the relative rank
order of inter-variate similarities. The result was, again, low correspondence between the two
test versions. Finally, they used a categorical correspondence/dual-scaling analysis procedure based on
a contingency table, similar to Cornwell and Dunlap (1991).
Recent researchers have become increasingly liberal in their use of statistical methods. In a study
by Geldhof et al. (2014), polyserial (and tetrachoric) correlations
and robust weighted least squares estimation were used. However, in this study the inter-correlations
between scales were not considered.
1.6.2. Untestable reliability of ipsative measures
Reliability is defined as the internal consistency of a test from one use to the next, expressed as
freedom from random measurement error (Vogt, 2011). Other researchers define
reliability as the reproducibility of measurements, or in other words, “the degree to
which a measure produces the same values when applied repeatedly to a person or process that
has not changed” (Shrout, 2012).
Four methods are most frequently used to estimate the reliability of a test:
the test-retest method, the parallel forms method, the split-half method and internal consistency.
The first three rely on comparing the results of two measurements taken under the same conditions,
and the fourth on analysing the relations between items. All of these methods use correlations,
especially Pearson's r. In the test-retest method the subjects complete a measurement and, after
a certain time, the measurement is repeated; test-retest is therefore also referred to as an estimate
of reliability over time. In the parallel forms method the subjects are tested for the same trait by
two equivalent tests. In the split-half method, subjects are tested with
one test divided into two equivalent halves (Urbánek, Denglerová, & Širuček, 2011).
As for internal consistency, it is the degree to which items hang together. Especially in multi-scale
measures, it is expected that items measuring one trait will correlate with the other items
measuring that trait and will not correlate with items measuring other traits. The idea
behind this is that items in a scale are replicate measures of the same construct. Pearson's r is
often used to estimate the relations between items, and Cronbach's alpha is the most widely used
estimate of overall internal consistency (Shrout, 2012). Since Cronbach's
alpha typically takes values between 0 and 1, it is generally agreed that values above 0.7 are evidence of a
reliable test. Among the several available guidelines, Kline (2000), for example, states that
reliabilities above 0.7 are accepted as reliable, above 0.8 as highly reliable and above 0.9 as
perfectly reliable. Scores lower than 0.7 are considered insufficiently reliable.
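For normative scales, Cronbach's alpha can be computed directly from raw item scores using the standard formula α = k/(k−1) · (1 − Σ item variances / total variance). The sketch below applies it to simulated Likert-style data (the item count, sample size and noise level are arbitrary illustration values):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical normative scale: 6 items all driven by a single latent trait,
# each with added measurement noise.
rng = np.random.default_rng(2)
trait = rng.normal(size=(300, 1))
items = trait + rng.normal(scale=0.8, size=(300, 6))
print(round(cronbach_alpha(items), 2))  # well above the 0.7 rule of thumb
```

For ipsative scales the same formula can be evaluated mechanically, but, as the following sections explain, the forced negative item correlations make the resulting coefficient uninterpretable.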
1.6.2.1. Problems with comparing measures in order to estimate reliability
Since reliability is a concept of CTT, and ipsative measures do not fulfil the basic assumptions
of CTT (see Chapter 1.2), it is difficult or even impossible to estimate. For the first three
methods of estimating reliability, the main argument is that the mean of an ipsative scale (and its
error) is uninterpretable (Hicks, 1970), and therefore statistics such as scale means, variances
of means and group comparisons by t-tests are meaningless (Johnson, Wood, &
Blinkhorn, 1988).
To explain: CTT supposes that there is a degree of random error in all test scores, and the purpose
of estimating the reliability of a test is to quantify this random error. Ipsative tests, by definition
and by their construction, do not have any random error as such. In addition, if there are k scales,
the score on any one scale can be calculated from the scores on the other k − 1 scales. Johnson et al. add
that all estimators of reliability share a common theoretical justification, and this justification
does not apply to ipsative tests; therefore the term reliability cannot be used in the sense in which it is
used for normative tests (and it cannot be estimated by the methods used for classical tests).
Moreover, Mead (2004) concluded that scores observed with ipsative measures contain the true and
error scores of all the other traits measured in the same set of items. This claim is a more radical
expression of the fact that the true score and error score of one item are highly dependent
on the true scores and errors of the other items in the same item set. However, the
mathematical relationships between the items have not yet been described.
Most authors claim that reliability cannot be measured for ipsative measures because of the
interdependencies of the scales. The reason is that reliability can be mathematically
described as “freedom from random error and is operationalised as the amount of shared
variance between two parallel measures” (Allen & Yen, 1979), as cited in Cornwell and Dunlap
(1991). The problem with ipsative data is that they do contain error, but the nature of this
error is unknown, because the interdependency of the scales causes the random errors of all items
to be mixed up. Furthermore, since items within an item set are interdependent, their
correlations with other items are distorted. Therefore the scale means and the scale correlations
are also interdependent.
In addition, most researchers have stated that, for the above-mentioned reasons, not even test-retest
reliability can be measured for ipsative measures (Cornwell & Dunlap, 1991; Johnson, Wood,
& Blinkhorn, 1988; Hammond & Barrett, 1996). However, if we consider reliability as the
reproducibility of measurement, none of the problematic properties of ipsative data can
influence the results. This type of analysis consists of correlating the result of an item (or the
total result of a scale) with itself across different measurements. From
a practical point of view, if the retest results are similar to the results of the first test, this should
be sufficient evidence of reliability. If respondents rank the items in the same way (or very
similarly) in each item set, then the test is reliable. Moreover, the closeness of the
test and retest results can be shown without calculation, for example using scatter plots. This
way, there are no statistical methods that could be distorted.
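The reproducibility view can be operationalised as a rank correlation between two administrations. The sketch below simulates a stable trait measured twice (the sample size and noise level are invented for illustration) and computes Spearman's rho as the Pearson correlation of the ranks:

```python
import numpy as np

# Hypothetical test-retest check: if respondents order the alternatives
# similarly on both occasions, scale totals should agree in rank.
rng = np.random.default_rng(4)
test = rng.normal(size=200)
retest = test + rng.normal(scale=0.3, size=200)  # similar but not identical

def rank(x):
    # simple ranking (no tie handling needed for continuous simulated data)
    return x.argsort().argsort().astype(float)

# Spearman correlation = Pearson correlation of the ranks
rho = np.corrcoef(rank(test), rank(retest))[0, 1]
print(round(rho, 2))
```

A high rank correlation (or a tight diagonal band in a test-retest scatter plot) would indicate reproducible measurement without invoking any of the CTT assumptions that ipsative data violate.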
1.6.2.2. Problems with estimating internal consistency
Estimating reliability through internal consistency is more complicated for ipsative measures.
It must first be noted that some studies claim that ipsative measures by nature yield higher
reliabilities than normative measures (Cornwell & Dunlap, 1991), while other studies report lower
reliabilities for ipsative measures (Baron, 1996). This inconsistency is probably caused by the
different methods used for estimating reliability. The key concept to consider is that the internal
consistencies of the scales of any ipsative measure are necessarily interdependent. As explained by
Tenopyr (1988), by assigning a high rank to one item, the respondent immediately deprives the
other items of a high rank. Since items are grouped into item sets, this means that items within one
set must be negatively correlated with each other. Johnson et al. (1988) sum it up: “Any
consistency within one scale automatically creates consistency in some or all other scales”.
This must result in elevated reliability coefficients, especially in the case of internal consistency.
Also, since the items of each scale correlate negatively with items from other scales, there
is a higher probability that they will correlate positively only with items from the same scale.
Artificial reliability within scales is thereby created.
Since the scales are interdependent, the reliability estimated for one scale must necessarily influence
the reliabilities of the other scales. This was most clearly demonstrated by Tenopyr (1988),
who created several tests into which one perfectly reliable scale (r = 1) was introduced. The
finding was that the more items this scale contained, the higher the reliabilities observed in the
other scales. Even though this study was conducted using forced-choice dyads, the
results apply generally. To explain: if a scale has a very high or very low internal consistency (1
or 0), the other scales will be influenced by it and will appear to have higher (or lower)
reliabilities than they really have. In addition, Bartram (1996) showed that, compared to
normative data, ipsative scale reliabilities decrease with a decreasing number of scales and
also with increasing correlation between the normative scales.
Secondly, the inherent negative correlations between the scales can be estimated using the formula
-1/(m - 1), where m represents the number of scales (Clemans, 1966). This formula implies
that if there are 4 scales, the inter-correlations between them will converge to -1/3.8 It applies
to measures in which an entire set of rank orders is assigned as well as to those in which only the largest and
smallest are assigned (Hicks, 1970).9 It is clear from this formula that the more scales there
are, the closer the inter-correlations will tend to 0. This can create the illusion of independence of the
scales. On the other hand, according to the formula, positive correlations among the scales
will never occur.
Because the above-mentioned problems are evident, researchers have proposed several methods to
make reliability analysis of ipsative measures possible. For example, Clemans (1966) and
Johnson et al. (1988) suggested that after deleting one or more scales, the data become
less interdependent, so that analyses using CTT-based methods become possible.
However, Johnson et al. warned that the remaining scales would still be at least partially
interdependent.
Others, such as Baron (1996), propose that ipsative measures should be used with a large number of
scales (more than 30) in order to achieve low inter-correlations between the scales. Under
these conditions reliability can be analysed and will give satisfactory results. However, a test
8 The mathematical estimate was empirically confirmed by Hicks (Hicks, 1970), who compared the obtained average inter-correlations of 4 ipsative measures with the expected inter-correlations.
9 For the purpose of my study a different setting was prepared and thus this formula should not apply
constructed of 30 scales would not be practical: either it would have to be extremely long, or
its scales would consist of only a few items, which could compromise the validity of the results.
Having said that, Baron was an advocate of ipsative scales, so it is not surprising
that her studies found ipsative measures to have only slightly lower reliability than normative
scales (when there is a large number of scales). She is also very optimistic about the reliability of
ipsative scores; in particular, she points out that a number of studies showed high correlations
with an external criterion (Borkowski, 1989; Gibbons, 1995; Gordon, 1976). This statement
is rather surprising, since it conflicts with the nature of ipsative data described above,
especially the fact that correlations cannot be meaningfully interpreted.
To sum up, it is not possible to estimate the reliability of ipsative measures using conventional
statistical methods. The reliabilities will always be influenced by the artificial (negative)
correlations between the scales, and the results will not be interpretable.
1.6.3. Factor analysis
Factor analysis is a statistical method used to reduce the number of variables by arranging them
into factors based on their inter-correlations. This method is often used in psychology, especially
in test construction. The main idea of factor analysis is to explain the joint variation of observed
variables by unobserved latent variables. The observed variables are modelled as linear combinations of the
potential factors plus “error” terms, and the factors are derived from information about the
interdependencies between the observed variables. One of the basic assumptions of factor analysis
is therefore that the “error” terms are independently distributed.
There are several types of factor analysis. They can be broadly divided into exploratory factor
analysis and confirmatory factor analysis. As the names indicate, the former is used to identify the
relationships among items and to group items that measure the same concept under one factor,
while the latter is used to test the researcher's hypothesis that particular items are associated with
specific factors.
Another typology is based on the variable being reduced. In this sense, R-factor analysis
is used most often. R-factor analysis attempts to explain the whole by reducing it to
components, assuming that the whole is equal to the sum of its components plus
error. It typically uses principal component analysis (PCA) as a method of factor extraction, in which
factor weights are computed so as to extract the maximum possible variance (Gabor,
2013).
By contrast, Q-factor analysis is a method for determining the dimensions or patterns that exist
within responses and other data from the respondents. In other words, it is the analysis of profile
types, identifying groups of people using by-person factor analysis (Ramlo & Newman,
2010). Compared to R-factor analysis, Q-factor analysis works not with a representative
population sample but with a representative sample of opinions. According to Gabor (2013),
this type of factor analysis is both inter- and intra-personal. Furthermore, Q-factor analysis uses
Thurstone's centroid method.
The questions that emerge in connection with factor analysis are, firstly, whether there is any
valid way to apply it to ipsative measures, and secondly, whether there is any purpose in doing so.
Regarding the second question: ipsative measures are built on ranking items within item sets,
and their construction requires grouping items from the various scales into those sets.
Each item must therefore represent a factor, and for this reason the factors must be known before
the test is created, so that the items can be grouped accordingly. The only purpose factor analysis can
then serve is to confirm a factor structure that is already known (or at least presumed).
The application of factor analysis to ipsative measures is not a new topic. Ever since ipsative
measures have been used, there has been a fierce debate about whether ipsative data can be factor
analysed with valid results. In his book, Guilford noted that “R technique factor analysis
calls for normative data” (1954), as cited in Johnson, Wood, and Blinkhorn (1988). The reason
for this is the relative nature of ipsative data described above. Johnson et al. support this
statement with the following argument.
In factor analysis, the only relationships between scales should be those reflecting the existence
of common factors. Ipsative scales, however, are not independent by nature, because their
scale scores add up to the same total every time. The spurious correlations
existing between the scales of ipsative measurements break down the factor analysis because of these
built-in dependencies, and the results are “degenerate and illegal” (Johnson, Wood, &
Blinkhorn, 1988).
As they demonstrate, the basic R factor analysis model can be written as

X_ij = Σ(k=1..m) γ_ik · f_k + ε_ij ,

where the γ_ik are the factor loadings and the ε_ij are the specific factors or residuals. The
ε_ij are assumed to be independent of all other ε and of the f_k. Because ipsative scores always add
up to a constant, the sum of the X_ij over the scales has the same value for all respondents, and
therefore whatever errors are present must be correlated (Johnson, Wood, & Blinkhorn, 1988).
On the other hand, in Guilford's opinion, Q technique factor analysis can be used on the inter-correlations
of ipsative data. Johnson et al. disagreed, stating that Q factor analysis could be
an option only with very weakly ipsative data, and even then it is not certain that the results
would be reliable.
Another suggestion intended to enable factor analysis of ipsative measures is to increase the
number of scales, which would decrease their interdependencies (Loo, 1999). In
1991, Saville (Saville & Willson, 1991) tried to show that an ipsative measure with more than
30 scales had inter-correlations close to 0, and conducted factor analysis on it. Even though the
results seemed promising, the study was severely criticised for methodological flaws and its
results were not replicated. As a direct reaction to this paper, Cornwell and Dunlap (1991)
published a study in which they refuted and empirically disproved all of Saville's claims,
showing that factor analysis of ipsative data suffers from imposed multicollinearity.
It is noteworthy that certain test publishers would not give up on their ipsative measures and
have tried to find ways of reducing the problematic properties arising from ipsativity. As
an example of this10, the PhD thesis of Anna Brown (2010) stated that the problems of ipsative
data can be overcome using a newer approach, item response theory (“IRT”). In her thesis she
empirically confirmed that it is possible to meaningfully estimate the reliability of ipsative data
within IRT. Furthermore, she conducted factor analysis on ipsative measures, and her attempt
appears to have succeeded. It is therefore possible to suggest that ipsative measures should be
analysed using the IRT approach.
10 Incidentally, Anna Brown worked at SHL Group, which developed the ipsative vocational inventory OPQ32 and OPQ32i.
1.6.4. Cluster analysis
Cluster analysis is an exploratory data analysis tool designed to group similar individuals into
clusters. Like factor analysis, it examines the full complement of inter-relationships between
variables, in order to maximise the dissimilarity between clusters. The clusters are defined
through analysis of the data, mainly multivariate analyses. Cluster analysis does not serve to
interpret the groups, nor to estimate an underlying common trait. It only creates groups of
individuals similar to each other, but dissimilar to individuals in other groups.
It differs from factor analysis in several respects. Firstly, while factor analysis reduces variables
by grouping them into a smaller number of factors, cluster analysis reduces the number of cases
by grouping them into fewer clusters. It is therefore said to be the obverse of factor analysis
(Burns & Burns, 2009). Usually, cluster analysis is conducted in two steps: first the clusters are
identified using one of the numerous available methods, and then the cases are allocated to a
particular cluster (Romesburg, 2004). However, it is also possible to conduct cluster analysis
on the variables rather than the respondents, in which case clusters of variables are created.
The first step of cluster analysis is usually conducted through hierarchical cluster analysis,
which estimates the clusters using distances between data points. The distance can be measured
in a number of ways, of which the squared Euclidean distance (measuring distances in
multidimensional space) is the most common; Ward’s method (which minimises within-cluster
variance) and k-means clustering are among the most frequently used clustering procedures
(Romesburg, 2004).
The nature of cluster analysis allows higher-order groups to be estimated without complicated
statistical methods such as those used in factor analysis. Its applicability to ipsative measures
will therefore be empirically tested in this study.
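To illustrate the second (allocation) step mentioned above, the following pure-Python sketch implements a minimal k-means procedure on artificial two-dimensional score profiles. This is my own illustration of the general technique, not the exact procedure used later in the thesis.

```python
import math
import random

random.seed(0)

def kmeans(points, k, iters=20):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, cl in enumerate(clusters):
            if cl:
                centroids[c] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return clusters, centroids

# Two artificial groups of "respondents" with low vs. high score profiles.
group_a = [(random.gauss(1, 0.2), random.gauss(1, 0.2)) for _ in range(20)]
group_b = [(random.gauss(4, 0.2), random.gauss(4, 0.2)) for _ in range(20)]
clusters, centroids = kmeans(group_a + group_b, k=2)
print([len(c) for c in clusters])
```

With well-separated groups such as these, the two recovered clusters correspond closely to the two generating groups.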
1.7. Advantages of ipsative measures
1.7.1. Reduction of Response Bias
One of the main advantages of ipsative measures is that they prevent respondents from faking,
thus decreasing the distortion of results caused by social desirability11. As McCloy et al. (2005)
stated, two characteristics make ipsative measures resistant to systematic faking. Firstly, their
format prevents respondents from providing high ratings on all constructs. Secondly, it is
possible to group the items in such a way that all items within a set are equally desirable. The
ranking of items also prevents respondents from obtaining socially desirable results.
This effect of ipsative measures has been corroborated by several researchers, such as Jackson,
Wroblewski & Ashton (2000), White & Yong (1986), Wright & Miederhoff (1999), Chen et al.
(2008) and others. It must be noted that the forced-choice format does not eliminate social
desirability, but decreases its effect considerably (Jackson, Wroblewski, & Ashton, 2000).
McCloy et al. (2005) are well aware of the limitations of reducing social desirability in ipsative
data. They note that even though it is highly unlikely for respondents to fake within sets of
items, they can fake the total scores. That is, someone who wants to achieve a high score on
one desired scale can rank the items belonging to that scale highly in every item set. On the
other hand, to do so he would need to correctly identify which items belong to which scale.
As always, there is another side to the coin: a number of studies did not find evidence that
ipsative measures reduce social desirability. As Furnham et al. cited Anastasi (Furnham, Steele,
& Pendleton, 1993), “it appears that the forced-choice technique has not proved as effective as
had been anticipated in controlling faking or social desirability”. Also, Hammond & Barrett
(1996) point out that ipsative measures can reduce response bias only if all items in an item set
have the same average desirability. If this is not true and some items are more desirable than
others, the test will produce even worse artefactual distortions than normative tests by building
response bias into itself.
11 It must be noted that McCloy connected the reduction of social desirability to the forced-choice format. However, this format is closely tied to ipsative data.
1.7.2. Moderate responding
Another positive property of ipsative measures is that they partially eliminate the problem of
moderate responding. According to Tamir & Lunetta (1977), ipsative measures have a higher
discriminability value than normative tests, which means that they emphasise the true scores of
the measured traits. This notion is supported by Baron (1996), who states that the forced-choice
format generates higher differentiation, because people are forced to choose between items and
cannot give two items the same rating. The argument behind the higher differentiation is that
in ipsative measures people cannot avoid extreme values, since in each item set they must assign
the highest and the lowest rank to some item. At the same time, the construction of ranking
itself prevents respondents from assigning only extreme values. The resulting points will
therefore be distributed over the whole range available (within an item set).
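The contrast can be illustrated with a toy example (my own, not taken from the cited studies): on a rating scale a moderate responder can give every item the midpoint, producing zero spread, whereas any admissible ranking of five items necessarily spans the full range of ranks.

```python
from itertools import permutations

# On a normative 0-4 Likert scale, nothing stops a respondent from rating
# every item with the midpoint "2" -- the answers carry no differentiation.
likert_answers = [2, 2, 2, 2, 2]
print(max(likert_answers) - min(likert_answers))  # 0: no spread at all

# In a forced ranking of 5 items, every admissible answer is a permutation
# of the ranks 1..5, so the spread is always the full range (5 - 1 = 4).
assert all(max(p) - min(p) == 4 for p in permutations(range(1, 6)))
```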
1.7.3. Decision making in responding to normative vs. ipsative
questionnaires
The decision processes in ipsative measures are somewhat different from those in normative
ones. The greatest difference is that normative questionnaires offer no reference points, whereas
in ipsative measures the other items serve this purpose. According to Kahneman (2012),
reference points are extremely important in decision making. In his book “Thinking, Fast and
Slow” he stated that human decisions are highly dependent on references: when making
decisions, people need some virtual point they can use for comparison (Kahneman refers to it
as a reference point). Without reference points the world would be confusing and chaotic.
Another issue is the cognitive processing that goes on while filling out a questionnaire. Meade
(2004) is rather sceptical, stating that the decision procedures in ipsative measures are not fully
understood. We only know that ipsative measures impose more cognitive complexity on the
respondent. As a result, it is harder to fake results, but it may also be harder for respondents to
correctly decipher the meanings of the statements and rank them according to the actual levels
of their latent traits. The increased mental strain can also demotivate respondents in longer and
more complicated questionnaires.
Undoubtedly, the decision procedure in ipsative tests is more difficult than in normative tests.
This can, however, be considered an advantage, since respondents must think about their
answers. Furthermore, the items are not considered individually, but together with the other
items of an item set. Filling out an ipsative measure thus demands more motivation from the
respondent, but on the other hand the results should better reflect the true score in terms of
ranks.
1.7.4. Summary of advantages and disadvantages of ipsative measures
The most frequently mentioned advantages of ipsative measures include the following:
1. Ipsative measures have higher discriminability value than normative tests;
2. They are said to be resistant to social desirability and respondents can alter their results
less than in normative tests;
3. Ipsative measures seem to be resistant to “moderate responding”;
4. Ipsative measures might better reflect choices people make in real life, since they cannot
choose all the possibilities, but are forced to choose only some;
5. Hicks (1970) suggested that in some circumstances ipsativity may increase validity – if
it reduces response bias.
On the other hand, the greatest disadvantages of ipsative measures are that:
1. the results are only relative values, which means that in general they cannot be compared
with results from other individuals;
2. it is generally not advisable to calculate means from ipsative scores, since they are
ordinal data, which are furthermore inter-dependent, so it is not clear what the means
would show;
3. because of the inter-dependency, neither the variance nor the standard deviation can be
estimated;
4. correlations and correlation-based analyses cannot be used with ipsative data – thus
psychometric assessment tools such as reliability estimates, factor analysis, t-tests etc.
are unusable;
5. ipsative measures do not allow respondents to reach high levels on several traits at once,
therefore respondents with high true scores on all scales will obtain distorted results;
6. ipsative measures can be cognitively challenging for respondents, and therefore more
motivation is needed to finish them adequately.
1.7.5. Applicability and use of ipsative measures
From the beginning of their use until now, ipsative measures have mostly been used in
counselling psychology as tools to determine vocational preferences. They help psychologists
determine which career field would be appropriate for the respondent.
Obviously, the question is whether ipsative measures provide more valid estimates. Several
studies have compared the usability of ipsative and normative measures in personnel selection.
As Meade (2004) showed, the choice of test form can strongly influence the results. He
concluded that ipsative measures could be useful in personnel selection, especially for creating
desired personality profiles for certain positions, which could then be compared with the
profiles of applicants. However, so far there are no conclusive results on which type of measure
is more valid.
Furthermore, ipsative measures might be useful in other fields as well. According to Tamir &
Lunetta (1977), ipsative measures produced more valid results than normative ones in their
research on people’s cognitive preferences. Similarly, Frederick and Foster (1991) proposed
and empirically confirmed that ipsative measures could be included in cognitive tests in order
to detect malingering and non-compliance.
As for personality traits, ipsative measures should be used only to create personality profiles.
According to Johnson et al. (1988), these profiles can be compared with the profiles of other
people (even though a common metric is not present). Even though there is a number of ipsative
personality tests, it is difficult to establish whether they measure what they claim to measure.
1.8. Summary of ipsative measurements
After analysing the most important studies and scientific publications concerning ipsative data,
it appears that more scientists are against the use of ipsative measures than in favour of them.
Studies such as Cornwell & Dunlap (1991), Johnson et al. (1988) and Meade (2004) have
provided very strong arguments against using any type of CTT analysis on data from ipsative
measures, leaving advocates of ipsative data with little to defend. Currently, it is mainly the
producers of ipsative tests who try to repair the reputation of ipsative measures. In fact, most
psychometricians would not recommend the use of ipsative measures, and some would
eliminate them entirely: in a study published under the British Psychological Society, Hammond
& Barrett (1996) concluded that “there seems little point in continuing with the development
or use of ipsative tests. At best they approximate normative data, at worst, they distort and
change completely the psychological import of trait scores and their interpretation”.
Having said that, this study was conducted to examine whether the properties described by
previous researchers also apply to a hybrid semi-ipsative measure, designed to possess the
positive attributes of both normative and ipsative measures. Since the well-known NEO-FFI
inventory was used to create this hybrid test, the original version is briefly described in the
following chapter.
1.9. NEO personality inventory
1.9.1. History of Big Five Model and NEO inventory
The NEO Personality Inventory is one of the most used and most researched personality
inventories in psychology. It is based on the Big Five model, which became one of the most
influential models in personality psychology in the past century. It was first advanced in 1961
by Ernest Tupes and Raymond Christal, but was recognised by the academic audience only in
the late 1980s.
The Big Five personality traits are five dimensions used to describe human personality. The
theory based on the Big Five factors is called the Five Factor Model (FFM). The five factors
are Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism, also referred
to as OCEAN. It is assumed that these five domains subsume most known personality traits
(Hřebíčková, 2011).
The strength of the FFM lies in the fact that it was arrived at independently by several
researchers (Digman, 1990). The origins of the Five Factor Model date back to the work of
Galton, who hypothesised that by sampling language it is possible to derive a comprehensive
taxonomy of personality traits. Later, Gordon Allport followed up on Galton’s theory and
extracted 4,504 adjectives describing personality from dictionaries. In the 1940s, Raymond
Cattell took these adjectives and eliminated synonyms, reducing the total to 171; on this basis
he constructed the 16 PF personality test. The adjectives were then factor analysed, and the
researchers who followed up arrived at 5 factors, best described by OCEAN as stated above.
Thanks to this empirical support, the FFM traits are assumed to subsume most known
personality traits and to represent the basic structure behind all of them (Hřebíčková, 2011).
Furthermore, the validity of the theory is empirically corroborated: all research teams, from the
early work of Ernest Tupes and Raymond Christal, through Lewis Goldberg and Cattell, to the
latest work of Costa and McCrae, used different methods but arrived at results that were highly
inter-correlated.
Costa and McCrae conducted an analysis of items from Cattell’s 16 PF personality inventory,
in which they identified three traits, namely Neuroticism, Extraversion and Openness. These
items became the basis for the construction of the NEO inventories, hence the name NEO.
However, later analyses and further research suggested that there are in fact five basal
personality traits, so they added two scales. In addition, lexical analyses in other languages
(notably Czech) also resulted in the same 5 factors (Hřebíčková, 1997).
The first version of the NEO personality inventory was published in 1978 as the Neuroticism-
Extraversion-Openness Inventory. Since Costa and McCrae added two more factors, the test
was modified, and the revised version was named the NEO Personality Inventory. It consists
of 180 questions intended to measure the Big Five personality traits and was developed for use
with adults without overt psychopathology. The first three scales, N, E and O, each consist of
48 items divided into 6 subscales. The two added traits, Agreeableness and Conscientiousness,
contain 18 items each and have no subscales. The currently most used version is the revised
NEO PI (NEO PI-R), which contains 240 items, 48 per scale, with every scale divided into 6
subscales. Because of its length, it is used when an in-depth analysis of the personality profile
is necessary (Hřebíčková, 2011).
Later, a shorter version of the NEO PI-R was created, called the NEO Five-Factor Inventory
(NEO FFI). It uses 60 items, i.e. 12 items per domain. Because of its relatively fast
administration (10-15 minutes), it has gained popularity in recent years. It also provides
information about the five personality traits and can be used in research, in clinics, in
counselling and in occupational psychology.
1.9.2. Description of the Big Five personality traits
Neuroticism is a measure that differentiates between emotionally stable and unstable
individuals. Generally, individuals high in this trait tend to experience negative emotions such
as anger, anxiety or depression. They also show lower tolerance for stress and aversive stimuli
and tend to be emotionally reactive. Individuals with low Neuroticism scores can be described
as calm and emotionally stable, and they experience fewer negative feelings.
Extraversion measures the extent to which an individual is sociable, self-confident, active,
energetic, happy and optimistic. Individuals with high Extraversion scores enjoy interacting
with people and are often full of energy. Individuals with low scores are closed, reserved,
independent and self-reliant; these individuals are referred to as introverts.
Openness to experience is a dimension contrasting rigid with original, non-conventional ways
of behaving and thinking. Individuals high in this trait are endowed with a lively imagination
and are generally more sensitive to aesthetic stimuli. On the other hand, individuals with low
Openness tend to behave in conventional ways and are rather rigid in their beliefs; they prefer
the simple and plain over the complex and ambiguous. Hřebíčková (2011) adds that they can
be inhibited in their emotional reactions.
Agreeableness is the dimension that best characterises interpersonal behaviour. People with
high scores on this scale are more altruistic and better able to understand and trust other people.
They are generally described as kind, trusting, trustworthy and helpful, and possess an
optimistic view of human nature. Individuals with low scores are generally unfriendly and
egocentric, meaning that the well-being of others is of little interest to them.
Conscientiousness is the trait connected to one’s approach to organisation and work. High
scores are connected to determination and assertiveness, but above all to hard, systematic work.
Characteristics such as strong will, discipline, reliability, exactness and precision describe
conscientious individuals well. The other pole of the scale is characterised by an indifferent
attitude, little interest, chaotic organisation and low willpower. Low scores on
Conscientiousness can also account for more spontaneous behaviour (Hřebíčková, 2011).
1.9.3. Psychometric properties of NEO-FFI
The NEO FFI inventory consists of 60 items, for each of which the respondent indicates how
well the given statement describes him. The test is thus constructed using a 5-point Likert scale
ranging from 0 to 4.
Both the NEO PI-R and the NEO FFI are reported to have very high reliability. According to
Hřebíčková (2011), the NEO PI-R was standardised on a sample of 2,296 respondents. The
internal consistency of the NEO PI-R was high on every scale: N = 0.92; E = 0.89; O = 0.87;
A = 0.86; C = 0.90. The internal consistencies reported for the NEO FFI were also relatively
high, but lower than for the NEO PI-R: N = 0.86; E = 0.77; O = 0.73; A = 0.68; C = 0.81. Even
though the literature in most cases supports the internal consistencies stated in the manual,
different studies reported slightly different reliabilities (those for the tests published in the
Czech Republic were almost the same as mentioned above).
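The internal-consistency coefficients quoted above are Cronbach’s alpha values; a minimal pure-Python sketch of how such a coefficient is computed (with fabricated item scores, not thesis data) looks like this:

```python
def variance(xs):
    # Sample variance with (n - 1) in the denominator.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one inner list of scores per item, aligned across respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Three items answered by five respondents on a 0-4 scale (fabricated data).
items = [
    [4, 3, 3, 1, 0],
    [4, 4, 3, 2, 1],
    [3, 4, 2, 1, 0],
]
print(round(cronbach_alpha(items), 3))  # 0.969: high alpha for aligned items
```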
The retest reliability of the NEO PI-R is also satisfactory. A group of respondents was retested
after 6 years using the NEO PI-R and the following reliabilities were estimated: N = 0.83; E =
0.82; O = 0.63; C = 0.79. This result shows not only that the test measures reliably and
precisely, but also that the traits measured are stable over time and change little after a certain
age.
Kurtz and Parrish (2010) measured the short-term test-retest reliability, and the results were
satisfactory, ranging from 0.91 to 0.93 over a one-week interval between tests. Terracciano et
al. (2006) showed that the test also has high long-term reliability: measured over a 10-year
interval, the coefficients ranged from 0.78 to 0.85.
The manual to the NEO FFI also describes the validity of the NEO PI-R, with a focus on
convergent, discriminant and criterion validity. Other widely used personality tests were used
for comparison, such as the Myers-Briggs Type Indicator and Holland’s Self-Directed Search.
According to Conrad (2006), Conscientiousness predicted the GPA of college students over
and above SAT scores alone. Neuroticism was correlated with emotional exhaustion in a
Spanish study of teachers conducted by Cano-Garcia et al. (2005).
Many other studies confirm the high reliability and validity of the NEO Personality Inventory.
2. Hypothesis
The aim of this study is to find out whether the form of a personality test influences the results
of the respondents. In order to answer this question, I have prepared two versions of the same
test (NEO-FFI), which consist of the same questions but are different in the way they are filled
out. Namely, the first version of the test is normative (hereinafter referred to as “Form A”) and
the second version of the test is ipsative (hereinafter referred to as “Form B”).
Research Question: Does the form of a personality inventory influence the results of
respondents? And if so, which form offers more advantages?
1) The basic hypothesis was that the results acquired by Form A will not differ significantly
from those obtained from Form B.
This includes the hypotheses that:
a) The results from Form A and Form B will show a linear relation;
b) The results from Form A and Form B will be highly correlated;
c) Form A and Form B will detect the same level of measured trait in individuals;
d) The items used in Form A and Form B will give similar results (highly correlated);
e) The order (rank) of the Big Five traits within each individual will be the same in Form A
as in Form B.
2) The second hypothesis was that Form B will eliminate some of the disadvantages connected
to Form A, namely:
a) Form B will show distinctly less response bias, especially moderate responding;
b) Form B will eliminate the influence of Social Desirability.
The goal of this study was also to compare the psychometric properties of Form A and Form
B, especially in respect to reliability and validity of these measures.
3. Method
3.1. Administration
As for the administration method, an electronic form was chosen and the tests were
administered via the internet.
Both versions of the questionnaire were completely anonymous. The respondents were
requested to fill out only basic information such as email, gender, age and occupation (the last
field being optional). The email address was necessary for the re-test: since no other personal
information was requested, respondents could be contacted to fill out the questionnaire a second
time only through the provided email address.
The normative version of the NEO-FFI (referred to as “Form A”) was created using Google
Documents in the form of an online questionnaire. The form contains a short introduction with
basic instructions on how to fill it out. The questionnaire is self-explanatory and can be
administered without any further explanation; therefore no errors in the process of filling it out
were expected.
As for the ipsative version, an interactive Microsoft Excel worksheet was created first, since
no available online form service supported ipsative measurement. The worksheet consisted of
an introduction page, where the respondents were informed that a re-test would take place
within 3 months and were requested to fill out the basic information mentioned above. After
filling out the information, they could click a button labelled “Spustiť test” (“start the test”) and
were automatically taken to the second sheet.
The second sheet contained the test itself – 60 statements (items) arranged into 12 groups of 5
items each. This sheet was interactive and included control scripts that would not allow the
respondents to fill in the questionnaire incorrectly. As described below, respondents could not
assign more than 5 points to a statement, and the total points assigned within one group could
not exceed 10. The control worked as follows: at first only one group of items was visible, the
others being hidden (conditional formatting set both the background colour and the text colour
to yellow). Only after points were correctly assigned to the first group of items did the second
group appear (once the sum in the first group reached 10, conditional formatting changed the
background from yellow to white and the text from yellow to black). This mechanism worked
through the whole sheet, ensuring that the respondents filled it out correctly.
This control mechanism was needed, since the method of filling out the ipsative questionnaire
was not entirely straightforward. All instructions were provided in a textbox beneath the test;
however, it was expected that most respondents would not take the time to read them properly
(if they read them at all).
After filling out the test, the respondents could click the “Vyhodnoť test” (“evaluate test”)
button, which took them to the third sheet. The third sheet showed a bar chart displaying their
instantly calculated results, with a text box on the right-hand side containing a basic description
of each of the five traits measured by the NEO-FFI (and presumably by my test).
Form B was, however, not very practical: in order to fill it out, the respondent first had to
download the worksheet, save it after filling it out, and then send it back to the email address
provided in the instructions. All the instructions were stated in several places within the
worksheet as well as in the e-mail or post by which respondents were approached. The problem,
however, was that people did not want to download the worksheet, mainly for security reasons.
Therefore, a website was programmed which consisted of the same elements as the Excel
worksheet; in particular, the four Excel sheets were rewritten as web pages. It included the same
control mechanism to ensure that the questionnaire was filled out correctly, and it is very user-
friendly and just as anonymous as the Excel sheet. Unfortunately, it was not possible to set up
a functioning interactive graph to display the results, so the results were sent individually to
each respondent after they filled out the form.
3.2. Creating an ipsative version of normative NEO-FFI
For the experiment, two versions of the NEO-FFI test were prepared. Form A was the same as
the NEO-FFI, except that all the negative questions were reformulated as positive ones. To
prepare an ipsative form of the NEO-FFI, it was necessary to create groups of items (taken from
the NEO-FFI). As a result, in Form A people could assign points to every question on a 5-point
Likert scale, whereas in the ipsative version of the test (Form B) people could distribute 10
points among 5 items.
3.2.1. Grouping the items
Ipsative tests consist of statements (items) organised into questions (groups of items)12. In each
question, a respondent can either order the items based on how well they describe him, or assign
points to the items (as described in the introduction). One criterion of grouping is that each item
set should contain items from all traits being measured (in this case Neuroticism, Extraversion,
Open-mindedness (or Openness), Agreeableness and Conscientiousness).
In most ipsative tests, items are organised into item sets based on some unifying variable. For
example, in Belbin’s Team Roles Self-Perception Inventory each item set contains an
introductory sentence describing a certain situation, followed by 10 items (each representing
one trait, or more precisely, one team role). This introductory sentence provides a logical
background for the respondent, who can better conceptualise the meaning of the items, imagine
himself in the concrete position, and more easily assign a value representing his subjective
opinion, based on a recollection connected to the specific situation presented (Belbin, 2003).
In this study, the groups of items were created simply based on the items’ order in the NEO-FFI.
Unfortunately, it was not possible to introduce the items with an introductory sentence: to do
so, 12 groups of 5 items (one per trait) would have had to be found such that all items in a group
shared a common theme that could be depicted by a certain situation. It would not have been
possible to find such groups among the NEO-FFI questions without major changes.
A very important aspect of the grouping was to create appropriate item sets. According to
Meade (2004), it is most desirable for the items of each scale to appear in item sets with the
other scales an equal number of times. If items from one scale occurred more often with items
from some scales than others, this would lead to higher negative correlations between the scales
whose items occur together more often, while scales co-occurring in fewer sets would have
lower correlations with the other scales. The easiest way to achieve this is to include items from
every scale in every item set.
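This grouping rule can be sketched in a few lines of Python (the cyclic N-E-O-A-C item numbering below is my assumption for illustration): if the 60 items cycle through the five scales, cutting the ordered item list into consecutive blocks of five automatically puts exactly one item from each scale into every set.

```python
# Assumed illustration: NEO-FFI-style numbering where items cycle through
# the five scales (item 1 -> N, item 2 -> E, ..., item 6 -> N again).
scales = ["N", "E", "O", "A", "C"]
items = [(i + 1, scales[i % 5]) for i in range(60)]  # (item number, scale)

# Cut the ordered list into 12 consecutive sets of 5 items.
item_sets = [items[i:i + 5] for i in range(0, 60, 5)]

# Each set then contains exactly one item from every scale.
for s in item_sets:
    assert sorted(scale for _, scale in s) == sorted(scales)
print(len(item_sets))  # 12
```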
In this particular study, the maximum number of items per set was not an issue, since there
were only 5 scales. However, in tests with more scales it could be problematic to include items
from all scales in one item set. With an increasing number of items per set, the difficulty and
cognitive strain posed on respondents would rise proportionally, and the number of possible
rankings would also be higher, thus increasing the variance. Furthermore, with an increasing
number of scales the interdependency decreases, so the covariance-level interdependence
would be reduced. On the other hand, the true score would be influenced by more distorting
factors (namely the true and error scores of the other scales) (Meade, 2004).
12 This is because of the character of ipsative tests, which lies in ordering the items.
3.2.2. Determining the maximum points to be distributed in groups in
form B
Since purely ipsative tests can only rank the items, this method yields ordinal data. The aim of
this work, however, was to create an ipsative measure with a higher level of data (at least
interval). After an in-depth analysis of the well-known ipsative Belbin test of team roles
(Belbin, 2013), a similar model of point distribution was considered. In Belbin’s test, however,
respondents could distribute only as many points as there were items, and experience with the
test showed that distributing only 10 points among 10 statements is very limiting. The
hypothesis therefore arose that if people could distribute more points than there are statements,
this would result in higher variability and more exact results.
The drawback of distributing points is, however, that the more points there are to distribute,
the harder the sums are to calculate. The amount of points to be distributed per question was
chosen according to the following criteria:
1. The sum must be divisible by the number of traits (5);
2. The sum must be a number that is easy to calculate with (round numbers are generally
easier to add up);
3. The sum must not be too high (50 is definitely too high).
To briefly explain these criteria: as for the first one, in case none of the statements describes
the respondent well, he should be able to distribute the points evenly. Even though this does
not happen very often, if the sum of points to distribute were not divisible by 5, the respondent
would be forced to give some answers more points than he wanted to.
The second criterion is connected with the user-friendliness of the test. Very difficult test
formats often fail to collect relevant data, since people are exhausted by the method itself and
cannot focus on the actual content of the test (Oppenheim, 1996). In this case, it would be
unnecessarily difficult for the respondents to calculate sums of 5 numbers toward totals that
are hard to add up (e.g. 17, 19 etc.). Generally, the easiest numbers to add up are round numbers
like 10, 20, 30 etc.
Finally, the easier it is to assign the points, the better: the less demanding the calculation, the
more the respondents can think about the content of their answers. Therefore a sum of 10 points
to be distributed among the 5 items was chosen.
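The three criteria above can be made concrete with a small sketch. The helper below is hypothetical (not part of the thesis procedure), and the `max_sum` threshold of 45 is an assumption standing in for "not too high":

```python
# Hypothetical helper screening candidate point sums against the three
# criteria from the text: divisible by the number of traits (5), a round
# number that is easy to add up, and not excessively high.
def acceptable_sums(candidates, n_traits=5, max_sum=45):
    def ok(total):
        divisible = total % n_traits == 0    # criterion 1: evenly splittable
        round_number = total % 10 == 0       # criterion 2: easy mental arithmetic
        not_too_high = total <= max_sum      # criterion 3: 50 is "definitely too high"
        return divisible and round_number and not_too_high
    return [total for total in candidates if ok(total)]

print(acceptable_sums(range(5, 55, 5)))  # -> [10, 20, 30, 40]
```

Of the surviving candidates, 10 is the smallest and therefore the least demanding to work with, which matches the choice made above.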
3.2.3. Transformation of negative questions
Since the ipsative version of the test required distributing a fixed number of points among
different items, this form could not mix negative questions with positive ones in the same group.
The reason is that a negative question in the NEO-FFI is evaluated in a reversed way compared
to positive statements: to obtain the true score of a trait, negative questions must be transformed
as if they were positive. For example, if someone chooses 4 points on a 5-point Likert scale for
a negative question, this is transformed to 2 points in the evaluation. In ipsative measures,
however, it is not possible to transform points in this way. The nature of an ipsative scale is
such that there is no fixed range for how many points can be assigned to one statement and how
many to another. This applies even when there is a cap on the maximum points assignable to
one statement (in this study, the maximum was 5 points out of the total of 10 to be distributed).
To explain: even though one can assign at most 5 points to one statement, there is a secondary
limitation in the form of the other questions, because this maximum cap can be lowered by
assigning values elsewhere. Since the respondent can distribute only 10 points, assigning 6
points to the other 4 statements would decrease the maximum he can assign to the negative
question to 4. Because of this, it is not possible to evaluate negative questions by assigning
them reversed values; we do not know the scale on which the reversed value would be
calculated.
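The contrast can be sketched in a few lines (a minimal illustration with hypothetical helper names, not the thesis scoring code):

```python
# On a fixed 1-5 Likert scale, reverse-scoring is well defined: r = (max + 1) - x.
def reverse_likert(x, scale_max=5):
    return scale_max + 1 - x

# In the ipsative form, the effective ceiling for one statement depends on the
# points already given to the other four, so no fixed reversal formula exists.
def effective_max(points_to_others, total=10, item_cap=5):
    return min(item_cap, total - sum(points_to_others))

print(reverse_likert(4))            # the worked example from the text: 4 -> 2
print(effective_max([2, 2, 1, 1]))  # the others took 6 points, so at most 4 remain
```

The first function always maps onto the same fixed range; the second shows that the ipsative "range" is itself a moving target, which is why reverse-scoring cannot be applied.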
Since the ipsative form cannot contain both negative and positive questions at the same time,
there were a few options for eliminating this methodological problem. The first possibility was
to transform all questions to be either positive or negative. The second option was to group
questions so that each group contained only positive or only negative questions.
This second option seemed more complicated and methodologically problematic. First of all, it
was not even feasible, since the number of negatively formulated questions was not equal across
the five traits (in the ipsative form of the test there must be one statement for each trait in a
group). Even if it were feasible, it would be difficult for the respondents to decode the
negatively formulated questions and assign them the correct values: not only would they have
to create the positive formulation of each question and keep it in working memory, they would
also have to order these questions according to their relevance to them.
The option chosen was to transform all negative questions into positive ones, so that the whole
test consisted of positively formulated questions. This made it possible to create groups of five
items, one from each trait, and to evaluate the responses without any problems.
3.3. Experimental design
The transformation of questions described above constituted a modification of the NEO-FFI
test. The psychometric properties, such as reliability and validity, of this “modified” NEO-FFI
were unknown. Therefore the reliability of this test had to be estimated first (the question of
validity was of no importance for the purpose of this study).
In order to test the reliability, both forms of the test were re-tested. Apart from this, Form A
was to be compared with Form B. It follows that the experimental setting used was a within-
subject design. This design required the respondents to be divided into 4 groups, as depicted in
Table 1.
          Group A     Group B     Group C    Group D
Test      Ipsative    Normative   Ipsative   Normative
Re-test   Normative   Ipsative    Ipsative   Normative
Table 1: Within-subject design
3.4. Respondents
In this study a convenience sample of volunteers was used (school, friends, etc.). Initially it was
planned to gather at least 50 respondents for each group. Since data were collected through
online questionnaires, no specific group of the population was targeted; it is, however, evident
that the majority of the respondents were students.
The total number of respondents was 303. Of these, only 183 participated in both the test and
the re-test measurement; the experimental mortality was thus 40 %.
The data collected from respondents were as anonymous as possible. Only information
regarding gender, age and email address was collected from each respondent; respondents
could optionally provide their occupation. The email address was necessary for contacting the
respondents 3 months after the first administration in order to conduct the re-test. If a
respondent did not react within one week, either by a reply or by filling out the questionnaire,
he or she was contacted a second time13. After the second notification, no further emails were
sent. Respondents who participated in both the test and the re-test measurement received, as
promised, a graph of their results together with a description of the scales.
4. Results
The data were analysed in SPSS, Excel and Statistica. Unless indicated otherwise, SPSS was
used.
4.1. Total data
The total number of questionnaires collected was slightly below 500, 495 to be exact. Of these,
259 were collected for the normative form and 214 for the ipsative form, and around 30 were
excluded because the respondents did not fill them out properly (ipsative form). After deducting
the repeated measurements in Groups C and D from this total, the remaining number of
questionnaires is 214 for the normative form and 169 for the ipsative form.
4.2. Normative data
4.2.1. Description of normative data
In total, 214 respondents filled out the normative version of the test (from Groups A, B and D).
Of these, 30.8 % were males and 69.2 % were females. As for age, most respondents were in
their twenties: the range from 18 to 24 years covered 84 % of the respondents. The minimum
age was 16 and the maximum 65, establishing a range of 49 years. The central tendency
indicators showed similar results: the mean was 22.7 years, the median 21 years and the mode
20 years. Only 2.5 % of respondents were younger than 18 years and 13.6 % were older than 24.

13 More than 500 notification e-mails were sent individually to respondents throughout the experiment
As for occupation, 106 respondents (49.5 %) stated that they were students, 75 respondents
(35 %) did not mention their occupation, and the remaining 35 respondents stated various
occupations, each occurring only once.
4.2.2. NEO-FFI Results
The means for the respective scales were as follows:

N = 23, E = 30.2, O = 29.1, A = 29.3, C = 29.6.

The medians were nearly the same, differing only by rounding. The standard deviations were
not high, except for Neuroticism and Conscientiousness. Results on the various scales ranged
from a minimum of 2 points up to a maximum of 48 points (the maximum achievable on the
scales). The respective ranges, minimums and maximums are described in Table 2 below.
                  Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
Mean                    23.00         30.18     29.08          29.33              29.62
Median                  23.00         31.00     30.00          30.00              30.00
Std. Deviation           9.91          7.97      6.78           6.35               8.66
Variance                98.36         63.65     46.00          40.34              75.12
Range                   46.00         38.00     40.00          38.00              41.00
Percentiles  25         15.00         25.00     24.00          26.00              24.00
             50         23.00         31.00     30.00          30.00              30.00
             75         31.00         36.00     34.00          34.00              36.25
Table 2: Descriptive statistics of NEO-FFI scales for normative sample (N = 214)
Differences between the groups of males and females were analysed and only minor
discrepancies were found, without statistically significant differences. In general, females
scored 0.5-1 raw points higher than males on the scales, except for Neuroticism, where females
scored 3 raw points higher.
The effect of age was also considered. Respondents were divided into two groups, where the
“young” were defined as 24 years old and younger, and the “adults” as 25 years old and older
(these cut-offs were used because there was a marked drop in the frequency of respondents
above 24 years). It is important to note that the latter group consisted of only 22 respondents.
The means of this group differed significantly from those of the “young” group, especially on
the Neuroticism, Agreeableness and Conscientiousness scales (t-tests showed significant
differences with t = 2.72, df = 211, p < 0.01 for Neuroticism; t = -2.7, df = 211, p < 0.01 for
Agreeableness; and t = -4.36, df = 211, p < 0.001 for Conscientiousness). As can be seen in
Table 3 below, the greatest difference appeared on the Conscientiousness scale, where the mean
of the “adults” was 8 points higher than that of the “young” respondents. Furthermore, Levene’s
test showed that the variances were not homogeneous on the Conscientiousness scale.
Scale               Age group   N     Mean    Std. Deviation
Neuroticism         Young       191   23.63    9.95
                    Adults       22   17.63    8.26
Extraversion        Young       191   30.17    8.00
                    Adults       22   30.41    8.10
Openness            Young       191   29.28    6.91
                    Adults       22   27.09    5.41
Agreeableness       Young       191   28.92    6.43
                    Adults       22   32.73    4.56
Conscientiousness   Young       191   28.76    8.61
                    Adults       22   36.96    5.30
Table 3: Descriptive statistics comparing "Young" vs. "Adult" respondents
4.2.3. Internal consistency
The internal consistency of each scale was estimated using Cronbach’s Alpha. The respective
coefficients ranged from 0.70 for the Openness scale to 0.89 for the Neuroticism and
Conscientiousness scales, indicating acceptable to very high internal consistency. The results
are summarized in Table 4 below.
Scale               Cronbach's Alpha   N of Items
Neuroticism         0.89               12
Extraversion        0.84               12
Openness            0.70               12
Agreeableness       0.72               12
Conscientiousness   0.89               12
Table 4: Reliability statistics: Cronbach's alphas for the normative form
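For reference, the coefficient reported above follows the standard Cronbach's alpha formula; a minimal sketch (with hypothetical data, not the thesis data) is:

```python
import numpy as np

# Cronbach's alpha for one scale: rows = respondents, columns = the 12 items.
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Sanity check: 12 perfectly parallel items yield alpha = 1.
parallel = np.tile(np.arange(5.0).reshape(-1, 1), (1, 12))
print(cronbach_alpha(parallel))  # -> 1.0
```

With real item data, lower inter-item correlations pull the total-score variance down relative to the summed item variances, which is what drives alpha below 1.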
Upon more detailed analysis of the reliability statistics, it can be stated that the items measuring
Neuroticism showed medium inter-item correlations ranging from 0.02 to 0.66, with an average
correlation of 0.4. No extremely high or very low correlations were observed. This pattern was
also confirmed by the “Cronbach’s Alpha if item deleted” indicator: deleting any item would
decrease the reliability of this scale, with the exception of the first item, whose deletion would
increase it only slightly.
The scale of Extraversion can be described in the same way, except that there was higher
variance in the inter-item correlations (ranging from 0.01 to 0.71). Also, item number 12
appeared to lower the overall reliability; upon its deletion the reliability would increase to 0.85.

The scale of Openness also contained some near-zero negative correlations. The lowest
correlation was -0.06 and the highest was 0.64, with a mean of 0.16. This scale appeared to be
less consistent, since deleting items 1, 4 or 8 would increase the reliability: the first two only
slightly, the last one by 0.02.

The items measuring Agreeableness showed slightly lower correlations, ranging from -0.03 to
0.44 with a mean of 0.19. There were again three items that would slightly increase the
reliability if deleted (items 6, 9 and 12).

Lastly, the items for Conscientiousness showed the highest Cronbach’s Alpha, at 0.89. The
correlation matrix revealed a general pattern of medium positive inter-correlations ranging
from 0.25 to 0.68 with a mean of 0.41. No item would increase the reliability of this scale upon
deletion.
Finally, it should be noted that the reliabilities obtained for the modified NEO-FFI test were
nearly the same as the reliabilities published in the test manual and in other studies. Compared
with the NEO-FFI scores from the manual (see chapter 1.9.3.), the results obtained in this study
were higher by 0.03 in N, 0.07 in E, 0.04 in A and 0.08 in C. The Openness scale was the
exception, where the result from this study was slightly lower, by 0.03.
4.2.4. Factor analysis
The factor analysis was conducted with 5 fixed factors (Varimax rotation was used). The
cumulative variance explained by the 5 factors was 43.12 %. Of this, the first factor accounted
for 17 % and the remaining factors for 8 %, 7 %, 6 % and 4 % respectively.
In-depth analysis of the Rotated Component Matrix (Attachment 1) showed evidence of the
presence of Neuroticism, Conscientiousness, Extraversion, Agreeableness and Openness (in
the order of the factors). As for the first factor, all items from the Neuroticism scale correlated
with it, most of them highly (mostly above 0.6); some items from other scales correlated
moderately negatively with it. The second factor was related to items from the
Conscientiousness scale. In this case a perfect pattern was observed, since all items from this
scale correlated with the factor, and all these correlations were high (above 0.6). On the other
hand, 3 items from different scales correlated moderately with this factor as well. Factor 3 can
be described similarly, but with slightly lower average correlations: all items from the
Extraversion scale correlated highly with this factor, but 4 additional items from other scales
correlated moderately as well. Factor 4 correlated with 10 items from the Agreeableness scale
and 3 other items; the average correlation was 0.50. As for Factor 5, only 6 out of 12 items from
the Openness scale correlated with this factor, but the correlations were high, with an average
of 0.62.
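The "cumulative variance" figure reported above can be illustrated with a short numpy sketch (synthetic data standing in for the real item matrix, not the thesis data): the eigenvalues of the item correlation matrix partition the total variance, and the share carried by the largest eigenvalues is the variance those components describe.

```python
import numpy as np

# Synthetic 12-item data with one injected common factor; the eigenvalues of
# the correlation matrix sum to the number of items, so their normalized sum
# gives the fraction of variance described by a set of components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
X[:, :6] += rng.normal(size=(200, 1))     # first six items share a factor

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
cumulative_share = eigvals[:5].sum() / eigvals.sum()
print(cumulative_share)                   # fraction of variance in 5 components
```

The rotated loadings discussed above redistribute this variance across factors, but the cumulative share for a fixed number of factors is unchanged by rotation.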
4.2.5. Reliability - test-retest group
The reliability of the normative version of the test was also assessed by estimating the
differences between the results obtained from the same respondents in the test and in the re-test,
which took place 3 months later. In this way the stability of the measure was tested. Group D,
with a total of 46 respondents, was used for this comparison.
Firstly, a paired-sample t-test was conducted for each scale of the NEO-FFI, comparing the test
and retest results of the subjects. As expected, the t-tests found no significant differences
between the two measurements, with the exception of the Extraversion scale, where the test
and retest results differed significantly (t = -3.51, df = 45, p = 0.001). The difference between
the means of the two measurements was 2.2 points (Table 6).
All correlations between the two measurements were high and statistically significant. The
highest correlation was obtained for the Extraversion scale, with a very high Pearson’s r of
0.86; the lowest was for the Agreeableness scale, at 0.67. The other values are listed in Table 5
below.
Correlation Sig.
Pair 1 Neuroticism & Neuroticism_2 0.80 0.00
Pair 2 Extraversion & Extraversion_2 0.86 0.00
Pair 3 Openness & Openness_2 0.76 0.00
Pair 4 Agreeableness & Agreeableness_2 0.67 0.00
Pair 5 Conscientiousness & Conscientiousness_2 0.83 0.00
Table 5 Reliability: Correlation of test and re-test results in Group D (N = 45)
Paired Differences                                 Mean   Std. Deviation       t   Sig. (2-tailed)
Pair 1  Neuroticism & Neuroticism_2               -0.48             6.17   -0.53              0.60
Pair 2  Extraversion & Extraversion_2             -2.20             4.24   -3.51              0.00
Pair 3  Openness & Openness_2                     -0.04             5.09   -0.06              0.95
Pair 4  Agreeableness & Agreeableness_2           -1.19             4.88   -1.66              0.10
Pair 5  Conscientiousness & Conscientiousness_2   -1.39             5.31   -1.81              0.07
Table 6: Reliability: t-test results for test and re-test total results in Group D (N = 45)
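The two statistics used here, the paired t-test and the test-retest Pearson correlation, can be sketched with scipy on hypothetical scores (the variable names and data below are illustrative, not the thesis data):

```python
import numpy as np
from scipy import stats

# Hypothetical test/retest scores for N = 46 respondents: a stable trait
# component plus independent measurement noise on each occasion.
rng = np.random.default_rng(1)
trait = rng.normal(30, 8, size=46)
test = trait + rng.normal(0, 4, size=46)
retest = trait + rng.normal(0, 4, size=46)

t_stat, p_value = stats.ttest_rel(test, retest)   # paired t-test, df = N - 1 = 45
r, _ = stats.pearsonr(test, retest)               # stability coefficient
print(t_stat, p_value, r)
```

The paired design tests whether the mean score drifted between occasions, while the correlation captures whether respondents kept their relative positions; a scale can pass one check and fail the other, as Extraversion does above.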
4.3. Ipsative data
The ipsative form of the test was administered in Groups A, B and C. Altogether, 169
respondents filled out the form (not including the duplicates).
4.3.1. Description of ipsative data
The average age of the respondents was 21.4 years (mean = 21.4, median = 21). The youngest
person was 16 years old and the oldest 64, giving a range of 48 years. Most respondents were
in their early twenties: 79.5 % of them were 18-24 years old, and only 14 persons were older
than 30.

As for occupation, most respondents were students: 36.3 % explicitly stated that they were
students and another 48.8 % did not state their occupation. The remaining 14.9 % of
respondents stated different occupations, none of which was represented by more than one
person.

The representation of gender was similar to that in the normative version of the test: of the
total, 32.1 % were males and 64.9 % were females.
4.3.2. NEO FFI Results – Ipsative
The means for the respective scales were as follows:

N = 17.4, E = 24.08, O = 24.97, A = 25.86, C = 27.63.

The medians were very similar. The standard deviations and variances were slightly higher
than in the normative form; an especially high variance was observed for the Neuroticism scale
(146.44, as seen in Table 7).

The ranges of scores on the various scales differed from those observed in the normative data;
this was, however, due to the construction differences between the scales of the two forms. For
Neuroticism and Extraversion a rather unexpected result was observed, where the scores
ranged from 0 up to the total of 60 points. For the other scales the ranges were also large, with
very low minimum scores. The respective ranges, minimums and maximums are described in
Table 7 below.
                  Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
Mean                    17.37         24.08     24.97          25.86              27.63
Median                  16.00         23.00     24.00          25.00              27.00
Std. Deviation          12.10          9.93      8.41           6.46                  –
Variance               146.44         98.61     70.72          41.78                  –
Range                   60.00         60.00     47.00          49.00              42.00
Minimum                  0.00          0.00      2.00           3.00               8.00
Maximum                 60.00         60.00     49.00          52.00              50.00
Percentiles  25          7.25         17.00     19.00          22.00              20.25
             50         16.00         23.00     24.00          25.00              27.00
             75         25.00         31.00     30.00          30.00              34.00
Table 7: Descriptive statistics of NEO-FFI scales for ipsative form (N = 168)
The frequency graphs of the scores on the respective scales showed that the scores followed
the normal distribution curve on every scale except Neuroticism, where the distribution of total
scores was skewed towards lower values.
Because gender was not equally distributed, the homogeneity of the total results of males and
females was analysed by comparing the results using t-tests. A significant difference was found
on the Agreeableness scale, with t = 1.99, df = 161 and p = 0.048. A statistically significant
result in Levene’s test for equality of variances was also obtained (Table 8). Because these
results indicate heteroscedasticity, a further analysis was conducted using the non-parametric
Mann-Whitney test. This second test did not show any statistically significant differences. Even
though males achieved total means higher by 2 points on the Agreeableness scale, the
conclusion is that gender did not significantly affect the results.
                                                Levene's Test      t-test for Equality of Means
                                                   F      Sig.        t        df   Sig. (2-tailed)   Mean Difference
Neuroticism        Equal variances assumed      0.73      0.39    -0.71    161.00              0.48             -1.44
Extraversion       Equal variances assumed      2.22      0.14    -0.17    161.00              0.86             -0.28
Openness           Equal variances assumed      4.65      0.03    -0.37    161.00              0.72             -0.51
                   Equal variances not assumed                    -0.40    133.03              0.69             -0.51
Agreeableness      Equal variances assumed      0.00      0.97     1.99    161.00              0.05              2.10
Conscientiousness  Equal variances assumed      0.63      0.43    -0.18    161.00              0.85             -0.29
Table 8: t-test: effect of gender on total scores
Next, the influence of age was analysed using a similar procedure as for the normative data.
Statistically significant differences were obtained on the Neuroticism scale, with t = 2.76,
df = 163 and p = 0.01, as well as on the Agreeableness scale, with t = -2.94, df = 163 and
p = 0.01. These results were confirmed by a non-parametric Mann-Whitney test, which showed
significant differences on the Neuroticism scale (p = 0.003) and the Agreeableness scale
(p = 0.001), and in addition on the Extraversion scale (p = 0.022); the result for the
Extraversion scale in the t-test was slightly above the significance threshold (t = -1.94,
df = 163, p = 0.054). Age therefore significantly influenced the results on the N, E and A scales.
On the Neuroticism scale the “adults” achieved significantly lower scores, whereas on the
Agreeableness scale they achieved significantly higher scores (Table 9). On the other hand,
Levene’s test did not show any statistically significant results, suggesting that the sample did
not violate the assumption of homogeneity of variances.
Scale               Age group   N     Mean    Std. Deviation
Neuroticism         Young       131   18.83   12.24
                    Adults       34   12.50   10.18
Extraversion        Young       131   23.28   10.12
                    Adults       34   26.94    8.34
Openness            Young       131   25.32    8.27
                    Adults       34   23.27    8.73
Agreeableness       Young       131   25.00    6.49
                    Adults       34   28.53    5.11
Conscientiousness   Young       131   27.45    9.67
                    Adults       34   28.71    8.18
Table 9: Comparison of means of age groups in ipsative form
For the ipsative version, the correlations between the scales were also analysed using Pearson’s
r. Because of the nature of ipsative data, the correlations between the scales should lie around
the mathematically expected value of -0.25 (since there are 5 scales; see chapter 1.6.2.2). In
reality, the correlations differed. Even though they were mostly negative (as they should have
been), slightly positive correlations occurred between Neuroticism and Openness (0.13) and
between Extraversion and Agreeableness (0.14). Also, the correlations in each row do not add
up to -1 as expected. The whole correlation matrix is shown in Table 10 below.
                    Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
Neuroticism              1           -0.68**      0.13        -0.40**        -0.42**
Extraversion            -0.69**       1          -0.28**       0.14          -0.02
Openness                 0.13        -0.28**      1           -0.38**        -0.50**
Agreeableness           -0.40**       0.14       -0.38**       1              0.02
Conscientiousness       -0.42**      -0.02       -0.50**       0.02           1
**. Correlation is significant at the 0.01 level (2-tailed).
Table 10: Pearson's correlations among scales of NEO-FFI for ipsative form (N = 168)
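The expected value of -1/(k - 1) = -0.25 for k = 5 ipsative scales can be demonstrated with a short simulation (random ipsatized scores, not the thesis data):

```python
import numpy as np

# Simulate 5 scale scores per respondent and ipsatize them (centre each row),
# so every row sums to a constant. The ipsative constraint then forces the
# off-diagonal correlations toward -1/(k - 1) = -0.25.
k = 5
rng = np.random.default_rng(42)
raw = rng.normal(size=(5000, k))
ipsatized = raw - raw.mean(axis=1, keepdims=True)

R = np.corrcoef(ipsatized, rowvar=False)
off_diagonal = R[~np.eye(k, dtype=bool)]
print(off_diagonal.mean())   # close to -0.25
```

Real scales are not fully ipsatized in this idealized sense, and their variances differ, which is why the observed correlations scatter around -0.25 instead of matching it exactly.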
4.3.3. Internal consistency
The internal consistency of each scale was again estimated using Cronbach’s Alpha (the
appropriateness of this analysis for ipsative data is discussed below). The respective
coefficients ranged from 0.68 for the Agreeableness scale to 0.93 for the Neuroticism scale,
indicating acceptable to very high internal consistency. The results are summarized in Table 11
below.
Scale               Cronbach's Alpha   N of Items
Neuroticism         0.93               12
Extraversion        0.88               12
Openness            0.78               12
Agreeableness       0.68               12
Conscientiousness   0.86               12
Table 11: Reliability: Cronbach's alphas for the ipsative form
Upon more detailed analysis of the reliability statistics, it can be stated that the items measuring
Neuroticism showed medium inter-item correlations ranging from 0.3 to 0.7, with a relatively
high average correlation of 0.5. No extremely high or very low correlations were observed. The
results also showed that no item would increase the reliability if deleted; we can therefore
conclude that all items constituted a highly reliable scale.

The scale of Extraversion followed a similar pattern, with minimal and maximal inter-item
correlations of 0.17 and 0.7 (the mean inter-item correlation was 0.39). All items were
confirmed to add to the internal consistency of the scale, except for question 47, whose deletion
would leave the overall reliability unchanged.
The Openness scale also contained some near-zero negative correlations. The lowest inter-item
correlation was -0.02 and the highest was 0.54, with a mean of 0.23. Two items, namely
Questions 8 and 33, did not add to the internal consistency; deleting them would leave it
unchanged.

The items measuring Agreeableness showed slightly lower correlations than the other scales,
ranging from -0.04 to 0.44 with a mean of 0.15. As for the “if deleted” analysis, removing
Question 29 would very slightly increase the internal consistency.

Lastly, the correlation matrix for Conscientiousness revealed a general pattern of medium
positive inter-correlations ranging from 0.15 to 0.56 with a mean of 0.34. No item would
increase the reliability of this scale if deleted.
4.3.4. Factor analysis
For the factor analysis the same procedure was used as with the normative data (5 fixed factors
and Varimax rotation). The results were, however, different. The cumulative variance explained
by the 5 factors was 46.19 %. Of this, the first factor accounted for 18.5 % and the remaining
factors for 11 %, 7 %, 5 % and 4.5 % respectively.
The Rotated Component Matrix (Attachment 2) showed uninterpretable results: a mixture of
positive and negative correlations without any clear meaning. Factor 1 combined items from
the Extraversion scale, which on average correlated highly positively with the factor, and items
from the Neuroticism scale, which on average correlated highly negatively. The second factor
yielded high correlations with all items from the Conscientiousness scale, but many highly
negatively correlated items from other scales were observed as well. Thus the expectation that
the interdependencies among the scales would not allow for a meaningful factor analysis was
confirmed.
4.3.5. Cluster analysis
Since factor analysis cannot be applied to ipsative data, a different approach was considered:
namely, cluster analysis, creating clusters not from respondents but from variables. This way a
reduction of scales can be achieved without using correlations. In the cluster analysis the
Euclidean distances between items were measured, and based on these distances the items were
grouped into clusters (Romesburg, 2004).14
The cluster analysis showed a nearly perfect pattern of clusters (Table 12). From the table it
can be seen that Cluster 1 contains all questions for the Conscientiousness scale; Cluster 2
contains all questions for the Extraversion scale; Cluster 3 contains all questions for the
Neuroticism scale; Cluster 4 contains 9 out of the 12 questions for the Agreeableness scale;
and Cluster 5 contains all questions for the Openness scale. The only discrepancy observed is
that 3 questions belonging to Cluster 4 (under Agreeableness) are located in Cluster 1.
Otherwise the clusters perfectly modelled the expected factors.

14 For this purpose K-means hierarchical Cluster Analysis was conducted using Statistica software.
Cluster 1    Cluster 2    Cluster 3    Cluster 4    Cluster 5
Question4    Question2    Question1    Question9    Question3
Question5    Question7    Question6    Question14   Question8
Question10   Question12   Question11   Question19   Question13
Question15   Question17   Question16   Question24   Question18
Question20   Question22   Question21   Question29   Question23
Question25   Question27   Question26   Question39   Question28
Question30   Question32   Question31   Question44   Question33
Question34   Question37   Question36   Question54   Question38
Question35   Question42   Question41   Question59   Question43
Question40   Question47   Question46                Question48
Question45   Question52   Question51                Question53
Question49   Question57   Question56                Question58
Question50
Question55
Question60
Table 12: Cluster Analysis of Ipsative Form - Results
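The clustering-of-variables step can be sketched with scipy (synthetic two-factor items standing in for the 60 NEO-FFI items; the Ward linkage method is an assumption for illustration, since the thesis analysis was run in Statistica):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Six synthetic items: the first three load on one latent factor, the last
# three on another. Transposing the data matrix makes the *items* the
# observations to be clustered, as in the analysis above.
rng = np.random.default_rng(7)
factor_a = rng.normal(size=(100, 1))
factor_b = rng.normal(size=(100, 1))
items = np.hstack([factor_a + rng.normal(0, 0.5, (100, 3)),
                   factor_b + rng.normal(0, 0.5, (100, 3))])

distances = pdist(items.T, metric='euclidean')   # item-to-item distances
tree = linkage(distances, method='ward')
labels = fcluster(tree, t=2, criterion='maxclust')
print(labels)   # the two item groups separate into two clusters
```

Items that share a latent trait take similar values across respondents, so their Euclidean distance is small and they end up in the same cluster, which is why the clusters can recover the scales without using correlations.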
4.3.6. Reliability - test-retest group
For the reliability (i.e. the stability of results over time), data from Group C were used.
Respondents in this group filled out the ipsative form and after 3 months were retested using
the same form with the order of questions shuffled.

The number of respondents who filled out both the test and the re-test was 45. Of these, 9
(20 %) were males and 36 (80 %) were females. More than 90 % of the respondents in this
group were students of Psychology at Masaryk University. The mean age was 22.4 years, with
a variance of 10 (the youngest respondent was 17 years old and the oldest was 30).
For the comparison of the results a paired sample t-test was conducted. The results showed no
significant differences between the test and retest form (Table 13 and Table 14).
                             Mean    N    Std. Deviation
Pair 1  Neuroticism         18.47   45   13.12
        Neuroticism_2       19.24   45   12.90
Pair 2  Extraversion        21.84   45    8.90
        Extraversion_2      21.27   45    9.14
Pair 3  Openness            26.73   45    9.85
        Openness_2          26.82   45    9.78
Pair 4  Agreeableness       24.51   45    6.80
        Agreeableness_2     24.82   45    6.58
Pair 5  Conscientiousness   28.22   45   11.09
        Conscientiousness_2 27.60   45    9.80
Table 13: Reliability statistics: Comparison of test and re-test results in Group C (Ipsative Form)
                                                   Mean   Std. Deviation       t      df   Sig. (2-tailed)
Pair 1  Neuroticism - Neuroticism_2               -0.78             5.94   -0.88   44.00                 –
Pair 2  Extraversion - Extraversion_2              0.58             4.88    0.79   44.00                 –
Pair 3  Openness - Openness_2                     -0.09             4.23   -0.14   44.00                 –
Pair 4  Agreeableness - Agreeableness_2           -0.31             5.32   -0.39   44.00                 –
Pair 5  Conscientiousness - Conscientiousness_2    0.62             4.67    0.89   44.00                 –
Table 14: Reliability statistics: t-test results of test and re-test in Group C (Ipsative Form)
4.4. Comparing ipsative and normative data
For the comparison of results from the ipsative and normative forms, Groups A and B were
used. Altogether, data from 93 respondents were analysed. Since comparing normative with
ipsative results is complicated, several methods were used at each step to confirm the findings.
First, non-statistical methods such as scatter plots were used. Next, correlation coefficients were
estimated using both non-parametric and parametric coefficients. Then the positions (rank or
percentile) of each individual in the whole sample were calculated for each scale, and the
positions obtained from the normative and ipsative forms were compared. After that, each item
of the NEO inventory was analysed separately in order to find out whether the results obtained
in the two forms differed for the same item. Finally, the ranking of scales for each individual
in the ipsative form (the order of scales) was compared with the ranking of the normative scales.
4.4.1. Graphical representation of relations between Ipsative and
normative data
The simplest and statistically least controversial method of showing the relations between the
ipsative and normative results for each respondent is to depict them graphically in a scatter
plot. This way the raw data are compared without the risk of applying improper statistical
methods.

As expected, the total scores of respondents from the ipsative and normative forms are closely
related: with increasing scores on the normative form, the scores on the ipsative form also tend
to rise. After scrutinizing the scatter plots it can be concluded that the relation is weakest for
the Openness scale, whereas the strongest relations are for Neuroticism and Conscientiousness
(Graphs 1-5).
Graph 1: Scatter plot Neuroticism: Normative vs. Ipsative results
Graph 2: Scatter plot Extraversion: Normative vs. Ipsative results
Graph 3: Scatter plot Openness: Normative vs. Ipsative results
Graph 4: Scatter plot Agreeableness: Normative vs. Ipsative results
Graph 5: Scatter plot Conscientiousness: Normative vs. Ipsative results
(Each graph plots ipsative scores on the x-axis against normative scores on the y-axis.)
4.4.2. Correlation coefficient
In order to quantify the relations between the two forms, statistical methods must be applied.
For this purpose the correlation of the total results for each scale was estimated using Pearson’s
r, and the results were also confirmed using Spearman’s correlation coefficient. All the
correlations were statistically significant. As Table 15 shows, Neuroticism, Extraversion and
Conscientiousness all showed high correlations (0.75-0.78). On the other hand, scores on the
Openness and Agreeableness scales showed only moderate correlations (0.56 and 0.54).
                    Spearman's coefficient   Pearson's R   Count
Neuroticism                 0.78                0.78         93
Extraversion                0.78                0.77         93
Openness                    0.56                0.59         93
Agreeableness               0.54                0.53         93
Conscientiousness           0.76                0.75         93
Table 15 Correlations between total scores of ipsative and normative form
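The two coefficients reported in Table 15 can be reproduced in a few lines. A minimal pure-Python sketch (using illustrative data, not the thesis sample) shows the key point that Spearman's rho is simply Pearson's r computed on ranks:

```python
from statistics import mean

def average_ranks(xs):
    # 1-based ranks; tied values receive the mean of their positions.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson's r on the rank-transformed scores.
    return pearson_r(average_ranks(xs), average_ranks(ys))
```

This also explains why the two coefficients in Table 15 are so close: when the raw scores are already nearly monotonically related, ranking changes little.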
Considering the fact that the scores compared were obtained from exactly the same questions
administered to the same persons in different test forms, we would expect very high
correlations. According to Kline's guidelines, reliability over 0.7 is acceptable (Kline, 2000).
It is apparent that Neuroticism, Extraversion and Conscientiousness showed high or even very
high correlations. Therefore it can be concluded that the ipsative test form did not differ from
the normative test form in the total results for these scales. As for the Openness and
Agreeableness scales, their reliability is low and, depending on which criterion is used, they
could be said to have poor reliability or to be not reliable at all. Therefore it cannot be concluded
that the ipsative form is equivalent to the normative form in these two scales.
4.4.3. Comparison of correlations in test and re-test in various groups
The differences in orders were the following. In the Neuroticism scale, persons differed in their
ranks on average by 12.15 positions out of 93. This means that a person with the 20th lowest
score in the ipsative form could rank anywhere between the 8th and the 32nd lowest score in
the normative form (and vice versa). The difference in ranks for Extraversion was on average
12.75 positions, similarly high as for Neuroticism. For the Openness and Agreeableness scales
the differences were even higher, namely 18 and 18.3 positions respectively. For
Conscientiousness the difference in positions was estimated at 13.8. To estimate the closeness
of the resulting ranks, Spearman's correlation was used and it showed high correlations (Table 16).
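The average rank distance used above can be computed directly. A small sketch follows, with illustrative scores rather than the thesis data, and simple tie-breaking by position (a tie-aware averaging scheme could be substituted):

```python
from statistics import mean

def simple_ranks(scores):
    # 1-based ranks: the lowest score receives rank 1.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def avg_rank_distance(form_a, form_b):
    # Mean absolute difference between each person's rank in the two forms.
    ra, rb = simple_ranks(form_a), simple_ranks(form_b)
    return mean(abs(a - b) for a, b in zip(ra, rb))
```

With 93 respondents, an average distance of 12.15 positions (as reported for Neuroticism) means a typical respondent shifted about 13% of the sample's range between forms.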
As can be seen, the correlations of scales between test and re-test were high in all groups. Only
for the Openness and Agreeableness scales in Group A + B did they appear slightly lower (0.56
and 0.54) than for the other scales. It must be noted that the Agreeableness scale had the lowest
correlations of all scales; for all three groups it was around 0.6. As for Openness, the
correlations in Groups C and D were much higher than in Group A + B. Otherwise, there are no
other notable differences in the correlations or in the distances.
Comparing these results among groups, in Group C the correlations between the ranks from test
and re-test were notably higher than in the comparison group (A + B). The average differences
in ranks are shown in Table 16 (the numbers in parentheses are the real obtained differences,
but since the sample size was double in the first group, the results were adjusted accordingly).
As for Group D, the correlations between ranks in test and re-test were also high, falling
between those of Group C and Group A + B.
As is visible from the table below, the correlations were lowest when comparing ranks in the
ipsative form to the normative form. The absolute distances between the ranks were, on the
other hand, the highest in this group. Still, the final correlations suggest that the forms are similar.
     Group A + B (Ipsative vs. Normative)   Group C (Ipsative)          Group D (Normative)
     Spearman   Average Difference          Spearman   Average Diff.    Spearman   Average Diff.
N      0.78        12.23                      0.91       8.61 (4.31)      0.76      13.20 (6.60)
E      0.78        12.75                      0.81      10.01 (5.50)      0.83      11.10 (5.60)
O      0.56        18.04                      0.91       8.41 (4.20)      0.79      13.70 (6.90)
A      0.54        18.34                      0.66      15.42 (7.71)      0.63      17.10 (8.50)
C      0.76        13.78                      0.87       9.02 (4.51)      0.82      11.30 (5.70)
Table 16 Correlations and distances between ranks from test and re-test in each group
4.4.4. Analysis of items – reliability of separate items
For a more detailed comparison of ipsative versus normative items, the results obtained in
Groups A and B were used. The comparison was conducted using cross-tables with graphical
representation in the form of histograms, and with the resulting correlation coefficients. The
main goal was to find out whether there is a relation between the scoring profiles in the two
forms for each group of items. For this, each group of items was studied separately. First, a
graph of frequencies was created for every item in a group of items, in order to see how many
points respondents assigned in the ipsative form if they assigned 1, 2, 3, 4 or 5 points in the
normative form. Assuming that the results were similar in both forms, there should have been a
high correlation between the points assigned by respondents to the items of a group in the
different forms. To explain: if respondent No. 1 assigned 1 point to item 1 in the normative
version, it is expected that he also assigned a low point value to this item in the ipsative form
(an example of the statistics used is in Attachment 3).
For estimating the strength of the relationships, non-parametric correlation methods, namely
Spearman's correlation coefficient and Kendall's tau, were used, since the scale for points to
assign ranged from 1 to 5 in the normative form and from 0 to 5 in the ipsative form. The results
showed positive correlations for every item. However, the resulting correlations were not very
high: on average they reached 0.47 for Spearman's coefficient and 0.40 for Kendall's tau. On
the other hand, half of the items obtained Spearman's correlations higher than 0.5 and only 6
items received coefficients below 0.3. The table of all correlations is shown in Attachment 4.
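Kendall's tau itself can be sketched in a few lines. The variant below is tau-a, which ignores tie corrections (the plug-in used in the thesis may apply a tie-corrected tau-b); it is illustrative only:

```python
def kendall_tau_a(xs, ys):
    # Tau-a: (concordant pairs - discordant pairs) / number of pairs.
    # Tied pairs count as neither, so this understates tau-b when ties exist.
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Because tau counts pairwise agreements rather than squared rank distances, it typically comes out somewhat lower than Spearman's rho on the same data, consistent with the 0.40 versus 0.47 averages above.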
4.4.5. Comparison of final rank results
The last step in comparing the forms was to find out whether there is a similarity in the final
ranks of the scales. For this, each respondent's results from the ipsative and normative form
were compared individually. Firstly, the ranking of scales in the ipsative form was determined
using the total score values. Then the ranking of scales in the normative form was determined.
The rankings of the ipsative data were arranged from smallest to largest for each individual and
the rankings of scales from the normative form were assigned accordingly. Then the distance
between the rankings of the scales was determined by subtracting the rank of a specific scale
obtained in the normative form from the rank of the same scale in the ipsative form. The results
showed that, out of 93 comparisons, 11 respondents had identical ranks in both forms.
The rest of the respondents had different ranks, and thus the correlation of the ranks of scales
in the two forms was calculated in order to estimate the closeness of the rankings. For this
purpose, Spearman's correlation and Kendall's tau appeared to be the best methods. Pearson's R
can be used with parametric data only, which was not the case here (since the ranks of scales
were ordinal data at most).
The correlations obtained for each individual were averaged. The average Spearman correlation
was 0.70; the average Kendall's tau across respondents was estimated at a satisfying 0.63. The
results showed that 60% of respondents achieved correlations higher than 0.8 and more than
80% of respondents had correlations higher than 0.5. On the other hand, 8 out of 93 respondents
obtained zero or negative correlations in their rankings.15
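For the five scale ranks of a single respondent, Spearman's rho reduces to the classical formula 1 - 6*sum(d^2)/(n*(n^2 - 1)) when there are no ties. A sketch with a hypothetical respondent (the rank vectors are illustrative, not taken from the thesis data):

```python
def spearman_from_ranks(ranks_a, ranks_b):
    # Spearman's rho for two tie-free rank vectors:
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical respondent: ranks of the five scales (N, E, O, A, C)
# in the ipsative and the normative form.
ipsative = [1, 4, 2, 3, 5]
normative = [1, 3, 2, 4, 5]
rho = spearman_from_ranks(ipsative, normative)  # 0.9 for this example
```

Averaging such per-respondent coefficients over all 93 respondents gives the 0.70 figure reported above.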
4.5. Variability of total data
The results did not confirm Hypothesis 2, namely that the ipsative form would yield results
with greater variability. Firstly, descriptive analysis of the total results of the scales was used.
The comparison of variance indicators of the ipsative and normative form is presented in Table
17 below. This table presents the actual total variability.
                  Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
Range normative       41           38           40          38               41
Range ipsative        60           47           49          42               46
ST DEV normative     9.19         7.98         6.79        6.35             8.67
ST DEV ipsative     12.10         9.93         8.41        6.46             9.32
Table 17 Standard deviation and Range in Normative and Ipsative form
However, these results had to be adjusted, since in the ipsative form respondents could assign
0-5 points to each item, whereas in the normative form they could assign 1-5 points. The fact
that in the ipsative form they could choose among 6 answers and in the normative form among
only 5 inherently produces higher variability in favour of the ipsative data. Therefore, all the
points obtained from respondents were divided by 6 for the ipsative data and by 5 for the
normative data. This way a unified point scale was achieved for both tests, so that variability
estimators such as the standard deviation could be calculated. After this unification of the point
scale, the variability and standard deviations in the two forms were comparable (Table 18).
15 For estimating Spearman's correlation and Kendall's tau a software plug-in called Analyse-it for Microsoft Excel was used.
                  Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
Range normative      8.20         7.60         8.00        7.60             8.20
Range ipsative      10.00         7.83         8.17        7.00             7.67
ST DEV normative     1.84         1.60         1.36        1.27             1.73
ST DEV ipsative      2.02         1.66         1.40        1.08             1.55
Table 18 Standard deviation and Range in Normative and Ipsative form (adjusted scores)
Secondly, in order to find out whether there were statistically significant differences between
the normative and ipsative form in their variances, Levene's test was used.16 For this, the same
adjusted data set was used, because this set had a unified scale. Levene's test showed a
significant difference only for the Agreeableness scale, with F = 4.92, N = 358 and p = 0.027.
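Levene's statistic itself is straightforward to compute. A minimal mean-centred sketch follows (the thesis used the Analyse-it plug-in; the p-value additionally requires the F distribution, which is omitted here, and the illustrative groups are not the thesis data):

```python
from statistics import mean

def levene_W(groups):
    # Mean-centred Levene statistic: an ANOVA on the absolute deviations
    # Z_ij = |x_ij - mean_i|. Under equal variances, W ~ F(k - 1, N - k).
    k = len(groups)
    N = sum(len(g) for g in groups)
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    zbar_i = [mean(zi) for zi in z]
    zbar = sum(sum(zi) for zi in z) / N
    between = sum(len(zi) * (zb - zbar) ** 2 for zi, zb in zip(z, zbar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, zbar_i) for v in zi)
    return (N - k) * between / ((k - 1) * within)
```

Two groups with identical spread give W near 0, while unequal spreads drive W up; the resulting statistic is then compared against the F(k - 1, N - k) critical value.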
The problem of moderate responding was also explored using a different method. The question
is how often respondents chose to assign a certain number of points compared to the other
numbers in total. For this, Graph 6 was created, where the frequencies of points from the total
data (from all respective groups) are shown for both test forms.
Graph 6 Histogram of distribution of points used in total (frequency of points assigned across all items)

Points assigned       0       1       2       3       4       5
Ipsative data       15.8%   21.4%   26.8%   22.9%   9.1%    3.9%
Normative data       n/a     9%      17%     25%     30%     20%

16 For estimating Levene's scores a software plug-in called Analyse-it for Microsoft Excel was used.
The graph shows that the points were distributed roughly according to the normal distribution
curve. In the normative form, the data were skewed towards higher points (4 or 5), indicating
that most people chose higher values. In the ipsative form, on the other hand, the curve is
skewed towards lower points (0, 1, 2), which in this case can be the result of the forced choice
and especially of the limit on the points to distribute.
More importantly, it is visible that in the normative form the most frequently chosen points
were assigned with a high frequency (4 points in 30% of cases and 3 points in 25%), whereas
in the ipsative data the frequencies were slightly lower (2 points with 27% frequency and 3
points with 23% frequency). On the other hand, the distribution of points was clearly very
similar in both forms, which suggests that the ipsative format did not eliminate the effect of
moderate responding.
4.6. Social desirability
In this experiment, no specific items or scales were used to measure social desirability.
Therefore, the analysis of social desirability was conducted using frequency graphs of the
scales in the different forms. Upon comparing the frequency graphs of means for each scale, it
appears that for some scales the ipsativity of the test did not prevent socially desirable answers
as predicted, and therefore Hypothesis 2 was not supported (however, this is just an estimate).
This was the case for Neuroticism, where the ipsative results did not resemble the shape of the
normal distribution curve; on the contrary, the histogram rather resembled an inverted
logarithmic function. This pattern is visible in Graph 8. While in the normative form the total
scores of respondents were distributed roughly symmetrically around the mean, in the ipsative
form the frequencies decreased with increasing score.
Graph 7 Neuroticism Normative
Graph 8 Neuroticism Ipsative
On the other hand, the opposite was true for Extraversion. In this scale, a social desirability
effect was indicated in the normative form (Graphs 9 and 10), since the histogram is skewed
towards higher values, while the results in the ipsative form were distributed more
symmetrically. This indicates that in this case the normative data were influenced by social
desirability, but the ipsative data were not.
Graph 9 Extraversion Normative Form
Graph 10 Extraversion Ipsative
5. Discussion
5.1. Limitations of study
5.1.1. Respondents
The fact that mostly students participated did not influence the results of this study. It would
have been of importance only if external validity were being tested.
What is important to mention is the possible negative effect of non-homogeneous groups within
the sample, which could influence the t-tests and the correlations of test and re-test. In the
normative sample (Groups A, B, D without duplicates) there were 23 persons older than 25
years. Their results were significantly different from those of the younger respondents in the
Neuroticism, Extraversion and Conscientiousness scales. As for the ipsative sample,
statistically significant differences between the young and the “adults” were observed in the
Conscientiousness scale. This indicates that different age groups could influence the total
results.
The non-homogeneity could be eliminated by administering the test to a certain age group only,
or simply by not including the results of the small subclass (in this case the “adults”) in the
evaluation.
As for gender, even though there were twice as many female as male respondents, this did not
negatively affect the results, because no significant differences were detected between the total
scores of males and females (see chapter 4.2.1).
5.1.2. Administration through internet
Initially, distributing the test through the internet posed several problems. First, because of the
greater distance between administrator and respondents, lower collaboration was expected.
This was overcome by collecting data from more than 300 respondents. Later, an experimental
mortality of 40% was experienced, but thanks to the high initial number, the planned amount of
data was collected.
An experimental mortality of 40% is considerably high. It can be explained by the respondents'
lack of motivation to participate in the re-test. Since it was clear that the experimental mortality
would be high, respondents were motivated by the promise that they would receive their results
in the form of a personality profile with a short description after finishing both test forms. No
financial compensation was offered.
Another problem with administering the test online was the rather difficult instructions on how
to fill out the ipsative version. However, as described in chapter 3.2, this problem was
overcome by including a control mechanism in the Excel worksheet administered to Group A,
as well as in the online version available on the website, which was administered to Groups C
and B.
Despite these precautions, there were 18 cases out of 186 (nearly 10%) in which people
disregarded both the instructions and the control mechanism and distributed the points as on
Likert scales. These results were unusable and the respondents had to be excluded from the
data. On the other hand, it is probable that a similar number of wrongly filled tests would
appear if they were administered face-to-face.
5.1.3. Qualitative analysis – some comments from respondents
From the qualitative viewpoint, I received feedback from a couple of respondents. Generally,
they disliked the ipsative form, because they could not assign as many points as they wanted to
each item in an item set. This was especially true in Group B, where they completed the
normative form first and then the ipsative one. Some of the respondents even refused to fill out
the ipsative form for this reason. In addition, they reported problems with assigning high scores
to several items when they thought that these items described them much better than the other
items.
Generally, respondents found Form B more demanding, because they had to consider all items
from a set together. Some of them informed me that they had difficulties filling out the form
because, after assigning certain points to some items in a set, they realised that the sum would
exceed 10 and they had to start over and reassign the values given to the respective items.
Some found this very frustrating.
5.1.4. Distribution of points
Based on the comments from respondents, I considered the effect of a different distribution
system. My original goal was to improve the pointing system used by Belbin in his Team Roles
test by increasing the number of points to distribute. This was partially successful, but the
respondents still felt very confined in the distribution. The question that emerged is whether it
would have any effect if the number of points to distribute were doubled (20), tripled (30) or
halved (5) in each item set. I believe that a very small number of points would lead respondents
to basically just ranking the items. That is why I introduced the increased number of points in
the first place, in order to allow for quantification of the differences between ranks.
I also believe that increasing the total points to distribute to 20 or 30 would not have a positive
effect on the results, nor would it satisfy the respondents. To explain: with 10 points,
respondents can already quantify the differences between items. More points could result in
greater absolute differences, but probably would not increase the total difference between
scales. For example, assigning 0, 2, 4, 6 and 8 points under a limit of 20 would be exactly the
same as assigning 0, 1, 2, 3 and 4 points under a limit of 10. This, however, would not be
possible if the limit were 5. Furthermore, very high numbers of points to distribute would cause
respondents additional difficulties in calculating.
There is one further aspect to consider, namely the limit of 5 points per item. This was set in
order to prevent extreme values in some scales, which would result in extreme values in other
scales from the opposite pole. The number 5 was also chosen in order to produce conditions
similar to those in the normative form, where the scale ranged from 1 to 5. It would be
interesting to observe how the results would differ if there were no limit.
5.2. Properties of the semi-ipsative and normative measure and
applicable statistics
5.2.1. Ipsativity of the hybrid measure
The correlation table in chapter 4.3.2 indicates that the semi-ipsative measure did not show
typical ipsative properties, since the inter-correlations of the scales were not all negative
(Extraversion correlated positively with Agreeableness (0.144) as well as with Openness
(0.128)). Furthermore, the data did not have the properties of ipsative measures as summarised
by Johnson et al. (1988). Firstly, the sums of rows and columns in the correlation matrix did
not equal 0. In this case, the sums of correlations in rows were very close to 0 for
Conscientiousness and Openness (0.08 and -0.03), close to 0 for Extraversion (0.16), and far
from 0 for Neuroticism (-0.37) and Agreeableness (0.39). Next, the scales should have
inter-correlations converging to -0.25, as the formula for the inter-correlation of m ipsative
scales, -1/(m - 1), indicates for m = 5. However, different inter-correlations were observed.
These results indicate that the test lacked some of the typical ipsative characteristics. Therefore
the format of limited distribution of points is mathematically less ipsative than other
comparable measures.
5.2.2. The similarity of the two forms
Various methods were used to assess the similarity or difference of Form A and Form B (scatter
plots, non-parametric correlations, comparison of final ranks, item analysis). All these methods
indicated close linear relations between the two test forms (see chapter 4.4).
Furthermore, the correlations observed were high in all groups. It was also shown that the
ipsative group had the lowest rank distances and the highest correlations. This result was not
expected, since several researchers, such as Hammond and Barrett (1996), showed that ipsative
measures obtain insufficient test-retest reliability. It was therefore assumed that the correlations
would not be high. A possible explanation for this result could be the forced-choice format,
since this format is based on ordering the items within a set. Because the items are
inter-dependent, if a respondent in the re-test chose the same value for one item as in the test,
the probability is high that he chose the same values for the other items as well, since he had
fewer options to rank the remaining items (or distribute the points).
On the other hand, this applies both ways, so if a respondent in the re-test chose a different rank
or amount of points for one item than in the test, it is highly probable that the discrepancies
between the other items (when comparing their results in test and re-test) will be higher as well.
Therefore, this result might have been only a coincidence.
Next, the comparison of the final ranks of the scales (chapter 4.4.5) showed that the final ranks
obtained in the two forms also correlated highly. This is yet another indicator of the similarity
of the two forms.
5.2.3. Applicability of methods for statistical analysis
Even though chapter 1.6.2.1 states that means and standard deviations should not be used with
ipsative data, there is reason to believe that this limitation does not apply to the semi-ipsative
measure used here. The general argument against such methods is that ipsative data take the
form of categorical, or at most ordinal, data (Baron, 1996). This is true of some types of ipsative
measures; however, the form used in this study would probably satisfy the requirements of a
higher level of measurement. In fact, the data possess qualities comparable to those obtained
from Likert scales (the normative form) in terms of applicable mathematical operations, except
for the unclear concept of measurement error and the inter-dependencies of the scales.
The next argument against the use of means is that the means and standard deviations are not
independent and therefore cannot be further analysed by the methods of CTT (Johnson, Wood,
& Blinkhorn, 1988). It is true that purely ipsative measures, in which only ranking is possible,
would produce distorted results, since the distances between ranks are entirely relative.
However, as shown in the results, Form A and Form B were highly correlated, which means
that there is a high similarity between the ipsative (rather, semi-ipsative) and normative forms
used in this study. Also, as mentioned in chapter 5.2.2, the semi-ipsative format reduced the
ipsativity of the measure. Last, the measure did not provide relative scores only, but partially
absolute scores, because respondents did not only order the items but also quantified their
preferences. Therefore, I would suggest that means and standard deviations can be calculated
for this measure.
Secondly, the use of t-tests will be discussed. As mentioned in the introduction, ipsative
measures are intra-individual only and should not be used for inter-individual comparison
(Closs, 1996). Thus parametric tests such as the t-test should be out of the question. However,
the previous arguments justify their use here, since t-tests are calculated from means and
standard deviations. Obviously, this justification does not solve the problem of the relativity of
the data. The counter-argument is that the semi-ipsative measure was not only relative, and
because of the quantification, inter-individual comparison should be possible. In addition, in
order to confirm the results obtained from the t-tests, the non-parametric Mann-Whitney test
was also used and it showed the same results. Alternative non-parametric comparison tools,
such as the comparison of the final position ranks of respondents on the respective scales, were
used as well. All of these indicated results similar to the t-tests.
Next, correlations based on standard deviations (Pearson's R) were also considered
inappropriate for analysis, because of the inter-dependencies of the scales (Johnson, Wood, &
Blinkhorn, 1988). Nonetheless, as discussed in chapter 5.2.1, the ipsativity of this measure was
lower than that of other instruments, so the correlations should be less affected and can give
relevant results. When it was necessary to use correlations (to estimate the closeness of
relations), non-parametric correlation coefficients such as Spearman's correlation or Kendall's
tau were applied (Pearson's R was computed only for comparison). However, as Table 15
shows, the correlation coefficients estimated by Spearman's method were very similar to those
estimated by Pearson's R; they differed by only 0.01 on average. This indicates that for the
“semi-ipsative” form Pearson's R would also be appropriate for analysis.
Concerning factor analysis, it was conducted only to show that its results are uninterpretable.
The results confirmed this assumption, since the rotated correlation matrix was permeated by
illogical correlations. As stated in chapter 1.6.3, factor analysis cannot be used with ipsative
data under CTT.
Last, the use of cluster analysis is justifiable for the same reasons as the use of non-parametric
correlation coefficients. This method cannot substitute for factor analysis in terms of results,
but it will at least provide an indication of which groups of items belong together.
5.2.4. Reliability of the semi-ipsative vs. normative scale
The test-retest reliability analysis conducted in Group D for the normative form showed that the
normative form of the measure was reliable. High correlations were achieved in all scales. On
the other hand, the t-test comparing test and re-test results showed a significant difference on
the Extraversion scale. Since this scale showed the highest test-retest correlation of all scales,
there is no adequate explanation for this result.
Next, the results showed that the ipsative form had slightly higher internal consistency than the
normative form. Nonetheless, as discussed in chapter 1.6.2.2, Cronbach's alphas should not be
used with ipsative measures at all, because the inter-dependency of the scales influences the
final results (Tenopyr, 1988). That said, a scale showing high internal consistency, such as
Conscientiousness (0.89), could artificially inflate the alphas of the other scales as well.
Therefore these results must be interpreted with caution.
As for the reliabilities of the individual items described in chapter 4.4.4, the analysis showed
moderate to high correlations. Considering that respondents answered the same items in the two
forms, the expected correlations should be high. On the other hand, this analysis is quite
different from the test-retest analysis conducted in chapter 4.2.5. The comparison of the
relations between the points assigned was difficult, since in the ipsative version respondents
had to assign the points by balancing them within groups. The inter-dependency of items in the
ipsative form may have forced respondents to assign different point values to certain items than
they wanted. That is the reason why the correlations obtained are not as high. However, I
believe that in this analysis any coefficient above 0.5 can be considered acceptable.
5.2.5. Advantages of the semi-ipsative measure
As described in chapter 1.7.1, the ipsative test format can strongly influence response bias
(McCloy, Heggestad, & Reeve, 2005). If the items are well grouped, the bias can be lower than
in normative tests; otherwise the opposite is true (Hammond & Barrett, 1996). In this study,
only the frequency graphs were inspected, in order to analyse whether there were any
abnormalities in the distribution of average scale scores over the tested population. Such
irregularities were observed in both the ipsative and the normative form. In the ipsative form,
respondents attempted to obtain scores as low as possible on the Neuroticism scale (this is
indicated by the shape of the frequency distribution in Graph 8). Interestingly, this trend was
not observed in the normative form. On the other hand, in the normative form respondents
tended towards higher scores on the Extraversion scale (Graph 9). Therefore the results are not
conclusive.
Furthermore, the analysis of the frequencies of points assigned in the two forms across all
respondents did not reveal significant differences between the forms in this respect. Thus, the
expected reduction of moderate responding was not confirmed.
As for the greater differentiation of profiles suggested by Tamir (Tamir & Lunetta, 1977), the
analysis of variance of the total results provided no evidence for it. It must be noted that the
data were adjusted in order to compare the variances, because the two forms used different
scales.
To summarize, none of the advantages of ipsative testing were observed.
5.3. Improvements
5.3.1. Testing the validity of two forms
Even though the two forms were compared with respect to their psychometric properties, there
is no evidence regarding their validity. Therefore, we cannot say which form better estimates
the true scores of respondents.
Initially, the Excel file included an evaluation question in which respondents indicated on a
scale from 1 to 10 how well the results described them. The intention was to compare the
subjective agreement with the resulting profile in the two forms. This way the appropriateness
(or validity) of each test form could be estimated using quantitative data. However, the question
regarding the subjective evaluation of the result was not asked in the normative version because
of methodological constraints (and perhaps because notably different results would undermine
the remaining respondents' faith in the validity of psychological tests).
On the other hand, the estimation of validity is an important question and has been extensively
addressed by other researchers. Since there is no conclusive evidence on which form is more
valid, this remains a fruitful topic for further research.
5.3.2. Adjustment of design in order to reveal response bias
Another suggestion for improving the experimental design is to introduce a validation scale.
This would be especially useful for estimating the effect of response bias, especially social
desirability, in the various forms. Apart from that, the circumstances of administration could be
manipulated in order to induce faking of desired attributes. This way respondents would be
motivated to achieve high scores on certain scales. The results would reveal whether the
ipsative format can reduce faking.
5.3.3. Use of same scale for ipsative and normative data
A major drawback of this experiment was that Form A used a 5-point scale, whereas Form B
used a 6-point scale for each item. To compare the results better, this should be adjusted and
scales of the same length should be used. For example, in Form B the instruction could be to
distribute 10 points with a per-item maximum of 4.
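The suggested redesign is easy to enforce programmatically. A hypothetical validity check (the function name and parameters are illustrative, not part of the thesis instruments):

```python
def valid_allocation(points, total=10, per_item_max=4):
    # The suggested rule: exactly `total` points distributed over an item set,
    # with no single item receiving more than `per_item_max` points.
    return sum(points) == total and all(0 <= p <= per_item_max for p in points)
```

For instance, an allocation of 4, 3, 2, 1, 0 would pass, while 5, 3, 1, 1, 0 would be rejected because one item exceeds the per-item maximum; such a check could replace the Excel control mechanism described in chapter 3.2.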
6. Conclusion
The data obtained from the ipsative form were similar to those obtained from the normative
form. They appeared closely related when represented graphically, and this close relation was
confirmed by Spearman's correlation coefficient, which exceeded 0.7 on three scales and 0.55
on the remaining two (Agreeableness and Openness). It was therefore confirmed that this
particular semi-ipsative measure is similar to its normative counterpart. Furthermore, the
test-retest results showed that the ipsative form yielded higher correlations after retesting
than the normative form. As for the second hypothesis, the expected positive properties of
ipsative measures (reduction of response bias and higher variability) were not observed.
Concerning the psychometric properties, some statistical methods could be applied to this
specific test form even though they are not generally applicable to ipsative measures (e.g.
standard deviations, t-tests and correlation coefficients). This does not hold for factor
analysis, where the results proved uninterpretable; it can, however, be partially replaced by
cluster analysis, which showed promising results. Still, since no benefits of the ipsative
measure were confirmed, I would not suggest the use of ipsative (or semi-ipsative) measures
in psychological measurement, at least not within the classical test theory approach.
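The form comparison above relies on Spearman's rank correlation. As a self-contained illustration (pure Python; the scale scores below are invented for demonstration, not the thesis data), the coefficient can be computed as Pearson's correlation between rank vectors:

```python
def ranks(values):
    """Average ranks, 1-based; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical normative vs. ipsative scale scores for six respondents:
normative = [34, 28, 41, 22, 37, 30]
ipsative = [15, 12, 18, 10, 14, 13]
print(round(spearman(normative, ipsative), 2))  # → 0.94
```

With tied scores the helper assigns average ranks, the same convention used by common statistical packages.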
References
Allen, M. J., & Yen, W. M. (1979). Introduction to Measurement Theory. Belmont: Wadsworth.
Baron, H. (1996). Strengths and limitations of ipsative measurement. Journal of Occupational
and Organizational Psychology, 69, 49-56.
Bartram, D. (1996). The Relationship Between Ipsatized and Normative Measures of Personality.
Journal of Occupational Psychology, 69, 25-39.
Belbin, M. (2003). Team Roles at Work (1st ed.). Amsterdam: Elsevier. ISBN 0-7506-2675-5.
Belbin, M. (2013). Method, Reliability & Validity, Statistics & Research: A Comprehensive Review of
Belbin Team Roles. London: Belbin. Retrieved from
http://www.belbin.com/content/page/4343/BELBIN%28uk%29-2013-
A%20Comprehensive%20Review.pdf
Boring, E. G. (1950). A History of Experimental Psychology (2nd ed.). New York: Appleton-Century-
Crofts.
Borkowski, T. (1989). Validation Review. Thames Ditton: Saville & Holdsworth Ltd.
Brown, A. (2010). How Item Response Theory can Solve Problems of Ipsative Data. Barcelona:
Department of Personality, Evaluation and Psychology.
Burns, R. P., & Burns, R. (2009). Business Research Methods and Statistics Using SPSS (1st ed.).
London: SAGE Publications Ltd. Retrieved April 20, 2015, from
http://www.uk.sagepub.com/burns/website%20material/Chapter%2023%20-
%20Cluster%20Analysis.pdf
Cano-García, F. J., Padilla-Muñoz, E. M., & Carrasco-Ortiz, M. (2005). Personality and contextual
variables in teacher burnout. Personality and Individual Differences, 38(4), 929-940.
Cattell, R. B. (1944). Psychological measurement: Normative, ipsative, interactive. Psychological
Review, 51, 291-302.
Clemans, W. V. (1966). An analytical and empirical examination of some properties of ipsative
measures. Psychometric Monograph No. 14. Richmond, VA: Psychometric Society.
Retrieved from http://www.psychometrika.org/journal/online/MN14.pdf
Closs, J. (1996). On the factoring and interpretation of ipsative data. Journal of Occupational and
Organizational Psychology, 69, 41-47.
Conrad, M. A. (2006). Aptitude is not enough: How personality and behavior predict academic
performance. Journal of Research in Personality, 40(3), 339-346.
doi:10.1016/j.jrp.2004.10.003
Cornwell, J. M., & Manfredo, P. A. (1994). Kolb's Learning Style Theory revisited. Educational and
Psychological Measurement, 54(2), 317-327.
Cornwell, J., & Dunlap, P. (1991). On the questionable soundness of factoring ipsative data: A
response to Saville & Willson. Journal of Occupational and Organizational Psychology, 67,
89-100.
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of
Psychology, 41, 417-440.
Drost, E. A. (2012, July). Validity and reliability in social science research. Retrieved April 20, 2015, from
http://www.erpjournal.net/wp-content/uploads/2012/07/ERPV38-1.-Drost-E.-2011.-Validity-
and-Reliability-in-Social-Science-Research.pdf
Emmel, N. (2013). Sampling and choosing cases in qualitative research: A realist approach (1st ed.).
London: Sage. ISBN 978-0-85702-510-4.
Frederick, R., & Foster, H. (1991). Multiple Measures of Malingering on a Forced-Choice Test of
Cognitive Ability. Psychological Assessment: A Journal of Consulting and Clinical Psychology,
3(4), 596-602.
Furnham, A., Steele, H., & Pendleton, D. (1993). A psychometric assessment of the Belbin Team-Role
Self-Perception Inventory. Journal of Occupational and Organizational Psychology, 66, 245-
257.
Furr, M. R. (2014). Psychometrics An Introduction (2 ed.). California: SAGE Publications, Inc.
Gabor, M. R. (2013). Q Methodology (Q Factor Analysis) - Particularities and Theoretical
Considerations for Marketing Data. International Journal of Arts and Commerce, 2(4), 116-
126. Retrieved from
http://www.ijac.org.uk/images/frontImages/gallery/Vol._2_No._4_April_2013/10.pdf
Geisinger, K. (2012). Norm- and criterion-referenced testing. In H. Cooper (Ed.), APA Handbook of
Research Methods in Psychology (pp. 371-394). Washington: American Psychological
Association.
Geldhof, G. J., Gestsdóttir, S., & Stefansson, K. (2014). Selection, optimization and compensation:
The structure, reliability and validity of forced-choice versus Likert-type measures in a
sample of late adolescents. International Journal of Behavioral Development, 1-15. Retrieved
from ijbd.sagepub.com
Gibbons, P. J. (1995). Hypothesis testing in personal questionnaire validation research. European
Congress of Psychology.
Gordon, L. V. (1976). Survey of Interpersonal Values: Revised manual. Chicago, IL: Science Research
Associates.
Gregory, R. J. (1991). Psychological Testing: History, Principles and Applications (Vol. 1). Boston:
Pearson Education.
Guilford, J. P. (1952). When not to factor analyze. Psychological Bulletin, 49(1), 26-37.
Guilford, J. P. (1954). Psychometric methods (Vol. 2). New York: McGraw-Hill.
Hammond, S., & Barrett, P. (1996). The psychometric and practical implications of the use of ipsative,
forced-choice format questionnaires. The British Psychological Society: Occupational
Psychology Conference, Book of Proceedings, 1(1), 135-144.
Hendl, J. (2004). Přehled statistických metod zpracování dat: analýza a metaanalýza dat [Overview of
statistical methods of data processing: Analysis and meta-analysis of data] (1st ed.). Praha:
Portál. ISBN 8071788201.
Hicks, L. E. (1970). Some Properties of Ipsative, Normative and Forced-Choice Normative Measures.
Psychological Bulletin, 74(3), 167-184.
Howell, D. (2013). Statistical Methods for Psychology (8th ed.). Wadsworth: Cengage Learning.
ISBN-10: 1-111-84085-7.
Hřebíčková, M. (1997). Jazyk a osobnost: pětifaktorová struktura popisu osobnosti [Language and
personality: The five-factor structure of personality description] (1st ed.). Brno: Masarykova
univerzita ve spolupráci s Psychologickým ústavem AV ČR. ISBN 8021015225.
Hřebíčková, M. (2011). Pětifaktorový model v psychologii osobnosti: Přístupy, diagnostika, uplatnění
[The five-factor model in personality psychology: Approaches, diagnostics, applications]
(1st ed.). Praha: Grada Publishing.
Chen, H. T., Underhill, C. M., & Bearden, R. (2008). Evaluation of the fake resistance of a
forced-choice paired-comparison computer adaptive personality measure. Millington, TN: Navy
Personnel Research, Studies and Technology Division.
Jackson, D. N., Wroblewski, V. R., & Ashton, M. C. (2000). The impact of faking on employment tests:
Does forced-choice offer a solution? Human Performance, 13, 371-388.
Johnson, C., Wood, R., & Blinkhorn, S. F. (1988). Spuriouser and spuriouser: The use of ipsative
personality tests. Journal of Occupational Psychology, 61, 152-162.
Kahneman, D. (2012). Thinking, Fast and Slow (1st ed.). London: Penguin Books. ISBN
978-0-14-103357-0.
Kline, P. (2000). Handbook of Psychological Testing (2 ed.). London: Routledge.
Kopelman, R., & Rovenpor, J. (2006). Allport-Vernon-Lindzey Study of Values. In Encyclopedia of
Career Development [online]. SAGE Publications. Retrieved April 15, 2015, from
http://www.sage-ereference.com/view/careerdevelopment/n6.xml
Kurtz, J., & Parrish, C. (2010). Semantic response consistency and protocol validity in structured
personality assessment: The case of the NEO-PI-R. Journal of Personality Assessment, 76(2),
315-332.
Loo, R. (1999). Issues in Factor-Analyzing Ipsative Measures: The Learning Style Inventory Example.
Journal of Business and Psychology, 14(1).
McCloy, R. A., Heggestad, E. D., & Reeve, C. (2005). A silk purse from the sow's ear: Retrieving
normative information from multidimensional forced-choice items. Organizational Research
Methods, 8(2), 222-248. Retrieved from http://www.pabst-publishers.de/psychology-
science/3-2006/ps_3_2006_209-225.pdf
Meade, A. (2004). Psychometric problems and issues involved with creating and using ipsative
measures for selection. Journal of Occupational and Organizational Psychology, 77, 531-552.
Michel, J. (2008). Is Psychometrics Pathological Science? Measurement: Interdisciplinary Research
and Perspectives, 6(1), 7-24. Retrieved from http://dx.doi.org/10.1080/15366360802035489
Novick, M. R. (1965). The axioms and principal results of classical test theory. Educational Testing
Service Research Bulletin, 1965(1), 1-18.
Ones, D., & Viswesvaran, C. (1999). Meta-analyses of Fakability Estimates: Implications for
Personality Measurement. Educational and Psychological Measurement, 59(2), 197-210.
Oppenheim, A. N. (1996). Questionnaire Design, Interviewing and Attitude Measurement (2 ed.).
London: Continuum.
Paulhus, D. L. (2002). Socially desirable responding: The evolution of a construct. In H. I. Braun,
D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational
measurement (pp. 49-69). Erlbaum.
Ramlo, S., & Newman, I. (2010). Classifying individuals using Q methodological analysis: Applications
of two mixed methodologies for program evaluation. Journal of Research in Education, 20(2),
21-31. Retrieved from
http://www.academia.edu/2215077/Classifying_individuals_using_Q_Methodology_and_Q_
Factor_Analysis_Applications_of_two_mixed_methodologies_for_program_evaluation
Romesburg, C. (2004). Cluster Analysis for Researchers (1st ed.). North Carolina: Lulu Press. ISBN
1-4116-0617-5.
Rust, J. (2008, August). Psychometrics. Retrieved from First Psychometric Laboratory: "The Birth of
Psychometrics in Cambridge": http://www.psychometrics.cam.ac.uk/about-us/our-
history/first-psychometric-laboratory
Saville, P., & Willson, E. (1991). The Reliability and Validity of Normative and Ipsative Approaches in
the Measurement of Personality. Journal of Occupational and Organizational Psychology, 64,
219-238.
Shrout, P. (2012). Reliability. In C. Harris (Ed.), APA Handbook of Research Methods in Psychology (pp.
643-660). Washington: American Psychological Association.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.
Tamir, P., & Lunetta, V. (1977). A comparison of ipsative and normative procedures in the study of
cognitive preferences. American Educational Research Association, 4-8.
Tenopyr, M. L. (1988). Artifactual reliability of forced-choice scales. Journal of Applied Psychology,
73(4), 749-751.
Terracciano, A., Costa, P. T., & McCrae, R. R. (2006). Personality plasticity after age 30. Personality
and Social Psychology Bulletin, 32(8), 999-1009. doi:10.1177/0146167206288599
Traub, R. E. (1997). Classical Test Theory in Historical Perspective. The Ontario Institute for Studies in
Education of the University of Toronto, 16(4), 8-14.
Urbánek, T., Denglerová, D., & Širuček, J. (2011). Psychometrika - Měření v psychologii
[Psychometrics - Measurement in psychology] (Vol. 1). Praha: Portál. ISBN 978-80-7367-836-4.
Vogt, P. W. (2011). Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social
Sciences (4th ed.). Thousand Oaks: SAGE. ISBN 9781412971096.
Vries, A. L., & Van der Ark, A. L. (2008). Scoring methods for ordinal multidimensional forced-choice
items. Maastricht: Tilburg University. Retrieved from http://dugi-
doc.udg.edu/bitstream/handle/10256/744/VriesArknew2b.pdf?sequence=1
Waters, C. W. (1964). Construction and validation of a Forced-Choice Over- and Under- Achievement
Scale. Educational and Psychological Measurement, 24(4), 921-928.
doi:10.1177/001316446402400419
White, L. A., & Young, M. C. (1986). Development and validation of the Assessment of Individual
Motivation. Paper presented at the annual meeting of the American Psychological
Association, San Francisco, CA.
Wright, S. S., & Miederhoff, P. A. (1999). Selecting Students with Personal Characteristics Relevant to
Pharmaceutical Care. American Journal of Pharmaceutical Education, 63, 132-138.
List of tables
Table 1 : Within subject design ............................................................................................................. 43
Table 2: Descriptive statistics of NEO-FFI scales for normative sample (N = 214) ............................. 45
Table 3 Descriptive statistics comparing "Young" vs. "Adult" respondents ......................................... 46
Table 4 Reliability statistics: Cronbach's alphas for Normative form ................................................... 47
Table 5 Reliability: Correlation of test and re-test results in Group D (N = 45) ................................... 49
Table 6 Reliability: t-test results for test and re-test total results in Group D (N = 45) ........................ 49
Table 7 Descriptive statistics of NEO-FFI scales for ipsative form (N = 168) ....................................... 51
Table 8 t-test: effect of gender on total scores ...................................................................................... 52
Table 9 Comparison of means of age groups in ipsative form (1 = young; 2 = adults) ........................ 53
Table 10 Pearson's correlations among scales of NEO-FFI for ipsative form (N = 168) .................... 53
Table 11 Reliability: Cronbach’s alphas for Ipsative form ................................................................... 54
Table 12 Cluster Analysis of Ipsative Form – Results .......................................................................... 56
Table 13 Reliability statistics: Comparison of test and re-test results in Group C (Ipsative Form) ...... 57
Table 14 Reliability statistics: t-test results of test and re-test in Group C (Ipsative Form) ................. 57
Table 15 Correlations between total scores of ipsative and normative form ....................................... 60
Table 16 Correlations and distances between ranks from test and re-test in each group ...................... 62
Table 17 Frequency table of Standard deviation and Range in Normative and Ipsative form .............. 64
Table 18 Frequency table of Standard deviation and Range in Normative and Ipsative form (adjusted
scores) .................................................................................................................................................... 65
Table 19: Inter-item correlations between ipsative and normative form ............................................... 87
Attachments
1. Factor Analysis Normative data - Rotated Component Matrix
Rotated Component Matrixa
Component
1 2 3 4 5
Question 21 0.78
Question 31 0.75
Question 26 0.73
Question 6 0.70
Question 11 0.68
Question 16 0.66
Question 51 0.66
Question 46 0.64 -0.38
Question 41 0.59
Question 56 0.56
Question 36 0.49
Question 33 0.47
Question 3 0.40
Question 1 0.39
Question 20 0.80
Question 25 0.71
Question 60 0.70
Question 55 -0.35 0.69
Question 10 0.69
Question 45 0.69
Question 35 0.64 0.30
Question 50 0.64
Question 40 0.64
Question 5 0.59
Question 15 0.58
Question 30 0.56
Question 22 0.75
Question 17 0.72
Question 12 -0.36 0.67
Question 37 -0.40 0.63
Question 2 0.61
Question 52 0.42 0.58
Question 32 0.57
Question 27 0.51
Question 7 0.50
Question 42 -0.46 0.49
Question 47 0.49
Question 8 0.36 0.42 0.37
Question 34 0.40 0.31
Question 28 0.36 0.35
Question 57 0.31
Question 19 0.67
Question 49 0.61
Question 4 0.56
Question 39 0.53
Question 14 0.51
Question 54 0.49
Question 24 0.37 0.46
Question 9 -0.31 0.46
Question 44 0.40
Question 59 0.38
Question 18 0.37
Question 38
Question 29
Question 58 0.67
Question 48 0.65
Question 43 0.64
Question 23 0.64
Question 13 0.62
Question 53 0.35 0.54
2. Factor Analysis Ipsative data - Rotated Component Matrix (item labels "otazkaN" are the original variable names; "otázka" means "question")
Rotated Component Matrixa
Component
1 2 3 4 5
otazka46 -0.81
otazka12 0.81
otazka42 0.79
otazka37 0.75
otazka26 -0.74
otazka6 -0.72 -0.31
otazka31 -0.72
otazka21 -0.70
otazka16 -0.69
otazka11 -0.67 -0.31
otazka51 -0.66 -0.37
otazka1 -0.66
otazka2 0.63 -0.31
otazka41 -0.61 -0.40
otazka32 0.61
otazka52 0.61 -0.31
otazka17 0.59 -0.32
otazka22 0.57 -0.33
otazka7 0.55
otazka57 0.52 -0.32
otazka27 0.50
otazka36 -0.50 -0.33
otazka56 -0.47 -0.31
otazka34 0.45 0.38
otazka25 0.70
otazka20 0.68
otazka30 0.67
otazka15 0.66
otazka55 0.66
otazka35 0.64
otazka10 0.61 -0.37
otazka5 0.61
otazka50 0.59
otazka45 0.58
otazka60 0.49
otazka40 0.49
otazka33 -0.35 -0.39
otazka13 -0.66
otazka23 -0.64
otazka58 -0.35 -0.63
otazka48 -0.62
otazka43 -0.60
otazka53 -0.51
otazka47 0.34 0.46
otazka8 -0.42
otazka3 -0.34 -0.41 -0.42
otazka38 -0.38 -0.37
otazka28 -0.33 -0.32
otazka49 0.65
otazka54 0.54
otazka9 0.50 0.34
otazka4 0.43 0.45
otazka59 0.42
otazka19 0.32 0.41
otazka24
otazka39 0.75
otazka14 0.61
otazka44 0.34
otazka18 -0.32
otazka29
a. Rotation converged in 11 iterations.
3. Example of statistics used for Item analysis
Case Processing Summary
Cases
Valid Missing Total
N Percent N Percent N Percent
Question5 *
Question5_2 93 91.2% 9 8.8% 102 100.0%
Question5 * Question5_2 Crosstabulation
Question5_2 Total
1.00 2.00 3.00 4.00 5.00
Question
5
0.0 Count 2 4 4 0 1 11
Expected Count 0.6 2.4 3.3 3.3 1.4 11.0
% within Question5 18.2% 36.4% 36.4% 0.0% 9.1% 100.0%
% within
Question5_2 40.0% 20.0% 14.3% 0.0% 8.3% 11.8%
% of Total 2.2% 4.3% 4.3% 0.0% 1.1% 11.8%
1.0 Count 2 13 7 2 0 24
Expected Count 1.3 5.3 7.2 7.2 3.1 24.0
% within Question5 8.3% 54.2% 29.2% 8.3% 0.0% 100.0%
% within
Question5_2 40.0% 65.0% 25.0% 7.1% 0.0% 25.8%
% of Total 2.2% 14.0% 7.5% 2.2% 0.0% 25.8%
2.0 Count 1 3 11 12 2 29
Expected Count 1.6 6.2 8.7 8.7 3.7 29.0
% within Question5 3.4% 10.3% 37.9% 41.4% 6.9% 100.0%
% within
Question5_2 20.0% 15.0% 39.3% 42.9% 16.7% 31.2%
% of Total 1.1% 3.2% 11.8% 12.9% 2.2% 31.2%
3.0 Count 0 0 4 11 5 20
Expected Count 1.1 4.3 6.0 6.0 2.6 20.0
% within Question5 0.0% 0.0% 20.0% 55.0% 25.0% 100.0%
% within Question5_2 0.0% 0.0% 14.3% 39.3% 41.7% 21.5%
% of Total 0.0% 0.0% 4.3% 11.8% 5.4% 21.5%
4.0 Count 0 0 2 3 3 8
Expected Count 0.4 1.7 2.4 2.4 1.0 8.0
% within Question5 0.0% 0.0% 25.0% 37.5% 37.5% 100.0%
% within
Question5_2 0.0% 0.0% 7.1% 10.7% 25.0% 8.6%
% of Total 0.0% 0.0% 2.2% 3.2% 3.2% 8.6%
5.0 Count 0 0 0 0 1 1
Expected Count 0.1 0.2 0.3 0.3 0.1 1.0
% within Question5 0.0% 0.0% 0.0% 0.0% 100.0% 100.0%
% within
Question5_2 0.0% 0.0% 0.0% 0.0% 8.3% 1.1%
% of Total 0.0% 0.0% 0.0% 0.0% 1.1% 1.1%
Total Count 5 20 28 28 12 93
Expected Count 5.0 20.0 28.0 28.0 12.0 93.0
% within Question5 5.4% 21.5% 30.1% 30.1% 12.9% 100.0%
% within
Question5_2 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
% of Total 5.4% 21.5% 30.1% 30.1% 12.9% 100.0%
Directional Measures
Value
Asymp. Std.
Errora Approx. Tb Approx. Sig.
Ordinal by Ordinal Somers' d Symmetric 0.55 0.06 8.82 0.00
Question5 Dependent 0.56 0.06 8.82 0.00
Question5_2
Dependent 0.55 0.06 8.82 0.00
Nominal by
Interval
Eta Question5 Dependent 0.63
Question5_2
Dependent 0.66
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
Symmetric Measures
Value
Asymp. Std.
Errora Approx. Tb Approx. Sig.
Nominal by Nominal Phi 0.78 0.00
Cramer's V 0.39 0.00
Contingency Coefficient 0.62 0.00
Ordinal by Ordinal Kendall's tau-b 0.53 0.06 8.82 0.00
Gamma 0.69 0.07 8.82 0.00
Spearman Correlation 0.64 0.07 8.02 0.00c
Interval by Interval Pearson's R 0.62 0.07 7.48 0.00c
N of Valid Cases 93
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.
4. Inter-item correlation table
Item Spearman Kendall Tau Item Spearman Kendall Tau
Question46 0.71 0.65 Question53 0.50 0.43
Question5 0.64 0.55 Question40 0.49 0.44
Question25 0.64 0.54 Question42 0.48 0.43
Question23 0.63 0.55 Question57 0.47 0.39
Question43 0.63 0.55 Question27 0.47 0.41
Question21 0.61 0.51 Question13 0.47 0.39
Question20 0.60 0.52 Question6 0.47 0.40
Question12 0.60 0.53 Question49 0.45 0.40
Question11 0.59 0.49 Question15 0.42 0.36
Question16 0.59 0.50 Question55 0.41 0.35
Question48 0.59 0.50 Question45 0.40 0.33
Question7 0.59 0.51 Question60 0.40 0.34
Question17 0.59 0.49 Question10 0.40 0.34
Question2 0.58 0.51 Question4 0.35 0.31
Question47 0.57 0.48 Question9 0.34 0.28
Question26 0.57 0.48 Question36 0.33 0.28
Question31 0.57 0.49 Question19 0.33 0.29
Question3 0.55 0.47 Question44 0.32 0.27
Question28 0.55 0.46 Question54 0.32 0.27
Question41 0.55 0.46 Question33 0.32 0.27
Question30 0.55 0.46 Question14 0.31 0.28
Question56 0.55 0.46 Question24 0.31 0.26
Question51 0.54 0.46 Question8 0.29 0.26
Question58 0.54 0.45 Question38 0.26 0.22
Question52 0.53 0.45 Question59 0.25 0.22
Question37 0.52 0.45 Question39 0.24 0.21
Question22 0.52 0.45 Question18 0.22 0.19
Question34 0.52 0.46 Question29 0.14 0.11
Question35 0.52 0.43 Question1 -0.37 -0.31
Question32 0.51 0.44
Question50 0.50 0.43 Average 0.57 0.49
Table 19: Inter-item correlations between ipsative and normative form
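Table 19 reports Spearman and Kendall coefficients for corresponding items of the two forms. For completeness, a minimal sketch of Kendall's tau-b (the tie-corrected variant) in pure Python, using invented ratings rather than the thesis data:

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)),
    where C/D count concordant/discordant pairs and Tx/Ty count pairs
    tied on only one of the two variables."""
    c = d = tx = ty = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied on both variables: ignored
            elif dx == 0:
                tx += 1
            elif dy == 0:
                ty += 1
            elif dx * dy > 0:
                c += 1
            else:
                d += 1
    return (c - d) / sqrt((c + d + tx) * (c + d + ty))

# Illustrative 1-5 ratings of the same item in the two forms:
print(round(kendall_tau_b([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]), 2))  # → 0.6
```

The numerator counts concordant minus discordant pairs; the tie correction in the denominator keeps the coefficient within [-1, 1] when either variable contains repeated ratings.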