Similarity testing in sensory and consumer research


Food Quality and Preference 16 (2005) 139–149

www.elsevier.com/locate/foodqual


Jian Bi

Sensometrics Research and Service, 9212 Groomfield Road, Richmond, VA 23236, USA

Available online 5 May 2004

Abstract

Similarity testing is an important methodology in sensory and consumer research. It is of practical and theoretical significance. In many comparative experiments in the sensory and consumer field, the objective is not to demonstrate difference but to demonstrate equivalence or similarity between treatments. It is widely acknowledged that the conventional hypothesis test with the null hypothesis of “no difference” between two treatments is appropriate for difference testing but inappropriate for similarity testing. Schuirmann [Biometrics 37 (1981) 617] introduced the use of interval hypotheses testing in the context of bioequivalence. In interval hypotheses testing, the null hypothesis is that the difference between two treatments is larger than a specified non-zero value defining equivalence or similarity, and the alternative hypothesis is that the difference is smaller than the specified value. If the null hypothesis is rejected, equivalence or similarity is concluded. Based on this concept and approach, the present paper discusses some statistical procedures of similarity testing used for different situations in sensory and consumer research. The procedures include similarity testing for hedonic or intensity rating means, similarity testing using conventional discrimination methods (forced choice methods and the methods with response bias) and similarity testing for preference proportions.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Similarity testing; Interval hypotheses testing; Anderson and Hauck’s non-central t test; Dunnett and Gent’s chi-squared test; Schuirmann’s two one-sided tests

1. Introduction

1.1. Similarity evaluation

There are many situations in sensory and consumer research where the objective is not to demonstrate difference but to demonstrate similarity between treatments. For example, a manufacturer may replace a chemical with another substance, hoping that the finished product will maintain the same perceived intensity of certain sensory characteristics.

Proof of exact equality is theoretically impossible. Similarity testing merely demonstrates statistically that the difference between two products being compared is smaller than the allowed difference in intensity or preference. Similarity evaluation is of practical and theoretical importance. It has wide applications not only in the sensory and consumer field (Gacula, 1991; Meilgaard, Civille, & Carr, 1991) but also in many other fields, e.g., clinical and pharmaceutical fields (Metzler & Haung, 1983).


0950-3293/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

doi:10.1016/j.foodqual.2004.03.003

It is widely acknowledged that the conventional hypothesis testing used for difference testing is inappropriate in the context of similarity (see, e.g., Anderson & Hauck, 1983; Berger & Hsu, 1996; Blackwelder, 1982; Dunnett & Gent, 1977; Schuirmann, 1987; Westlake, 1972, 1979). The basic difficulty is that, according to the logic of statistical hypothesis testing, the null hypothesis of no difference can never be proved or established; it can only be disproved (see, e.g., Blackwelder, 1982). Absence of evidence of a difference is not evidence of similarity (Altman & Bland, 1995). Lack of significance may merely be the result of an inadequate sample size, while a trivial difference may be statistically significant with a large sample size. In order to conduct a similarity test, some new ways of thinking about statistical hypothesis testing and some new statistical models are needed.

The problem of similarity evaluation has been discussed extensively since it was introduced to applied statistics by Westlake (1972). Considerable efforts have been made in developing the methodology of similarity evaluation in many fields, especially in clinical and pharmaceutical fields. Many methods have been developed and proposed (see, e.g., Chow & Lui, 1992; Metzler & Haung, 1983, for an overview). The methods include the confidence interval approach (see, e.g., Carr, 1995; MacRae, 1995; Westlake, 1972, 1976), the interval hypotheses testing (see, e.g., Anderson & Hauck, 1983; Patel & Gupta, 1984; Rocke, 1984; Schuirmann, 1981, 1987), the Bayesian approach (see, e.g., Mandallaz & Mau, 1981; Rodda & Davis, 1980; Selwyn, Dempster, & Hall, 1981) and non-parametric methods (see, e.g., Hauschke, Steinijans, & Diletti, 1990; Rashid, 2003; Steinijans & Diletti, 1983). For the connection between similarity tests and confidence intervals, see, e.g., O’Quigley and Baudoin (1988), Hsu, Hwang, Lui, and Ruberg (1994) and Berger and Hsu (1996).

This article concentrates on interval hypotheses testing for assessing similarity. The objective of the article is to discuss similarity testing in four specific application areas in sensory and consumer research: hedonic or intensity ratings, forced choice methods, A–Not A/Same–Different methods and consumer preference proportions. The article is motivated by the fact that some inappropriate procedures, e.g., the so-called “power approach”, which has been abandoned in some fields (Senn, 1997, p. 319–320), are still widely used in the sensory and consumer field, while some valid concepts and procedures, e.g., the interval hypotheses testing, which has been extensively discussed and well developed in clinical and pharmaceutical fields, are not well known in the sensory and consumer field.

1.2. “Power approach”

In the sensory and consumer field, the widely used method for similarity testing is the so-called “power approach” (the “power” here refers to difference testing power rather than similarity testing power). Using the power approach, a small Type II error ($\beta$), i.e., a large power ($1 - \beta$), is selected for a specified allowed difference, $\Delta_0$. A sample size is then determined to ensure a large power to detect the difference. If the null hypothesis of no difference is not rejected, similarity is concluded. This approach is based on the logic that if a difference is larger than the specified allowed difference, the difference should likely be detected and the null hypothesis of no difference should likely be rejected. On the other hand, if a difference is smaller than the specified allowed difference, the null hypothesis should likely not be rejected.

At one time, the “power approach” was a standard method in bioequivalence testing. However, because of its unsuitability, the approach was eventually abandoned in the evaluation requirements of the US Food and Drug Administration (FDA, 1992, Chap. 320). Some authors, e.g., Schuirmann (1987), have shown in a detailed examination that the power approach is quite inadequate for similarity testing. One reason is that this approach contorts the logic of hypothesis testing. According to that logic, we cannot prove and accept the null hypothesis in any situation. A further weakness of the method is that, with a large sample size and a small measurement error, a conclusion of similarity is unlikely even when the difference is slight and the products are effectively equivalent. For example, in a similarity test using the triangle method, if $\alpha = 0.1$ and $\beta = 0.05$ are selected for a specified proportion of discriminators, $p_d = 0.3$ (i.e., the proportion of correct responses, $p_c = 0.533$), the required sample size is 54 (see ASTM E1885-97). Assuming that the true proportion of discriminators is $p_d = 0.1$ (i.e., $p_c = 0.4$), a reasonable similarity test should confirm similarity with high probability. However, a simulation experiment using 5000 sets of binomial data with sample size $n = 54$ and parameter $p = 0.4$ shows that this similarity test has only a 0.49 chance of confirming similarity. If the sample size is 540 instead of 54, the same testing procedure has only a 0.02 chance of reaching a conclusion of similarity. The simulation results show that the larger the sample size, the smaller the probability of concluding similarity, even though the true difference is small. Hence the power approach to similarity testing is problematic.
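The sample-size effect described above can be illustrated with an exact binomial calculation in place of a simulation. The following is a minimal sketch, not the author's procedure: it assumes the difference test rejects "no difference" at the smallest count whose upper-tail probability under guessing ($p = 1/3$) does not exceed $\alpha$, which may differ from the ASTM table used in the text, so the printed values need not match the simulated 0.49 and 0.02. The function names are hypothetical.

```python
from math import exp, lgamma, log


def binom_cdf(c: int, n: int, p: float) -> float:
    """P(X <= c) for X ~ Binomial(n, p), with each term built on the log scale."""
    total = 0.0
    for x in range(0, c + 1):
        log_term = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                    + x * log(p) + (n - x) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)


def difference_test_critical(n: int, p_guess: float, alpha: float) -> int:
    """Smallest count c with P(X >= c | guessing only) <= alpha (assumed rule)."""
    for c in range(n + 1):
        if 1.0 - binom_cdf(c - 1, n, p_guess) <= alpha:
            return c
    return n + 1  # no count rejects "no difference"


def prob_declared_similar(n: int, p_true: float,
                          p_guess: float = 1.0 / 3.0,
                          alpha: float = 0.1) -> float:
    """Chance that the difference test is not significant, which the
    power approach reads as a conclusion of similarity."""
    c = difference_test_critical(n, p_guess, alpha)
    return binom_cdf(c - 1, n, p_true)


# Triangle method, true proportion of correct responses 0.4 (p_d = 0.1).
p_small = prob_declared_similar(54, 0.4)
p_large = prob_declared_similar(540, 0.4)
print(f"n = 54:  {p_small:.3f}")
print(f"n = 540: {p_large:.3f}")
```

Under these assumptions the probability of declaring similarity drops sharply when the sample size grows tenfold, which is the qualitative pattern the simulation in the text reports.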

1.3. Interval hypotheses testing and confidence interval

There is a consensus in the statistical literature that interval hypotheses testing is a suitable approach for similarity testing (see, e.g., Anderson & Hauck, 1983; Barker, Rolka, Rolka, & Brown, 2001; Chow & Lui, 1992; Schuirmann, 1981, 1984; Rocke, 1984; Schuirmann, 1987). In this approach the null hypothesis is that the difference between two treatments is equal to or larger than a specified allowed difference, and the alternative hypothesis is that the difference is smaller than the specified value, i.e.,

$$H_0: |\mu_1 - \mu_2| \ge \Delta_0 \qquad H_1: -\Delta_0 < \mu_1 - \mu_2 < \Delta_0 \tag{1.1}$$

If the null hypothesis is rejected, the alternative hypothesis (i.e., similarity between the two products being compared) can be concluded. In interval hypotheses testing, the Type I error, $\alpha$, is the probability of concluding similarity when the two treatments are in fact different. Usually $\alpha = 0.05$ or 0.1 is selected. The Type II error, $\beta$, is the probability of failing to conclude similarity when the two treatments are similar. The power of a similarity test, $1 - \beta$, is then the probability of correctly rejecting the null hypothesis of difference and accepting the alternative hypothesis of similarity when the two treatments are similar. Usually $\beta = 0.1$ or 0.2, i.e., $1 - \beta = 0.9$ or 0.8, is selected.

The confidence interval approach can also be used for similarity evaluation. If the confidence interval of $\mu_1 - \mu_2$ is within $(-\Delta_0, \Delta_0)$, similarity can be concluded. However, for a size-$\alpha$ similarity test, the conventional confidence interval has a $100(1 - 2\alpha)\%$ rather than a $100(1 - \alpha)\%$ confidence level. Some authors, e.g., Berger and Hsu (1996), pointed out that the misconception that size-$\alpha$ similarity tests generally correspond to $100(1 - 2\alpha)\%$ confidence intervals leads to incorrect statistical practices, and concluded that the use of $100(1 - 2\alpha)\%$ confidence intervals to evaluate similarity should be abandoned. Techniques for constructing $100(1 - \alpha)\%$ confidence intervals that correspond to size-$\alpha$ similarity tests were developed by Hsu et al. (1994) and Berger and Hsu (1996).

2. Similarity testing for hedonic or intensity rating means

2.1. Anderson and Hauck’s non-central t test

It is widely accepted in the sensory and consumer field that hedonic or intensity rating data on the 9-point scale can be regarded approximately as continuous data. Similarity testing for two hedonic or intensity rating means is often needed. The null and alternative hypotheses for the test are as in (1.1).

The procedure discussed in this section is the one proposed by Anderson and Hauck (1983) and Hauck and Anderson (1984). This procedure can be used to evaluate the null hypothesis in (1.1) directly. In other words, for a small p-value we reject the null hypothesis of difference between two hedonic or intensity rating means in favor of similarity.

For simplicity, it is assumed here that a completely randomized design with $n$ subjects per group is used. Extensions to a crossover design and other designs are straightforward. The test statistic is $T_{AH}$ in (2.1), which under the null hypothesis follows a non-central t distribution with the non-centrality parameter in (2.2) and $\nu$ degrees of freedom.

$$T_{AH} = \frac{\bar{X}_1 - \bar{X}_2}{s\sqrt{2/n}} \tag{2.1}$$

$$\delta = \frac{\mu_1 - \mu_2}{\sigma\sqrt{2/n}} \tag{2.2}$$

The non-centrality parameter $\delta$ can be estimated by (2.3).

$$\hat{\delta} = \frac{\Delta_0}{s\sqrt{2/n}} \tag{2.3}$$

where $\Delta_0$ is the specified maximum allowable difference between two true rating means for similarity, and $s^2$ is the error variance calculated from the appropriate analysis of variance with $\nu$ degrees of freedom. For a completely randomized design, $s^2$ is the common variance of the two rating samples $X_1$ and $X_2$, $s^2 = (s_1^2 + s_2^2)/2$, with $\nu = 2(n - 1)$ degrees of freedom, where $s_1^2$ and $s_2^2$ are the variances of $X_1$ and $X_2$. For other designs, e.g., for a crossover design, $\nu = n - 2$.

The p-value is calculated from (2.4).

$$p = F_\nu(|t_{AH}| - \hat{\delta}) - F_\nu(-|t_{AH}| - \hat{\delta}) \tag{2.4}$$

where $F_\nu(\cdot)$ denotes the distribution function of the central t distribution with $\nu$ degrees of freedom and $t_{AH}$ denotes the observed value of the test statistic $T_{AH}$. The null hypothesis is rejected in favor of similarity if the observed p-value is less than the significance level $\alpha$.

Example 2.1. In order to determine whether consumers in two cities (A and B) have similar overall liking for a product, 100 panelists were selected in each of the two cities and a 9-point liking scale was used, with 9 = “Like extremely” and 1 = “Dislike extremely”. The similarity limit $\Delta_0 = 0.5$ and significance level $\alpha = 0.1$ were selected. The observed overall liking means and their variances for the two cities are $\bar{X}_A = 7.1$, $s_A^2 = 2.0$; $\bar{X}_B = 6.9$, $s_B^2 = 2.2$. Hence the estimate of the common variance of the two cities is $s^2 = (2.0 + 2.2)/2 = 2.1$, i.e., $s = 1.45$.

The observed value of the test statistic is

$$t_{AH} = \frac{7.1 - 6.9}{1.45 \times \sqrt{2/100}} = 0.976$$

The estimated non-centrality parameter is

$$\hat{\delta} = \frac{0.5}{1.45 \times \sqrt{2/100}} = 2.44$$

The calculated p-value is then

$$p = F_\nu(0.976 - 2.44) - F_\nu(-0.976 - 2.44) = F_\nu(-1.464) - F_\nu(-3.416) = 0.072$$

where $F_\nu(-1.464) = 0.0724$ is the probability of the central t distribution with $\nu = 2 \times (100 - 1) = 198$ degrees of freedom from $-\infty$ to $-1.464$, and $F_\nu(-3.416) = 0.0004$ is the probability of the central t distribution with 198 degrees of freedom from $-\infty$ to $-3.416$. Because the p-value (0.072) is smaller than $\alpha = 0.1$, we can conclude at the 0.1 significance level that the overall liking for the product is similar between the two cities in terms of the similarity limit $\Delta_0 = 0.5$.
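The calculation in Example 2.1 can be sketched in a few lines of code. This is a minimal illustration that follows (2.1), (2.3) and (2.4) for a completely randomized design; it is not the author's S-PLUS program. The central t distribution function is approximated here with Simpson's rule so that only the standard library is needed, and the helper names `t_pdf`, `t_cdf` and `anderson_hauck_p` are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t: float, df: float) -> float:
    """Density of the central t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1.0 + t * t / df))


def t_cdf(x: float, df: float, steps: int = 4000) -> float:
    """Central t distribution function: Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    return 0.5 + area if x > 0 else 0.5 - area


def anderson_hauck_p(mean1: float, mean2: float, s2: float,
                     n: int, delta0: float) -> float:
    """p-value from (2.4) for two groups of n subjects each."""
    df = 2 * (n - 1)
    se = sqrt(s2) * sqrt(2.0 / n)
    t_ah = (mean1 - mean2) / se   # observed statistic, (2.1)
    nc = delta0 / se              # estimated non-centrality, (2.3)
    return t_cdf(abs(t_ah) - nc, df) - t_cdf(-abs(t_ah) - nc, df)


# Data of Example 2.1: means 7.1 and 6.9, pooled variance 2.1, n = 100.
p_value = anderson_hauck_p(7.1, 6.9, 2.1, 100, 0.5)
print(f"p = {p_value:.4f}")
```

With the data of Example 2.1 the result should land close to the 0.072 reported there; small differences can arise because the example rounds $s$ to 1.45 before computing.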

2.2. Testing power and sample size

The testing power ($1 - \beta$), or Type II error, $\beta$, for Anderson and Hauck’s non-central t test is a complicated function of $\alpha$, $n$, $\Delta_0$, $s^2$ and the true difference $\Delta_1$. For simplicity, we consider the power only for $\Delta_1 = 0$, i.e., the power when the true means of the two products being compared are the same. The Type II error, $\beta$, can be solved numerically from (2.5).

$$F_\nu(C - \hat{\delta}) - F_\nu(-C - \hat{\delta}) - \alpha = 0 \tag{2.5}$$

where $C$ is the $1 - \beta/2$ percentage point of the central t distribution with $\nu$ degrees of freedom, and $\hat{\delta}^2 = \dfrac{n\Delta_0^2}{2s^2}$.

The sample size, $n$, needed for specified $\alpha$, $\beta$, $\Delta_0$, $s^2$ and $\Delta_1 = 0$ can also be estimated from (2.5). A computer program is needed to estimate testing power and sample size. An S-PLUS program is available from the author on request.

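Table 1
Maximum number of correct responses for similarity testing using the 2-AFC and Duo-trio methods. Columns 2 to 6 are for $\alpha = 0.05$ and columns 7 to 11 are for $\alpha = 0.1$; the column headings give $p_{d0}$.

| n | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 6 | 0 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 |
| 7 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 3 |
| 8 | 1 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 3 | 3 |
| 9 | 2 | 2 | 2 | 3 | 4 | 2 | 3 | 3 | 4 | 4 |
| 10 | 2 | 2 | 3 | 4 | 4 | 2 | 3 | 4 | 4 | 5 |
| 11 | 2 | 3 | 3 | 4 | 5 | 3 | 4 | 4 | 5 | 5 |
| 12 | 3 | 3 | 4 | 5 | 5 | 3 | 4 | 5 | 5 | 6 |
| 13 | 3 | 4 | 5 | 5 | 6 | 4 | 5 | 5 | 6 | 7 |
| 14 | 4 | 4 | 5 | 6 | 7 | 4 | 5 | 6 | 7 | 7 |
| 15 | 4 | 5 | 6 | 6 | 7 | 5 | 6 | 6 | 7 | 8 |
| 16 | 5 | 5 | 6 | 7 | 8 | 5 | 6 | 7 | 8 | 9 |
| 17 | 5 | 6 | 7 | 8 | 9 | 6 | 7 | 8 | 8 | 9 |
| 18 | 5 | 6 | 7 | 8 | 9 | 6 | 7 | 8 | 9 | 10 |
| 19 | 6 | 7 | 8 | 9 | 10 | 7 | 8 | 9 | 10 | 11 |
| 20 | 6 | 7 | 8 | 10 | 11 | 7 | 8 | 9 | 10 | 11 |
| 21 | 7 | 8 | 9 | 10 | 11 | 8 | 9 | 10 | 11 | 12 |
| 22 | 7 | 8 | 10 | 11 | 12 | 8 | 9 | 10 | 12 | 13 |
| 23 | 8 | 9 | 10 | 11 | 13 | 9 | 10 | 11 | 12 | 14 |
| 24 | 8 | 9 | 11 | 12 | 13 | 9 | 10 | 12 | 13 | 14 |
| 25 | 9 | 10 | 11 | 13 | 14 | 10 | 11 | 12 | 14 | 15 |
| 26 | 9 | 10 | 12 | 13 | 15 | 10 | 11 | 13 | 14 | 16 |
| 27 | 10 | 11 | 12 | 14 | 15 | 11 | 12 | 13 | 15 | 16 |
| 28 | 10 | 12 | 13 | 15 | 16 | 11 | 12 | 14 | 15 | 17 |
| 29 | 11 | 12 | 14 | 15 | 17 | 12 | 13 | 15 | 16 | 18 |
| 30 | 11 | 13 | 14 | 16 | 17 | 12 | 14 | 15 | 17 | 18 |
| 35 | 13 | 15 | 17 | 19 | 21 | 14 | 16 | 18 | 20 | 22 |
| 40 | 16 | 18 | 20 | 22 | 24 | 17 | 19 | 21 | 23 | 25 |
| 45 | 18 | 21 | 23 | 25 | 28 | 19 | 22 | 24 | 27 | 29 |
| 50 | 21 | 23 | 26 | 29 | 31 | 22 | 25 | 27 | 30 | 33 |
| 60 | 26 | 29 | 32 | 35 | 38 | 27 | 30 | 33 | 36 | 40 |
| 70 | 31 | 34 | 38 | 42 | 45 | 32 | 36 | 39 | 43 | 47 |
| 80 | 36 | 40 | 44 | 48 | 53 | 37 | 41 | 46 | 50 | 54 |
| 90 | 41 | 45 | 50 | 55 | 60 | 42 | 47 | 52 | 56 | 61 |
| 100 | 46 | 51 | 56 | 61 | 67 | 48 | 53 | 58 | 63 | 68 |

Example 2.2. For Example 2.1, for $\alpha = 0.1$, $\Delta_0 = 0.5$, $s^2 = 1.45$, $n = 100$ (hence $\nu = 2 \times 100 - 2 = 198$ for a completely randomized design), using an S-PLUS program based on (2.5), we find the testing power is about 0.9.

We can verify the result as follows: $C = t_{\nu, 1-\beta/2} = t_{198, 0.95} = 1.653$; $\hat{\delta} = \sqrt{\dfrac{100 \times 0.5^2}{2 \times 1.45}} = 2.936$. According to (2.5), $F_{198}(1.653 - 2.936) - F_{198}(-1.653 - 2.936) - 0.1 = F_{198}(-1.283) - F_{198}(-4.589) - 0.1 \approx 0$. It means that for the specified $\alpha$, $\Delta_0$, $s^2$ and $n$ values, $\beta = 0.1$ (i.e., $1 - \beta = 0.9$) is an approximate solution of (2.5).

For $\alpha = 0.1$, $\Delta_0 = 0.5$, $s^2 = 1.45$, the sample size needed to reach 0.8 of testing power should be about $n = 77$ for a completely randomized design, using an S-PLUS program based on (2.5).

We can also verify the result as follows: $C = t_{\nu, 1-\beta/2} = t_{152, 0.9} = 1.287$; $\hat{\delta} = \sqrt{\dfrac{77 \times 0.5^2}{2 \times 1.45}} = 2.576$. According to (2.5), $F_{152}(1.287 - 2.576) - F_{152}(-1.287 - 2.576) - 0.1 = F_{152}(-1.289) - F_{152}(-3.863) - 0.1 \approx 0$. It means that for the specified $\alpha$, $\Delta_0$, $s^2$ and $\beta$ values, $n = 77$ is an approximate solution of (2.5).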
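The verification steps of Example 2.2 can also be checked numerically. The sketch below evaluates the left-hand side of (2.5) for the two settings quoted in that example, using the values stated there ($s^2 = 1.45$) and $\nu = 2(n - 1)$ for a completely randomized design. It is a hedged illustration rather than the author's S-PLUS program: the t distribution function uses Simpson's rule, the quantile is found by bisection, and all function names are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t: float, df: float) -> float:
    """Density of the central t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1.0 + t * t / df))


def t_cdf(x: float, df: float, steps: int = 2000) -> float:
    """Central t distribution function: Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(q: float, df: float) -> float:
    """Quantile of the central t distribution by bisection on t_cdf."""
    lo, hi = -40.0, 40.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equation_2_5_residual(alpha: float, beta: float, delta0: float,
                          s2: float, n: int) -> float:
    """Left-hand side of (2.5); a value near zero means the inputs are consistent."""
    df = 2 * (n - 1)
    c = t_quantile(1.0 - beta / 2.0, df)
    nc = sqrt(n * delta0 ** 2 / (2.0 * s2))
    return t_cdf(c - nc, df) - t_cdf(-c - nc, df) - alpha


# Power check: n = 100 with beta = 0.1; sample-size check: n = 77 with beta = 0.2.
res_power = equation_2_5_residual(0.1, 0.1, 0.5, 1.45, 100)
res_size = equation_2_5_residual(0.1, 0.2, 0.5, 1.45, 77)
print(f"residual for n = 100, beta = 0.1: {res_power:+.4f}")
print(f"residual for n = 77,  beta = 0.2: {res_size:+.4f}")
```

To solve for $\beta$ or $n$ rather than only check a candidate, the same residual function could be passed to a root finder; a simple scan over $n$ or bisection over $\beta$ would be one option.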

3. Similarity testing using forced choice methods

3.1. One-sided interval hypotheses testing

The forced choice methods conventionally used for discrimination testing can also be used for similarity testing, but a different statistical testing model is needed. First, we need to specify an allowed or ignorable difference in terms of the proportion or probability of “discriminators”, $p_{d0}$. The probability of correct responses, $p_{c0}$, corresponding to $p_{d0}$ is then calculated as $p_{c0} = p_{d0} + p_0(1 - p_{d0})$, where $p_0$ is a guessing probability: $p_0 = 1/2$ for the 2-AFC and Duo-trio tests and $p_0 = 1/3$ for the 3-AFC and Triangular tests.

The null and alternative hypotheses of the similarity test are

$$H_0: p_c \ge p_{c0} \qquad H_1: p_0 \le p_c < p_{c0}$$

This is a one-sided test. The test statistic $c$ is the number of correct responses in a similarity test with sample size $n$. The critical number $c_0$ is the maximum value that satisfies (3.1).

$$\sum_{x=0}^{c_0} \binom{n}{x} p_{c0}^{x} (1 - p_{c0})^{n-x} < \alpha \tag{3.1}$$

If the observed number of correct responses $c$ is smaller than or equal to the critical number $c_0$, the null hypothesis is rejected and the alternative hypothesis is accepted at significance level $\alpha$. It means that similarity can be concluded.

Tables 1 and 2 give the critical numbers for similarity testing using the 2-AFC, Duo-trio, 3-AFC and Triangular methods for $\alpha = 0.05$ and 0.1, $p_{d0} = 0.1$ to 0.5 with a step of 0.1, and for sample size $n = 5$ to 100.

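Table 2
Maximum number of correct responses for similarity testing using the 3-AFC and Triangular methods. Columns 2 to 6 are for $\alpha = 0.05$ and columns 7 to 11 are for $\alpha = 0.1$; the column headings give $p_{d0}$.

| n | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 6 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| 7 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 2 | 2 |
| 8 | 0 | 0 | 1 | 2 | 2 | 0 | 1 | 1 | 2 | 3 |
| 9 | 0 | 1 | 1 | 2 | 3 | 1 | 1 | 2 | 3 | 3 |
| 10 | 1 | 1 | 2 | 2 | 3 | 1 | 2 | 2 | 3 | 4 |
| 11 | 1 | 1 | 2 | 3 | 4 | 1 | 2 | 3 | 4 | 4 |
| 12 | 1 | 2 | 3 | 3 | 4 | 2 | 2 | 3 | 4 | 5 |
| 13 | 1 | 2 | 3 | 4 | 5 | 2 | 3 | 4 | 5 | 5 |
| 14 | 2 | 3 | 3 | 4 | 5 | 2 | 3 | 4 | 5 | 6 |
| 15 | 2 | 3 | 4 | 5 | 6 | 3 | 4 | 5 | 6 | 7 |
| 16 | 2 | 3 | 4 | 5 | 7 | 3 | 4 | 5 | 6 | 7 |
| 17 | 3 | 4 | 5 | 6 | 7 | 3 | 4 | 5 | 7 | 8 |
| 18 | 3 | 4 | 5 | 6 | 8 | 4 | 5 | 6 | 7 | 8 |
| 19 | 3 | 4 | 6 | 7 | 8 | 4 | 5 | 6 | 8 | 9 |
| 20 | 3 | 5 | 6 | 7 | 9 | 4 | 5 | 7 | 8 | 10 |
| 21 | 4 | 5 | 6 | 8 | 9 | 5 | 6 | 7 | 9 | 10 |
| 22 | 4 | 5 | 7 | 8 | 10 | 5 | 6 | 8 | 9 | 11 |
| 23 | 4 | 6 | 7 | 9 | 11 | 5 | 7 | 8 | 10 | 11 |
| 24 | 5 | 6 | 8 | 9 | 11 | 6 | 7 | 9 | 10 | 12 |
| 25 | 5 | 7 | 8 | 10 | 12 | 6 | 7 | 9 | 11 | 13 |
| 26 | 5 | 7 | 9 | 10 | 12 | 6 | 8 | 10 | 11 | 13 |
| 27 | 6 | 7 | 9 | 11 | 13 | 7 | 8 | 10 | 12 | 14 |
| 28 | 6 | 8 | 10 | 12 | 13 | 7 | 9 | 11 | 12 | 14 |
| 29 | 6 | 8 | 10 | 12 | 14 | 7 | 9 | 11 | 13 | 15 |
| 30 | 7 | 9 | 11 | 13 | 15 | 8 | 10 | 11 | 14 | 16 |
| 35 | 8 | 11 | 13 | 15 | 18 | 9 | 12 | 14 | 16 | 19 |
| 40 | 10 | 13 | 15 | 18 | 21 | 11 | 14 | 16 | 19 | 22 |
| 45 | 12 | 15 | 17 | 21 | 24 | 13 | 16 | 19 | 22 | 25 |
| 50 | 13 | 17 | 20 | 23 | 27 | 15 | 18 | 21 | 25 | 28 |
| 60 | 17 | 21 | 25 | 29 | 33 | 18 | 22 | 26 | 30 | 34 |
| 70 | 20 | 25 | 29 | 34 | 39 | 22 | 26 | 31 | 36 | 41 |
| 80 | 24 | 29 | 34 | 40 | 45 | 25 | 31 | 36 | 41 | 47 |
| 90 | 27 | 33 | 39 | 45 | 52 | 29 | 35 | 41 | 47 | 53 |
| 100 | 31 | 37 | 44 | 51 | 58 | 33 | 39 | 46 | 53 | 60 |

Example 3.1. There are 100 panelists in a similarity test using the 3-AFC method for the sweetness of two product brands. The allowed proportion of “discriminators” for the method is selected as $p_{d0} = 0.2$ and the significance level is $\alpha = 0.05$. The observed number of correct responses in the test is 35.

We can find from Table 2 that $c_0 = 37$, which according to (3.1) is the maximum value for which $\sum_{x=0}^{37} \binom{100}{x} 0.4667^{x} (1 - 0.4667)^{100-x} < 0.05$, where $p_{c0} = 0.2 + 1/3 \times (1 - 0.2) = 0.4667$. Because the observed number of correct responses (35) is smaller than the critical value (37), we can conclude that the two brands of product are similar in sweetness. In other words, we can claim that there is no detectable difference between the two brands in sweetness at significance level $\alpha = 0.05$ in terms of $p_{d0} = 0.2$.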
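The search for the critical number in (3.1) is easy to sketch in code. The following minimal example accumulates the binomial distribution function under $p_{c0}$ and keeps the largest count whose cumulative probability stays below $\alpha$. It is an illustration under the definitions in the text, not the program used to build Tables 1 and 2, and the function names are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p), computed on the log scale."""
    log_coef = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return exp(log_coef + x * log(p) + (n - x) * log(1.0 - p))


def critical_number(n: int, pd0: float, p_guess: float, alpha: float) -> int:
    """Largest c0 with sum_{x<=c0} C(n,x) pc0^x (1-pc0)^(n-x) < alpha, as in (3.1).

    Returns -1 when even zero correct responses would not allow a
    conclusion of similarity.
    """
    pc0 = pd0 + p_guess * (1.0 - pd0)
    c0 = -1
    cumulative = 0.0
    for x in range(n + 1):
        cumulative += binom_pmf(x, n, pc0)
        if cumulative >= alpha:
            break
        c0 = x
    return c0


# Setting of Example 3.1: 3-AFC, n = 100, pd0 = 0.2, alpha = 0.05.
c0_example = critical_number(100, 0.2, 1.0 / 3.0, 0.05)
print("critical number for Example 3.1:", c0_example)
print("similar" if 35 <= c0_example else "similarity not shown")
```

Table 2 lists 37 for the setting of Example 3.1; since the cumulative probability at the next count is very close to 0.05 there, a recomputation may be sensitive to rounding of $p_{c0}$.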

3.2. Testing power and sample size

The power of the similarity test is the probability of concluding similarity when the true proportion of “discriminators” is smaller than the specified allowed or ignorable proportion, i.e., $p_{d1} < p_{d0}$; in other words, when the corresponding true probability of correct responses $p_{c1}$ is smaller than $p_{c0}$. The probability is given by (3.2).

$$\text{Power} = 1 - \beta = \sum_{x=0}^{c_0} \binom{n}{x} p_{c1}^{x} (1 - p_{c1})^{n-x} \tag{3.2}$$

where $p_{c1} = p_{d1} + p_0(1 - p_{d1})$.

Testing power depends on $\alpha$, $p_{d0}$, $p_{d1}$, $n$ and $p_0$. For a specified forced choice method, the larger the values of $\alpha$, $p_{d0}$ and $n$, the larger the testing power. The larger the $p_{d1}$, the smaller the testing power. If $\alpha$, $p_{d0}$, $n$ and $p_0$ are fixed, the maximum testing power is reached at $p_{d1} = 0$ and the minimum testing power at $p_{d1} = p_{d0}$.

For specified $\alpha$, $\beta$, $p_{d0}$ and $p_{d1}$, the sample size, $n$, can be calculated numerically from (3.1) and (3.2). Tables 3 and 4 give sample sizes needed to reach about 0.8 of power in similarity testing using the 2-AFC, Duo-trio, 3-AFC and Triangular methods.

Example 3.2. For the test in Example 3.1, we can now estimate the testing power for an assumed true proportion of “discriminators”. If the assumed true proportion of “discriminators” is $p_{d1} = 0.05$, i.e., $p_{c1} = p_{d1} + p_0(1 - p_{d1}) = 0.05 + \frac{1}{3} \times (1 - 0.05) = 0.367$, then according to (3.2) the testing power should be

$$\text{Power} = 1 - \beta = \sum_{x=0}^{37} \binom{100}{x} 0.367^{x} (1 - 0.367)^{100-x} = 0.57$$

However, if the assumed true proportion of “discriminators” is $p_{d1} = 0.1$, i.e., $p_{c1} = p_{d1} + p_0(1 - p_{d1}) = 0.1 + \frac{1}{3} \times (1 - 0.1) = 0.4$, the testing power is only

$$\text{Power} = 1 - \beta = \sum_{x=0}^{37} \binom{100}{x} 0.4^{x} (1 - 0.4)^{100-x} = 0.31$$

In order to reach 0.8 of testing power using the 3-AFC method for $p_{d0} = 0.2$, $p_{d1} = 0.05$, $\alpha = 0.05$, we can find from Table 4 that the number of panelists should be at least 160.
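The power values in Example 3.2 follow from the binomial distribution function evaluated at the critical number. A minimal sketch, assuming the binomial form of (3.2) with the coefficient $\binom{n}{x}$ and taking $c_0 = 37$ from Table 2, is given below; the function name `similarity_power` is hypothetical.

```python
from math import exp, lgamma, log


def binom_cdf(c: int, n: int, p: float) -> float:
    """P(X <= c) for X ~ Binomial(n, p), with terms computed on the log scale."""
    total = 0.0
    for x in range(0, c + 1):
        log_term = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                    + x * log(p) + (n - x) * log(1.0 - p))
        total += exp(log_term)
    return min(total, 1.0)


def similarity_power(n: int, c0: int, pd1: float, p_guess: float) -> float:
    """Power from (3.2): chance of observing at most c0 correct responses
    when the true proportion of discriminators is pd1."""
    pc1 = pd1 + p_guess * (1.0 - pd1)
    return binom_cdf(c0, n, pc1)


# 3-AFC test of Example 3.1: n = 100, critical number 37 from Table 2.
power_005 = similarity_power(100, 37, 0.05, 1.0 / 3.0)
power_010 = similarity_power(100, 37, 0.10, 1.0 / 3.0)
print(f"pd1 = 0.05: power = {power_005:.2f}")
print(f"pd1 = 0.10: power = {power_010:.2f}")
```

Scanning `similarity_power` over increasing $n$, with the critical number recomputed from (3.1) at each step, is one way to approach the sample sizes in Tables 3 and 4, although the discreteness of the binomial distribution means power does not rise smoothly with $n$.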

Table 3
Sample sizes needed to reach 0.8 of power in similarity testing using the 2-AFC and Duo-trio methods. Columns 2 to 6 are for $\alpha = 0.05$ and columns 7 to 11 are for $\alpha = 0.10$; the column headings give $p_{d0}$.

| p_d1 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 646 | 160 | 74 | 43 | 25 | 469 | 112 | 57 | 32 | 19 |
| 0.05 | 2464 | 277 | 103 | 52 | 31 | 1820 | 207 | 72 | 40 | 24 |
| 0.10 |  | 624 | 160 | 69 | 36 |  | 459 | 113 | 57 | 31 |
| 0.15 |  | 2443 | 267 | 100 | 51 |  | 1769 | 206 | 73 | 38 |
| 0.20 |  |  | 602 | 151 | 65 |  |  | 438 | 115 | 48 |
| 0.25 |  |  | 2312 | 267 | 89 |  |  | 1740 | 187 | 66 |
| 0.30 |  |  |  | 551 | 141 |  |  |  | 410 | 102 |
| 0.35 |  |  |  | 2150 | 236 |  |  |  | 1585 | 175 |
| 0.40 |  |  |  |  | 511 |  |  |  |  | 380 |
| 0.45 |  |  |  |  | 1957 |  |  |  |  | 1435 |

Table 4
Sample sizes needed to reach 0.8 of power in similarity testing using the 3-AFC and Triangular methods. Columns 2 to 6 are for $\alpha = 0.05$ and columns 7 to 11 are for $\alpha = 0.10$; the column headings give $p_{d0}$.

| p_d1 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 340 | 89 | 41 | 26 | 15 | 256 | 64 | 34 | 18 | 13 |
| 0.05 | 1355 | 160 | 61 | 31 | 19 | 990 | 122 | 43 | 23 | 14 |
| 0.10 |  | 357 | 89 | 44 | 22 |  | 262 | 73 | 30 | 17 |
| 0.15 |  | 1407 | 166 | 65 | 31 |  | 1039 | 122 | 41 | 24 |
| 0.20 |  |  | 367 | 93 | 42 |  |  | 269 | 67 | 32 |
| 0.25 |  |  | 1425 | 160 | 60 |  |  | 1058 | 112 | 45 |
| 0.30 |  |  |  | 352 | 89 |  |  |  | 266 | 66 |
| 0.35 |  |  |  | 1368 | 154 |  |  |  | 1027 | 118 |
| 0.40 |  |  |  |  | 333 |  |  |  |  | 250 |
| 0.45 |  |  |  |  | 1304 |  |  |  |  | 927 |

3.3. Comparison of one-sided interval hypotheses testing with the one-sided confidence interval for forced choice methods

In the sensory and consumer field, the confidence interval method is often used to evaluate whether the proportion of “discriminators”, $p_d$, is less than the specified allowed proportion, $p_{d0}$, or equivalently, whether the proportion of correct responses, $p_c$, is less than the allowed proportion, $p_{c0}$ (ASTM, 1997; ASTM, 2001). The upper confidence limit is usually calculated as $\hat{p} + z_{1-\alpha}\sqrt{\hat{p}(1 - \hat{p})/n}$, where $\hat{p} = c/n$, and is compared with $p_{c0}$.

Although interval hypotheses testing is connected with the confidence interval, the two approaches are not exactly the same in decision rules and numerical results. The decision rules for the two approaches are (3.3) and (3.4), respectively.

$$c < c_0 \tag{3.3}$$

$$\hat{p} + z_{1-\alpha}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p_{c0} \tag{3.4}$$

The decision rule in (3.3) is equivalent to $\hat{p} < p_{c0} - k$, i.e.,

$$\hat{p} + k < p_{c0} \tag{3.5}$$

where $k = p_{c0} - \dfrac{c_0}{n}$.

Comparing the two decision rules in (3.5) and (3.4), it is noted that $k$ in (3.5) is a constant obtained from an exact binomial distribution, while $z_{1-\alpha}\sqrt{\hat{p}(1 - \hat{p})/n}$ in (3.4) is a random variable estimated on the basis of an approximate normal distribution.
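The contrast between the two decision rules can be made concrete with the data of Example 3.1. The sketch below computes the constant $k$ of (3.5) and the data-dependent half-width of (3.4) side by side. It assumes the critical number 37 from Table 2 and uses the standard library's normal quantile; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def upper_confidence_limit(c: int, n: int, alpha: float) -> float:
    """Normal-approximation upper limit used in rule (3.4)."""
    p_hat = c / n
    z = NormalDist().inv_cdf(1.0 - alpha)
    return p_hat + z * sqrt(p_hat * (1.0 - p_hat) / n)


def exact_margin(c0: int, n: int, pc0: float) -> float:
    """Constant k = pc0 - c0/n appearing in rule (3.5)."""
    return pc0 - c0 / n


# Example 3.1: 3-AFC, n = 100, c = 35 correct, pd0 = 0.2, alpha = 0.05.
n, c, c0 = 100, 35, 37
pc0 = 0.2 + (1.0 / 3.0) * (1.0 - 0.2)

upper = upper_confidence_limit(c, n, 0.05)
k = exact_margin(c0, n, pc0)

print(f"rule (3.4): upper limit {upper:.4f} vs pc0 {pc0:.4f}")
print(f"rule (3.5): p_hat + k = {c / n + k:.4f} vs pc0 {pc0:.4f}")
print(f"half-width in (3.4): {upper - c / n:.4f}, constant k in (3.5): {k:.4f}")
```

For this data set both rules point to similarity, but the margin added to $\hat{p}$ differs between them, and the one in (3.4) changes with the observed $\hat{p}$ while $k$ does not.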

4. Similarity testing using the A–Not A and the Same–Different methods

In this section, we discuss similarity testing using the monadic A–Not A and Same–Different methods, which are discrimination methods with response bias. The methods in this design involve comparison between two independent proportions. Similarity testing for dependent proportions in the paired-design A–Not A and Same–Different methods is not discussed in this article. For techniques of similarity assessment for dependent proportions in a paired design, see, e.g., Lu and Bean (1995), Nam (1997), Tango (1998), Liu, Hsueh, Hsueh, and Chen (2002) and Tang, Tang, and Chan (2003).

4.1. Dunnett and Gent’s chi-squared test

Dunnett and Gent (1977) suggested a chi-squared test for similarity based on the data in a 2 × 2 table. Let $p_A$ and $p_N$ denote the probabilities of response “A” for sample A and for sample Not A, respectively. The null and alternative hypotheses are (4.1) and (4.2).

$$H_0: p_A - p_N = \Delta_0 \tag{4.1}$$

$$H_1: 0 \le p_A - p_N < \Delta_0 \tag{4.2}$$

where $\Delta_0$ is an allowable non-zero value defining equivalence or similarity.

It is necessary to calculate the expected proportions of response “A” for sample A and Not A, assuming a non-zero value for the true difference of the proportions, $p_A - p_N = \Delta_0$, under the null hypothesis. The expected proportions are estimated from (4.3) and (4.4).

$$\hat{p}_A = \frac{x + y + n_N \Delta_0}{n_A + n_N} \tag{4.3}$$

$$\hat{p}_N = \frac{x + y - n_A \Delta_0}{n_A + n_N} \tag{4.4}$$

where $x$ and $y$ are the observed numbers of response “A” for sample A and Not A, respectively, and $n_A$ and $n_N$ are the sample sizes for sample A and Not A. The expected number of response “A” for sample A is then $x_0 = n_A \hat{p}_A$.

Under the null hypothesis in (4.1), the test statistic is (4.5).

$$X^2 = (x - x_0)^2 \left[ \frac{1}{x_0} + \frac{1}{m - x_0} + \frac{1}{n_A - x_0} + \frac{1}{n_N - m + x_0} \right] \tag{4.5}$$

where $m = x + y$. With continuity correction, (4.5) becomes (4.6).

$$X^2 = (|x - x_0| - 0.5)^2 \left[ \frac{1}{x_0} + \frac{1}{m - x_0} + \frac{1}{n_A - x_0} + \frac{1}{n_N - m + x_0} \right] \tag{4.6}$$

The test statistic, $X^2$, follows the chi-square distribution with one degree of freedom. Because it is assumed that the proportion of response “A” for sample A is not smaller than the proportion of response “A” for sample Not A, this test is one-sided. The p-value should be obtained by dividing the tail area of the chi-square distribution by 2.

An alternative test statistic is (4.7), which follows approximately the standard normal distribution under the null hypothesis. We can reject the null hypothesis in (4.1) and accept the alternative hypothesis in (4.2) at significance level $\alpha$ if the value of the statistic is smaller than the $\alpha$ quantile of the standard normal distribution, i.e., $Z < z_\alpha$. The p-value is the probability that a standard normal variable is smaller than the observed value of $Z$. For reference, $z_{0.05} = -1.64$ and $z_{0.1} = -1.28$.

$$Z = \frac{\hat{p}_A - \hat{p}_N - \Delta_0}{\sqrt{\hat{V}(\hat{p}_A - \hat{p}_N)}} \tag{4.7}$$

where $\hat{V}(\hat{p}_A - \hat{p}_N)$ is the estimated variance of $\hat{p}_A - \hat{p}_N$ under the null hypothesis, and $\hat{p}_A - \hat{p}_N$ in the numerator is the observed difference, $x/n_A - y/n_N$. With continuity correction, (4.7) becomes (4.8).

$$Z = \frac{\hat{p}_A - \hat{p}_N - \Delta_0 + n'}{\sqrt{\hat{V}(\hat{p}_A - \hat{p}_N)}} \tag{4.8}$$

where $n' = (1/n_A + 1/n_N)/2$.

There are different methods for estimating the variance. One method is to use the expected proportions given by (4.3) and (4.4), rather than the observed proportions $x/n_A$ and $y/n_N$, to estimate the variance (see, e.g., Roadary, Com-Nougue, & Tournade, 1989). The estimated variance using the expected proportions is (4.9).

$$\hat{V}(\hat{p}_A - \hat{p}_N) = \frac{\hat{p}_A(1 - \hat{p}_A)}{n_A} + \frac{\hat{p}_N(1 - \hat{p}_N)}{n_N} \tag{4.9}$$

It should be noted that if the difference of the two observed proportions is larger than the allowable difference, i.e., $\hat{p}_A - \hat{p}_N > \Delta_0$, the similarity testing should stop, because similarity cannot be concluded at any meaningful significance level, $\alpha$, in that situation.

It can be shown algebraically that in difference testing using the A–Not A method, the chi-square test with one degree of freedom is exactly equivalent to a Z test for comparison of two independent proportions based on a normal approximation, provided that a pooled proportion is used to estimate the common variance under the null hypothesis (Snedecor & Cochran, 1989). However, these two tests are no longer exactly equivalent in similarity testing, because a different variance estimator is used in the Z test statistic.
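The equivalence in difference testing can be checked numerically. The following is a minimal sketch, not part of the original procedures: the function names are hypothetical, the counts are illustrative, and the chi-square statistic is the usual uncorrected Pearson statistic for a 2 × 2 table.

```python
from math import isclose, sqrt

def pearson_chi2(x, y, n_a, n_n):
    """Uncorrected Pearson chi-square for the 2 x 2 table of an A-Not A test."""
    n = n_a + n_n
    m = x + y  # total number of "A" responses
    return n * (x * (n_n - y) - y * (n_a - x)) ** 2 / (n_a * n_n * m * (n - m))

def pooled_z(x, y, n_a, n_n):
    """Z statistic for two independent proportions with a pooled variance."""
    p = (x + y) / (n_a + n_n)  # pooled proportion under "no difference"
    return (x / n_a - y / n_n) / sqrt(p * (1 - p) * (1 / n_a + 1 / n_n))

# Illustrative counts (assumed): 45 of 200 and 39 of 200 "A" responses.
chi2 = pearson_chi2(45, 39, 200, 200)
z = pooled_z(45, 39, 200, 200)
print(isclose(chi2, z ** 2))  # True: X^2 equals Z^2 in difference testing
```

In similarity testing the variance in the Z statistic is no longer the pooled one, so the identity above does not carry over.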

Example 4.1. To check whether a product (sample Not A) with substituted ingredients has sensory characteristics similar to those of the current product (sample A), a similarity test for the two products was conducted using a monadic A–Not A method. 200 panelists received sample A and 200 received sample Not A, i.e., $n_A = n_N = 200$. The specified allowable limit defining similarity is selected as 0.1. It means that we regard the two products as similar if the difference of the proportions of response "A" for sample A and sample Not A is not larger than 0.1.

The observed numbers of response "A" for sample A and sample Not A are $x = 45$ and $y = 39$, respectively. Hence $m = 45 + 39 = 84$. According to (4.3), the expected proportion of response "A" for sample A is

$$\hat{p}_A = \frac{x + y + n_N \Delta_0}{n_A + n_N} = \frac{45 + 39 + 200 \times 0.1}{200 + 200} = 0.26$$

Hence the expected number is $x_0 = 200 \times 0.26 = 52$. The value of the test statistic in (4.5) is then

$$X^2 = (45 - 52)^2 \left[ \frac{1}{52} + \frac{1}{84 - 52} + \frac{1}{200 - 52} + \frac{1}{200 - 84 + 52} \right] = 3.096$$

The value of 3.096 is the 0.922 quantile of the chi-square distribution with one degree of freedom. The tail area is 0.078. The p-value of the one-sided test is then 0.078/2 = 0.039. We can conclude at a 0.05 significance level that the two products are similar in terms of the 0.1 allowable limit defining equivalence.
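The chi-squared calculation above can be reproduced with a short script. This is a sketch under stated assumptions: the function name `dunnett_gent_chi2` is hypothetical, and the tail area of the chi-square distribution with one degree of freedom is obtained from the identity P(X² > c) = erfc(√(c/2)) rather than from a statistics library.

```python
import math

def dunnett_gent_chi2(x, y, n_a, n_n, delta0, continuity=False):
    """Statistic (4.5), or (4.6) with continuity correction, and one-sided p-value."""
    m = x + y
    p_a_expected = (m + n_n * delta0) / (n_a + n_n)  # eq. (4.3)
    x0 = n_a * p_a_expected                          # expected count for sample A
    deviation = abs(x - x0) - 0.5 if continuity else x - x0
    chi2 = deviation ** 2 * (
        1 / x0 + 1 / (m - x0) + 1 / (n_a - x0) + 1 / (n_n - m + x0)
    )
    # Upper tail area of the chi-square distribution with one degree of freedom.
    tail = math.erfc(math.sqrt(chi2 / 2))
    # One-sided p-value; meaningful only when the observed difference
    # x/n_a - y/n_n does not exceed delta0.
    return chi2, tail / 2

chi2, p_value = dunnett_gent_chi2(45, 39, 200, 200, 0.1)
print(round(chi2, 3), round(p_value, 3))  # 3.096 0.039
```

Passing `continuity=True` switches to the corrected statistic in (4.6), which gives a smaller statistic and hence a more conservative conclusion.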

If the statistic of normal approximation in (4.7) is used, we first calculate the expected proportions under the null hypothesis.

$$\hat{p}_A = \frac{x + y + n_N \Delta_0}{n_A + n_N} = 0.26$$

$$\hat{p}_N = \frac{x + y - n_A \Delta_0}{n_A + n_N} = \frac{45 + 39 - 200 \times 0.1}{200 + 200} = 0.16$$

According to (4.9), the estimated variance of $\hat{p}_A - \hat{p}_N$ under the null hypothesis is then

$$\hat{V}(\hat{p}_A - \hat{p}_N) = \frac{0.26 \times (1 - 0.26)}{200} + \frac{0.16 \times (1 - 0.16)}{200} = 0.00163$$

The value of the test statistic (4.7) is

$$Z = \frac{45/200 - 39/200 - 0.1}{\sqrt{0.00163}} = -1.73$$

with the associated p-value = 0.042. This shows good agreement between the chi-square approach and the normal approximation approach.
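The normal-approximation version can be sketched in the same way. The function name `similarity_z` is hypothetical; the variance is the one in (4.9), built from the expected proportions in (4.3) and (4.4), and the optional correction term follows (4.8).

```python
from math import sqrt
from statistics import NormalDist

def similarity_z(x, y, n_a, n_n, delta0, continuity=False):
    """Z statistic (4.7), or (4.8) with continuity correction, and its p-value."""
    m = x + y
    p_a_exp = (m + n_n * delta0) / (n_a + n_n)  # eq. (4.3)
    p_n_exp = (m - n_a * delta0) / (n_a + n_n)  # eq. (4.4)
    variance = (p_a_exp * (1 - p_a_exp) / n_a
                + p_n_exp * (1 - p_n_exp) / n_n)  # eq. (4.9)
    numerator = x / n_a - y / n_n - delta0       # observed difference minus limit
    if continuity:
        numerator += (1 / n_a + 1 / n_n) / 2     # correction term of eq. (4.8)
    z = numerator / sqrt(variance)
    return z, NormalDist().cdf(z)                # lower-tail p-value

z, p_value = similarity_z(45, 39, 200, 200, 0.1)
print(round(z, 2), round(p_value, 3))  # -1.73 0.042
```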

4.2. Testing power and sample size

The power for similarity testing using the A–Not A and the Same–Different methods is the probability of concluding similarity when the true difference ($\Delta_1$) of the proportions of response "A" for sample A and Not A is smaller than a specified similarity limit ($\Delta_0$) under the alternative hypothesis. It is

$$\text{Power} = 1 - \beta = P\left( \left. \frac{\hat{p}_A - \hat{p}_N - \Delta_0}{\sqrt{V_0}} < z_\alpha \,\right|\, H_1 \right) \tag{4.10}$$

where $V_0$ denotes the variance of $\hat{p}_A - \hat{p}_N$ under the null hypothesis. Eq. (4.10) is equivalent to (4.11).

$$1 - \beta = P\left( \left. \frac{\hat{p}_A - \hat{p}_N - \Delta_1}{\sqrt{V_1}} < \frac{z_\alpha \sqrt{V_0} + (\Delta_0 - \Delta_1)}{\sqrt{V_1}} \,\right|\, H_1 \right) \tag{4.11}$$

where $V_1$ denotes the variance of $\hat{p}_A - \hat{p}_N$ under the alternative hypothesis. Because $(\hat{p}_A - \hat{p}_N - \Delta_1)/\sqrt{V_1}$ is an approximate standard normal statistic under the alternative hypothesis, the testing power can be calculated from (4.12).

$$\text{Power} = 1 - \beta = P\left( Z < \frac{z_\alpha \sqrt{V_0} + (\Delta_0 - \Delta_1)}{\sqrt{V_1}} \right) \tag{4.12}$$

The variances of $\hat{p}_A - \hat{p}_N$ under the null and the alternative hypotheses are (4.13) and (4.14), respectively.

$$V_0 = \frac{p_N (1 - p_N)}{n_N} + \frac{(p_N + \Delta_0)(1 - p_N - \Delta_0)}{n_A} \tag{4.13}$$

$$V_1 = \frac{p_N (1 - p_N)}{n_N} + \frac{(p_N + \Delta_1)(1 - p_N - \Delta_1)}{n_A} \tag{4.14}$$

From (4.12), we can see that in order to calculate a testing power, the values of six quantities, $\alpha$, $\Delta_0$, $\Delta_1$, $p_N$, $n_A$ and $n_N$, should be given or assumed. The larger the values of $\alpha$, $\Delta_0$, $n_A$ and $n_N$ are, the larger the testing power is. On the other hand, the smaller the values of $\Delta_1$ and $p_N$ are, the larger the testing power is. A small $p_N$ means that panelists have a small probability of response "A" for sample Not A. A small $\Delta_1$ value means that the difference between the two true probabilities $p_A$ and $p_N$ under the alternative hypothesis is small. Testing power is the complement of the Type II error probability $\beta$, which is the probability of failing to reject the null hypothesis of inequivalence when the two true probabilities $p_N$ and $p_A$ are in fact similar.

From (4.12), a sample size formula can be derived as (4.15).

$$n_N = \left[ \frac{z_{1-\beta} \sqrt{V'_1} + z_{1-\alpha} \sqrt{V'_0}}{\Delta_0 - \Delta_1} \right]^2 \tag{4.15}$$

where

$$V'_0 = p_N (1 - p_N) + (p_N + \Delta_0)(1 - p_N - \Delta_0)/h$$

$$V'_1 = p_N (1 - p_N) + (p_N + \Delta_1)(1 - p_N - \Delta_1)/h$$

$$h = n_A / n_N$$

The ratio of sample sizes of sample A and sample Not A, i.e., $h$, should be predetermined. The same sample size for sample A and sample Not A, i.e., $h = 1$, is often adopted.
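The sample size formula (4.15) can be sketched as a small function. The name `sample_size_a_not_a` and the choice to round up to the next integer are assumptions of this sketch, and the input values are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_a_not_a(alpha, power, delta0, delta1, p_n, h=1.0):
    """Sample sizes (n_N, n_A) from eq. (4.15), with h = n_A / n_N."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    v0 = p_n * (1 - p_n) + (p_n + delta0) * (1 - p_n - delta0) / h  # V'_0
    v1 = p_n * (1 - p_n) + (p_n + delta1) * (1 - p_n - delta1) / h  # V'_1
    n_n = ((z(power) * sqrt(v1) + z(1 - alpha) * sqrt(v0))
           / (delta0 - delta1)) ** 2
    # Round up so that the target power is at least reached (assumed convention).
    return ceil(n_n), ceil(h * n_n)

# Illustrative inputs: alpha = 0.1, power = 0.8, limit 0.2, true difference 0.1.
print(sample_size_a_not_a(0.1, 0.8, 0.2, 0.1, 0.2))  # (175, 175)
```

A value of `h` other than 1 gives an unequal allocation; for example, `h=2.0` assigns twice as many panelists to sample A as to sample Not A.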

Example 4.2. In Example 4.1, $n_A = n_N = 200$. If $\alpha = 0.1$ and $\Delta_0 = 0.2$ are selected, and $p_N = 0.2$ and $\Delta_1 = 0.1$ are assumed, the testing power can be calculated. The variances of $\hat{p}_A - \hat{p}_N$ under the null and the alternative hypotheses are

$$V_0 = \frac{0.2 \times (1 - 0.2)}{200} + \frac{(0.2 + 0.2)(1 - 0.2 - 0.2)}{200} = 0.002$$

$$V_1 = \frac{0.2 \times (1 - 0.2)}{200} + \frac{(0.2 + 0.1)(1 - 0.2 - 0.1)}{200} = 0.00185$$

According to (4.12), the power should be

$$1 - \beta = P\left( Z < \frac{-1.28 \times \sqrt{0.002} + (0.2 - 0.1)}{\sqrt{0.00185}} \right) = P(Z < 0.992) = 0.84$$
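The power calculation of this example can be reproduced with a few lines. This is a minimal sketch: the function name `similarity_power` is hypothetical, and the loop over sample sizes is added only to illustrate how power grows with $n_A$ and $n_N$.

```python
from math import sqrt
from statistics import NormalDist

def similarity_power(alpha, delta0, delta1, p_n, n_a, n_n):
    """Power from eq. (4.12), using the variances in (4.13) and (4.14)."""
    nd = NormalDist()
    v0 = p_n * (1 - p_n) / n_n + (p_n + delta0) * (1 - p_n - delta0) / n_a
    v1 = p_n * (1 - p_n) / n_n + (p_n + delta1) * (1 - p_n - delta1) / n_a
    bound = (nd.inv_cdf(alpha) * sqrt(v0) + (delta0 - delta1)) / sqrt(v1)
    return nd.cdf(bound)

power_200 = similarity_power(0.1, 0.2, 0.1, 0.2, 200, 200)
print(round(power_200, 2))  # 0.84

# Power for a few equal group sizes, all other inputs unchanged.
for n in (100, 200, 400):
    print(n, round(similarity_power(0.1, 0.2, 0.1, 0.2, n, n), 3))
```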

For the same situation, $\alpha = 0.1$, $\Delta_0 = 0.2$, $p_N = 0.2$ and $\Delta_1 = 0.1$, the sample size needed to reach 0.8 of power can be calculated from (4.15). Because

$$V'_0 = 0.2 \times (1 - 0.2) + (0.2 + 0.2)(1 - 0.2 - 0.2)/1 = 0.4$$

$$V'_1 = 0.2 \times (1 - 0.2) + (0.2 + 0.1)(1 - 0.2 - 0.1)/1 = 0.37$$

the sample size for sample A and Not A should be

$$n_A = n_N = \left[ \frac{0.84 \times \sqrt{0.37} + 1.28 \times \sqrt{0.4}}{0.2 - 0.1} \right]^2 \approx 175$$

5. Similarity testing for preference proportions

In order to do the tests, we first estimate the proportions of preferences, $\hat{p}_a$ and $\hat{p}_b$, as well as the covariance of $\hat{p}_a$ and $\hat{p}_b$, where $\hat{p}_a$ and $\hat{p}_b$ are the estimates of the proportions preferring product A and B, with $\hat{p}_a + \hat{p}_b \le 1$. There are different models for estimating preference proportions and their variances and covariance; see, e.g., Ferris (1958), Bliss (1960), Gridgeman (1960), Horsnell (1969, 1977), Wierenga (1974), Hutchinson (1979) and Ennis & Bi (1999). A detailed discussion of the estimation problem for preference proportions is beyond the scope of this article. One point that should be noted is that estimating a consumer preference proportion in a preference test is quite different from, and more complicated than, estimating the proportion of "discriminators" in a discrimination test, because two independent parameters rather than one are involved in a preference test. There does not seem to be a simple non-replicated procedure for estimating preference proportions. A well-known procedure is Ferris's k-visit method, which uses maximum likelihood estimation (Ferris, 1958). In this method, a "no preference" option is allowed. For the 2-visit method, each consumer panelist is either visited twice or asked to judge the same pair of products A and B twice. The N panelists can then be classified into nine different categories according to their responses in the two judgements. Based on these data, the preference proportions and their covariance matrix can be estimated.

5.1. Schuirmann’s two one-sided test

The objective is to test whether the difference of the proportions preferring product A and B, $|p_a - p_b|$, is smaller than a specified allowed value, $\Delta_0$. This test involves two sets of one-sided hypotheses:

$$H_{01}: p_a - p_b \le -\Delta_0 \quad \text{versus} \quad H_{11}: p_a - p_b > -\Delta_0 \tag{5.1}$$

and

$$H_{02}: p_a - p_b \ge \Delta_0 \quad \text{versus} \quad H_{12}: p_a - p_b < \Delta_0 \tag{5.2}$$

The first set of hypotheses in (5.1) is to test for non-inferiority of product A to product B. The second set of hypotheses in (5.2) is to test for non-superiority of product A to product B. We can declare that the two products are similar in preference if and only if both $H_{01}$ and $H_{02}$ are rejected at a significance level $\alpha$.

The test statistics are (5.3) and (5.4), which follow approximately a standard normal distribution.

$$Z_1 = \frac{(\hat{p}_a - \hat{p}_b) + \Delta_0}{\hat{\sigma}} \tag{5.3}$$

$$Z_2 = \frac{(\hat{p}_a - \hat{p}_b) - \Delta_0}{\hat{\sigma}} \tag{5.4}$$

where $\hat{\sigma} = \sqrt{V(\hat{p}_a) + V(\hat{p}_b) - 2\,\mathrm{Cov}(\hat{p}_a, \hat{p}_b)}$. We can conclude that the two proportions of preferences for products A and B in a specified consumer population are equivalent if $Z_1 > z_{1-\alpha}$ and $Z_2 < z_\alpha$.

Example 5.1. We use the results in Ferris (1958). In this example, a "no preference" option is allowed in the preference testing. The estimated values are $\hat{p}_a = 0.4968$, $\hat{p}_b = 0.3702$, $V(\hat{p}_a) = 0.000296$, $V(\hat{p}_b) = 0.000277$ and $\mathrm{Cov}(\hat{p}_a, \hat{p}_b) = -0.000198$. If $\Delta_0 = 0.2$ and $\alpha = 0.05$ are specified, the values of the test statistics in (5.3) and (5.4) can be obtained.

$$\hat{\sigma} = \sqrt{0.000296 + 0.000277 + 2 \times 0.000198} = 0.0311$$

$$Z_1 = \frac{(0.4968 - 0.3702) + 0.2}{0.0311} = 10.5$$

and

$$Z_2 = \frac{(0.4968 - 0.3702) - 0.2}{0.0311} = -2.36$$

Because $Z_1 > z_{0.95} = 1.64$ and $Z_2 < z_{0.05} = -1.64$, we can conclude that the two products are similar in preference.

If we specified $\Delta_0 = 0.1$ rather than $\Delta_0 = 0.2$, then

$$Z_1 = \frac{(0.4968 - 0.3702) + 0.1}{0.0311} = 7.29 \quad \text{and} \quad Z_2 = \frac{(0.4968 - 0.3702) - 0.1}{0.0311} = 0.86$$

Because $Z_1 > z_{0.95} = 1.64$ but $Z_2 > z_{0.05} = -1.64$, we cannot conclude that the two products are similar in preference. We can, however, claim that product A is non-inferior to product B at a 0.05 significance level. Further information is needed to determine whether product A is similar or superior to product B.
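The two one-sided tests of this example can be sketched as follows. The function name `tost_preference` and the returned tuple layout are assumptions of this sketch. The critical values come from the exact normal quantile (1.645) and the unrounded $\hat{\sigma}$, so the statistics may differ in the last digit from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def tost_preference(p_a, p_b, var_a, var_b, cov_ab, delta0, alpha=0.05):
    """Statistics (5.3) and (5.4) and the resulting one-sided decisions."""
    sigma = sqrt(var_a + var_b - 2 * cov_ab)
    z1 = (p_a - p_b + delta0) / sigma        # non-inferiority statistic, (5.3)
    z2 = (p_a - p_b - delta0) / sigma        # non-superiority statistic, (5.4)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    non_inferior = z1 > z_crit               # reject H01
    non_superior = z2 < -z_crit              # reject H02
    return z1, z2, non_inferior, non_superior

estimates = (0.4968, 0.3702, 0.000296, 0.000277, -0.000198)
for delta0 in (0.2, 0.1):
    z1, z2, non_inf, non_sup = tost_preference(*estimates, delta0)
    similar = non_inf and non_sup            # similarity needs both rejections
    print(delta0, round(z1, 2), round(z2, 2), non_inf, non_sup, similar)
```

With a limit of 0.2 both one-sided tests reject, while with 0.1 only the non-inferiority test does, which matches the conclusions drawn in the example.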

5.2. Testing power and sample size

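Let $\hat{p} = \hat{p}_a - \hat{p}_b$. In the interval hypotheses testing for preference proportions based on Eqs. (5.3) and (5.4), the null hypothesis of difference will be rejected and the alternative hypothesis of similarity will be accepted at the $\alpha$ level of significance if $(\hat{p} + \Delta_0)/\hat{\sigma} > z_{1-\alpha}$ and $(\hat{p} - \Delta_0)/\hat{\sigma} < z_\alpha$. In other words, the rejection region is

$$-\Delta_0 + z_{1-\alpha} \hat{\sigma} < \hat{p} < \Delta_0 + z_\alpha \hat{\sigma} \tag{5.5}$$

The testing power is the probability of correctly concluding similarity, i.e., the probability that $\hat{p}$ falls into the rejection region. If the true difference of the two preference proportions is $p_0$, the testing power is then

$$\text{Power} = 1 - \beta = \Pr\{ -\Delta_0 + z_{1-\alpha} \hat{\sigma} - p_0 < \hat{p} - p_0 < \Delta_0 + z_\alpha \hat{\sigma} - p_0 \},$$

i.e.,

$$\text{Power} = 1 - \beta = \Pr\left\{ \frac{-\Delta_0 + z_{1-\alpha} \hat{\sigma} - p_0}{\hat{\sigma}} < \frac{\hat{p} - p_0}{\hat{\sigma}} < \frac{\Delta_0 + z_\alpha \hat{\sigma} - p_0}{\hat{\sigma}} \right\} \tag{5.6}$$

Because $(\hat{p} - p_0)/\hat{\sigma}$ follows asymptotically the standard normal distribution, the testing power is as in (5.7).

$$\text{Power} = 1 - \beta = \Phi(a) - \Phi(b) \tag{5.7}$$

where $a = \dfrac{\Delta_0 + z_\alpha \hat{\sigma} - p_0}{\hat{\sigma}}$, $b = \dfrac{-\Delta_0 + z_{1-\alpha} \hat{\sigma} - p_0}{\hat{\sigma}}$, $\Phi$ is the standard normal distribution function, and $\beta$ is the probability of failing to reject the null hypothesis of difference when it is false.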

Note that the magnitude of $\hat{\sigma}$ depends on the method used to estimate the preference proportions. In each of the methods, $\hat{\sigma}$ contains a component of sample size $N$. Hence $\hat{\sigma} = \hat{\sigma}_0 / \sqrt{N}$, where $\hat{\sigma}_0$ is a component of $\hat{\sigma}$ independent of sample size.

The testing power for non-inferiority is given in (5.8) and for non-superiority in (5.9).

$$\text{Power} = 1 - \beta = 1 - \Phi(b) \tag{5.8}$$

$$\text{Power} = 1 - \beta = \Phi(a) \tag{5.9}$$
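The three power expressions (5.7) to (5.9) can be sketched together. The function name `tost_power`, the dictionary keys and the input values are illustrative assumptions. The floor at zero is also an addition of this sketch: it covers the case where the rejection region in (5.5) is empty, which the formulas above do not address explicitly.

```python
from math import sqrt
from statistics import NormalDist

def tost_power(delta0, alpha, sigma0, n, p0):
    """Power for similarity (5.7), non-inferiority (5.8) and non-superiority (5.9)."""
    nd = NormalDist()
    sigma = sigma0 / sqrt(n)  # sigma-hat = sigma-hat_0 / sqrt(N)
    a = (delta0 + nd.inv_cdf(alpha) * sigma - p0) / sigma
    b = (-delta0 + nd.inv_cdf(1 - alpha) * sigma - p0) / sigma
    return {
        # Floored at 0 (assumption): if a < b, the rejection region is empty.
        "similarity": max(0.0, nd.cdf(a) - nd.cdf(b)),
        "non_inferiority": 1 - nd.cdf(b),
        "non_superiority": nd.cdf(a),
    }

# Illustrative inputs: limit 0.1, alpha 0.05, sigma0 0.2, N 100, true difference 0.05.
powers = tost_power(0.1, 0.05, 0.2, 100, 0.05)
print({name: round(value, 3) for name, value in powers.items()})
```

For these inputs the similarity power is about 0.80 and is driven almost entirely by the non-superiority side, because the assumed true difference lies close to the upper limit.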

For specified $\Delta_0$, $\alpha$, $\beta$ and assumed $p_0$ and $\hat{\sigma}_0$, we can estimate the effective sample size, $N$, from (5.10), which is derived from (5.6).

$$N \approx \frac{\hat{\sigma}_0^2 (z_{1-\beta} + z_{1-\alpha})^2}{(\Delta_0 - p_0)^2} \tag{5.10}$$

Example 5.2. For $\Delta_0 = 0.1$, $\alpha = 0.05$, $\hat{\sigma}_0 = 0.2$ and $N = 100$, if the true difference of preference proportions is assumed to be 0.05, then according to (5.7) the power of the similarity testing for preference can be calculated as follows:

$$a = \frac{0.1 - 1.645 \times 0.2/\sqrt{100} - 0.05}{0.2/\sqrt{100}} = 0.855$$

$$b = \frac{-0.1 + 1.645 \times 0.2/\sqrt{100} - 0.05}{0.2/\sqrt{100}} = -5.855$$

Hence $\text{Power} = 1 - \beta = \Phi(0.855) - \Phi(-5.855) = 0.80$.

On the other hand, for $\Delta_0 = 0.1$, $\alpha = 0.05$ and $\hat{\sigma}_0 = 0.2$, suppose we want to know the sample size, i.e., the number of panelists, needed to reach 0.8 of power for a preference testing. From (5.10), we can estimate

$$N \approx \frac{0.2^2 \times (0.842 + 1.645)^2}{(0.1 - 0.05)^2} \approx 99$$

The value of $\hat{\sigma}_0$ should be obtained from some prior information or from a small pilot experiment.
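The sample size estimate from (5.10) can be sketched in a few lines. The function name `tost_sample_size` and the rounding up to a whole number of panelists are assumptions of this sketch. The alternative values of $\hat{\sigma}_0$ in the loop are hypothetical and only show how strongly the required $N$ depends on that quantity.

```python
from math import ceil
from statistics import NormalDist

def tost_sample_size(delta0, alpha, power, sigma0, p0):
    """Approximate number of panelists N from eq. (5.10), rounded up."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    n = sigma0 ** 2 * (z(power) + z(1 - alpha)) ** 2 / (delta0 - p0) ** 2
    return ceil(n)

print(tost_sample_size(0.1, 0.05, 0.8, 0.2, 0.05))  # 99

# Hypothetical alternative values of sigma0, e.g. from different pilot studies.
for sigma0 in (0.15, 0.2, 0.3):
    print(sigma0, tost_sample_size(0.1, 0.05, 0.8, sigma0, 0.05))
```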

6. Conclusions

Similarity testing is an important methodology of sensory and consumer research. The "power approach", which is prevalent in the sensory and consumer field, is inadequate because it contorts the logic of hypothesis testing. One weakness of this approach is that, for a large sample size and a small measurement error, it is unlikely to draw a conclusion of similarity even when the difference is slight and the products are effectively equivalent. As an alternative, interval hypotheses testing, which was originally developed for assessment of bioequivalence in clinical and pharmaceutical fields, is a valid approach for similarity testing. Based on this approach, some procedures for similarity testing are discussed and proposed for different situations in sensory and consumer research. The procedures include similarity testing for hedonic or intensity rating means, similarity testing using conventional discrimination methods and similarity testing for preference proportions. Tables of maximum number of correct responses for the similarity tests using the 2-AFC, Duo-trio, 3-AFC and Triangular methods are provided in the paper. Tables of sample sizes needed to reach 0.8 of power in the similarity tests using the forced choice methods are also provided.

Acknowledgements

The author would like to thank the Editor and two

anonymous referees for their constructive comments on

the earlier version of the article.

References

Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. British Medical Journal, 311, 485.

Anderson, S., & Hauck, W. W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics, A.12, 2663–2692.

ASTM (1997). Standard test method for sensory analysis – Triangle test. ASTM E1885-97.

ASTM (2001). Standard test method for directional difference test. ASTM E2164-01.

Barker, L., Rolka, H., Rolka, D., & Brown, C. (2001). Equivalence testing for binomial random variables: Which test to use? The American Statistician, 55(4), 279–287.

Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection–union tests and equivalence confidence sets (with Discussion). Statistical Science, 11, 283–319.

Blackwelder, W. C. (1982). "Proving the null hypothesis" in clinical trials. Controlled Clinical Trials, 3, 345–353.


Bliss, C. I. (1960). Some statistical aspects of preference and related tests. Applied Statistics, 9, 8–19.

Carr, B. T. (1995). Confidence intervals in the analysis of sensory discrimination tests – The integration of similarity and difference testing. In Proceedings of 4th AgoStat Dijon, 7–8 December 1995, pp. 23–31.

Chow, S. C., & Lui, J. P. (1992). Design and analysis of bioavailability and bioequivalence studies. New York: Marcel Dekker.

Dunnett, C. W., & Gent, M. (1977). Significance testing to establish equivalence between treatments, with special reference to data in the form of 2 × 2 tables. Biometrics, 33, 593–602.

Ennis, D. M., & Bi, J. (1999). The Dirichlet-multinomial model: Accounting for intertrial variation in replicated ratings. Journal of Sensory Studies, 14, 321–345.

FDA (1992). Bioavailability and Bioequivalence Requirements. In US code of federal regulations (Vol. 21). Washington, DC: US Government Printing Office.

Ferris, G. E. (1958). The k-visit method of consumer testing. Biometrics, 14, 39–49.

Gacula, M. C., Jr. (1991). Claim substantiation for sensory equivalence and superiority. In H. T. Lawless & B. P. Klein (Eds.), Sensory science theory and applications in foods. New York: Marcel Dekker.

Gridgeman, N. T. (1960). Statistics and taste testing. Applied Statistics, 9, 103–112.

Hauck, W. W., & Anderson, S. (1984). A new statistical procedure for testing equivalence in two-group comparative bioavailability trials. Journal of Pharmacokinetics and Biopharmaceutics, 12, 72–78.

Hauschke, D., Steinijans, V. W., & Diletti, E. (1990). A distribution-free procedure for the statistical analyses of bioequivalence studies. International Journal of Clinical Pharmacology and Therapeutics and Toxicology, 28, 72–78.

Horsnell, G. (1969). A theory of consumer behavior derived from repeat paired preference testing (with Discussion). Journal of the Royal Statistical Society Series A, 132, 164–192.

Horsnell, G. (1977). Paired comparison product testing when individual preferences are stochastic: An alternative model. Applied Statistics, 26, 162–172.

Hsu, J. C., Hwang, J. T. G., Lui, H.-K., & Ruberg, S. J. (1994). Confidence intervals associated with tests for bioequivalence. Biometrika, 81, 103–114.

Hutchinson, T. P. (1979). A comment on replicated paired comparisons. Applied Statistics, 28, 163–169.

Liu, J.-P., Hsueh, H.-M., Hsueh, E., & Chen, J. J. (2002). Tests for equivalence or non-inferiority for paired binary data. Statistics in Medicine, 21, 231–245.

Lu, Y., & Bean, J. A. (1995). On the sample size for one-sided equivalence of sensitivities based upon McNemar's test. Statistics in Medicine, 14, 1831–1839.

MacRae, A. W. (1995). Confidence intervals for the triangle test can give reassurance that products are similar. Food Quality and Preference, 6, 61–67.

Mandallaz, D., & Mau, J. (1981). Comparison of different methods for decision-making in bioequivalence assessment. Biometrics, 37, 213–222.

Meilgaard, M., Civille, G. V., & Carr, B. T. (1991). Sensory evaluation techniques (2nd ed.). Boca Raton: CRC Press.

Metzler, C. M., & Haung, D. C. (1983). Statistical methods for bioavailability and bioequivalence. Clinical Research and Practices & Drug Regulatory Affairs, 1, 109–132.

Nam, J.-M. (1997). Establishing equivalence of two treatments and sample size requirements in matched–paired design. Biometrics, 53, 1422–1430.

O'Quigley, J., & Baudoin, C. (1988). General approaches to the problem of bioequivalence. The Statistician, 37, 51–58.

Patel, H. I., & Gupta, G. D. (1984). A problem of equivalence in clinical trials. Biometrical Journal, 26, 471–474.

Rashid, M. M. (2003). Rank-based tests for non-inferiority and equivalence hypotheses in multi-centre clinical trials using mixed models. Statistics in Medicine, 22, 291–311.

Roadary, C., Com-Nougue, C., & Tournade, M.-F. (1989). How to establish equivalence between treatments: a one sided clinical trial in paediatric oncology. Statistics in Medicine, 8, 593–598.

Rocke, D. M. (1984). On testing for bioequivalence. Biometrics, 40, 225–230.

Rodda, B. E., & Davis, R. L. (1980). Determining the probability of an important difference in bioavailability. Clinical Pharmacology & Therapeutics, 28, 247–252.

Schuirmann, D. J. (1981). On hypothesis testing to determine if the mean of a normal distribution is contained in a known interval. Biometrics, 37, 617.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.

Selwyn, M. R., Dempster, A. P., & Hall, N. R. (1981). A Bayesian approach to bioequivalence for the 2 × 2 changeover design. Biometrics, 37, 11–21.

Senn, S. J. (1997). Statistical issues in drug development. Chichester: Wiley.

Snedecor, G. W., & Cochran, G. C. (1989). Statistical methods (8th ed.). Ames: Iowa State University Press.

Steinijans, V. W., & Diletti, E. (1983). Statistical analysis of bioavailability studies: Parametric and nonparametric confidence intervals. European Journal of Clinical Pharmacology, 24, 127–136.

Tang, N.-S., Tang, M.-L., & Chan, I. S. F. (2003). On tests of equivalence via non-unity relative risk for match–pair design. Statistics in Medicine, 22, 1217–1233.

Tango, T. (1998). Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine, 17, 891–908.

Westlake, W. J. (1972). Use of confidence intervals in analysis of comparative bioavailability trials. Journal of Pharmaceutical Science, 61, 1340–1341.

Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics, 32, 741–744.

Westlake, W. J. (1979). Statistical aspects of comparative bioavailability trials. Biometrics, 35, 273–280.

Wierenga, B. (1974). Paired comparison product testing when individual preferences are stochastic. Applied Statistics, 23, 384–396.