Food Quality and Preference 16 (2005) 139–149
www.elsevier.com/locate/foodqual
Similarity testing in sensory and consumer research
Jian Bi
Sensometrics Research and Service, 9212 Groomfield Road, Richmond, VA 23236, USA
Available online 5 May 2004
Abstract
Similarity testing is an important methodology in sensory and consumer research, of both practical and theoretical significance. In many comparative experiments in the sensory and consumer field, the objective is not to demonstrate difference but to demonstrate equivalence or similarity between treatments. It is widely acknowledged that the conventional hypothesis test with the null hypothesis of "no difference" between two treatments is appropriate for difference testing but inappropriate for similarity testing. Schuirmann [Biometrics 37 (1981) 617] introduced the use of interval hypotheses testing in the context of bioequivalence. In interval hypotheses testing, the null hypothesis is that the difference between two treatments is larger than a specified non-zero value defining equivalence or similarity, and the alternative hypothesis is that the difference is smaller than the specified value. If the null hypothesis is rejected, equivalence or similarity is concluded. Based on this concept and approach, the present paper discusses some statistical procedures for similarity testing in different situations in sensory and consumer research. The procedures include similarity testing for hedonic or intensity rating means, similarity testing using conventional discrimination methods (forced choice methods and methods with response bias) and similarity testing for preference proportions.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: Similarity testing; Interval hypotheses testing; Anderson and Hauck’s non-central t test; Dunnett and Gent’s chi-squared test;
Schuirmann’s two one-sided tests
1. Introduction
1.1. Similarity evaluation
There are many situations in sensory and consumer research where the objective is not to demonstrate difference but to demonstrate similarity between treatments. For example, a manufacturer may replace a chemical with another substance hoping that the finished product will maintain the same perceived intensity of certain sensory characteristics.
Proof of exact equality is theoretically impossible. Similarity testing merely demonstrates statistically that the difference between two products being compared is smaller than the allowed difference in intensity or preference. Similarity evaluation is of practical and theoretical importance. It has wide applications not only in the sensory and consumer field (Gacula, 1991; Meilgaard, Civille, & Carr, 1991) but also in many other fields, e.g., the clinical and pharmaceutical fields (Metzler & Haung, 1983).
doi:10.1016/j.foodqual.2004.03.003
It is widely acknowledged that the conventional hypothesis testing used for difference testing is inappropriate in the context of similarity (see, e.g., Anderson & Hauck, 1983; Berger & Hsu, 1996; Blackwelder, 1982; Dunnett & Gent, 1977; Schuirmann, 1987; Westlake, 1972, 1979). The basic difficulty is that, according to the logic of statistical hypothesis testing, the null hypothesis of no difference can never be proved or established; it can only be disproved (see, e.g., Blackwelder, 1982). Absence of evidence of difference is not evidence of similarity (Altman & Bland, 1995). Lack of significance may merely be the result of inadequate sample size, while a trivial difference may be statistically significant with a large sample size. Similarity testing therefore requires a different way of thinking about statistical hypothesis testing and different statistical models.
The problem of similarity evaluation has been discussed extensively since it was introduced to applied statistics by Westlake (1972). Considerable efforts have been made in developing the methodology of similarity evaluation in many fields, especially in the clinical and pharmaceutical fields. Many methods have been developed and proposed (see, e.g., Chow & Lui, 1992; Metzler & Haung, 1983, for an overview). The methods include the confidence interval approach (see, e.g., Carr, 1995; MacRae, 1995; Westlake, 1972, 1976), interval hypotheses testing (see, e.g., Anderson & Hauck, 1983; Patel & Gupta, 1984; Rocke, 1984; Schuirmann, 1981, 1987), the Bayesian approach (see, e.g., Mandallaz & Mau, 1981; Rodda & Davis, 1980; Selwyn, Dempster, & Hall, 1981) and non-parametric methods (see, e.g., Hauschke, Steinijans, & Diletti, 1990; Rashid, 2003; Steinijans & Diletti, 1983). For the connection between similarity tests and confidence intervals, see, e.g., O’Quigley and Baudoin (1988), Hsu, Hwang, Lui, and Ruberg (1994) and Berger and Hsu (1996).
This article concentrates on interval hypotheses testing for assessing similarity. Its objective is to discuss similarity testing in four specific application areas in sensory and consumer research: hedonic or intensity ratings, forced choice methods, A–Not A/Same–Different methods and consumer preference proportions. The article is motivated by the fact that some inappropriate procedures, e.g., the so-called "power approach", which has been abandoned in some fields (Senn, 1997, p. 319–320), are still widely used in the sensory and consumer field, while some valid concepts and procedures, e.g., interval hypotheses testing, which has been extensively discussed and well developed in the clinical and pharmaceutical fields, are not well known in the sensory and consumer field.
1.2. "Power approach"
In the sensory and consumer field, the most widely used method for similarity testing is the so-called "power approach" (the "power" here refers to difference testing power rather than similarity testing power). In the power approach, a small Type II error ($\beta$), i.e., a large power ($1-\beta$), is selected for a specified allowed difference, $\Delta_0$. A sample size is then determined to ensure this large power to detect the difference. If the null hypothesis of no difference is not rejected, similarity is concluded. The approach is based on the logic that if a difference is larger than the specified allowed difference, it should likely be detected and the null hypothesis of no difference should likely be rejected. On the other hand, if a difference is smaller than the specified allowed difference, the null hypothesis should likely not be rejected.
At one time, the "power approach" was a standard method in bioequivalence testing. However, due to its unsuitability, the approach was finally abandoned in the evaluation requirements of the US Food and Drug Administration (FDA, 1992, Chap. 320). Some authors, e.g., Schuirmann (1987), have shown in a detailed examination that the power approach is quite inadequate for similarity testing. One reason is that this approach contorts the logic of hypothesis testing: according to that logic, we cannot prove and accept the null hypothesis in any situation. A further weakness is that, with a large sample size and a small measurement error, the method is unlikely to conclude similarity even when the difference is slight and the products are effectively equivalent. For example, in a similarity test using the triangle method, if $\alpha = 0.1$ and $\beta = 0.05$ are selected for a specified proportion of discriminators $p_d = 0.3$ (i.e., a proportion of correct responses $p_c = 0.533$), the required sample size is 54 (see ASTM E1885-97). Assuming that the true proportion of discriminators is $p_d = 0.1$ (i.e., $p_c = 0.4$), a reasonable similarity test should confirm similarity with high probability. However, a simulation experiment using 5000 sets of binomial data with sample size $n = 54$ and parameter $p = 0.4$ shows that this similarity test has only a 0.49 chance of confirming similarity. If the sample size is 540 instead of 54, the same testing procedure has only a 0.02 chance of concluding similarity. The simulation results show that the larger the sample size, the smaller the probability of concluding similarity, even though the true difference is small. Hence the power approach to similarity testing is problematic.
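The sample-size effect described above can be checked with an exact binomial calculation instead of a simulation. The following Python sketch is only an illustration: it assumes the difference test rejects when the number of correct responses reaches the smallest count whose upper-tail probability under guessing ($p_0 = 1/3$) does not exceed $\alpha$, and all function names are hypothetical. The exact values depend on how this critical value is chosen and need not coincide with the simulated figures quoted above.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p); zero when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def difference_critical_value(n, p_guess, alpha):
    """Smallest count c whose upper tail P(X >= c) under guessing is <= alpha."""
    for c in range(n + 1):
        if 1.0 - binom_cdf(c - 1, n, p_guess) <= alpha:
            return c
    return n + 1  # the difference test can never reject


def prob_similarity_by_power_approach(n, p_true, p_guess=1 / 3, alpha=0.1):
    """Probability that the difference test is NOT significant,
    which the power approach interprets as 'similar'."""
    c = difference_critical_value(n, p_guess, alpha)
    return binom_cdf(c - 1, n, p_true)


# Triangle test, true proportion of correct responses 0.4 (p_d = 0.1)
for n in (54, 540):
    print(n, round(prob_similarity_by_power_approach(n, 0.4), 3))
```

Under these assumptions the probability of concluding similarity falls sharply when the sample size grows from 54 to 540, which is the behaviour criticised in the text.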
1.3. Interval hypotheses testing and confidence interval
There is a consensus in the statistical literature that interval hypotheses testing is a suitable approach for similarity testing (see, e.g., Anderson & Hauck, 1983; Barker, Rolka, Rolka, & Brown, 2001; Chow & Lui, 1992; Schuirmann, 1981, 1984; Rocke, 1984; Schuirmann, 1987). In this approach the null hypothesis is that the difference between two treatments is equal to or larger than a specified allowed difference, and the alternative hypothesis is that the difference is smaller than the specified value, i.e.,
$$
\begin{aligned}
H_0 &: |\mu_1 - \mu_2| \ge \Delta_0 \\
H_1 &: -\Delta_0 < \mu_1 - \mu_2 < \Delta_0
\end{aligned} \tag{1.1}
$$
If the null hypothesis is rejected, the alternative hypothesis (i.e., similarity between the two products being compared) can be concluded. In interval hypotheses testing, the Type I error, $\alpha$, is the probability of concluding similarity when the two treatments are in fact different; $\alpha = 0.05$ or 0.1 is usually selected. The Type II error, $\beta$, is the probability of failing to conclude similarity when the two treatments are similar. The power of a similarity test, $1-\beta$, is then the probability of correctly rejecting the null hypothesis of difference and accepting the alternative hypothesis of similarity when the two treatments are similar; $\beta = 0.1$ or 0.2, i.e., $1-\beta = 0.9$ or 0.8, is usually selected.
The confidence interval approach can also be used for similarity evaluation. If the confidence interval of $\mu_1 - \mu_2$ lies within $(-\Delta_0, \Delta_0)$, similarity can be concluded. However, for a size-$\alpha$ similarity test, the conventional confidence interval has a $100(1-2\alpha)\%$ rather than a $100(1-\alpha)\%$ confidence level. Some authors, e.g., Berger and Hsu (1996), pointed out that the misconception that size-$\alpha$ similarity tests generally correspond to $100(1-2\alpha)\%$ confidence intervals leads to incorrect statistical practices and concluded that the use of $100(1-2\alpha)\%$ confidence intervals to evaluate similarity should be abandoned. Techniques for constructing $100(1-\alpha)\%$ confidence intervals that correspond to size-$\alpha$ similarity tests were developed by Hsu et al. (1994) and Berger and Hsu (1996).
2. Similarity testing for hedonic or intensity rating means
2.1. Anderson and Hauck’s non-central t test
It is widely accepted in the sensory and consumer field that hedonic or intensity rating data on the 9-point scale can be regarded approximately as continuous data. Similarity testing for two hedonic or intensity rating means is often needed. The null and alternative hypotheses for the test are as in (1.1).
The procedure discussed in this section is that proposed by Anderson and Hauck (1983) and Hauck and Anderson (1984). This procedure can be used to evaluate the null hypothesis in (1.1) directly. In other words, for a small p-value we would reject the null hypothesis of difference between two hedonic or intensity rating means in favor of similarity.
For simplicity, it is assumed here that a completely randomized design with $n$ subjects per group is used. Extensions to a crossover design and other designs are straightforward. The test statistic is $T_{AH}$ in (2.1), which under the null hypothesis follows a non-central t distribution with non-centrality parameter as in (2.2) and $\nu$ degrees of freedom.
$$
T_{AH} = \frac{\bar{X}_1 - \bar{X}_2}{s\sqrt{2/n}} \tag{2.1}
$$
$$
\delta = \frac{\mu_1 - \mu_2}{\sigma\sqrt{2/n}} \tag{2.2}
$$
The non-centrality parameter $\delta$ can be estimated by (2.3).
$$
\hat{\delta} = \frac{\Delta_0}{s\sqrt{2/n}} \tag{2.3}
$$
where $\Delta_0$ is the specified maximum allowable difference between two true rating means for similarity and $s^2$ is the error variance calculated from the appropriate analysis of variance with $\nu$ degrees of freedom. For a completely randomized design, $s^2$ is the common variance of the two rating samples $X_1$ and $X_2$, $s^2 = (s_1^2 + s_2^2)/2$, with $\nu = 2(n-1)$ degrees of freedom, where $s_1^2$ and $s_2^2$ are the variances of $X_1$ and $X_2$. For other designs, e.g., a crossover design, $\nu = n - 2$.
The p-value is calculated from (2.4).
$$
p = F_\nu(|t_{AH}| - \hat{\delta}) - F_\nu(-|t_{AH}| - \hat{\delta}) \tag{2.4}
$$
where $F_\nu(\cdot)$ denotes the distribution function of the central t distribution with $\nu$ degrees of freedom and $t_{AH}$ denotes the observed value of the test statistic $T_{AH}$. The null hypothesis is rejected in favor of similarity if the observed p-value is less than the significance level $\alpha$.
Example 2.1. To determine whether consumers in two cities (A and B) have similar overall liking for a product, 100 panelists were selected in each city and a 9-point liking scale was used, with 9 = "Like extremely" and 1 = "Dislike extremely". The similarity limit $\Delta_0 = 0.5$ and significance level $\alpha = 0.1$ were selected. The observed overall liking means and variances for the two cities are $\bar{X}_A = 7.1$, $s_A^2 = 2.0$; $\bar{X}_B = 6.9$, $s_B^2 = 2.2$. Hence the estimate of the common variance is $s^2 = (2.0 + 2.2)/2 = 2.1$, i.e., $s = 1.45$.
The observed value of the test statistic is
$$
t_{AH} = \frac{7.1 - 6.9}{1.45 \times \sqrt{2/100}} = 0.976
$$
The estimated non-centrality parameter is
$$
\hat{\delta} = \frac{0.5}{1.45 \times \sqrt{2/100}} = 2.44
$$
The calculated p-value is then
$$
p = F_\nu(0.976 - 2.44) - F_\nu(-0.976 - 2.44) = F_\nu(-1.464) - F_\nu(-3.416) = 0.072
$$
where $F_\nu(-1.464) = 0.0724$ is the probability of the central t distribution with $\nu = 2 \times (100 - 1) = 198$ degrees of freedom from $-\infty$ to $-1.464$, and $F_\nu(-3.416) = 0.0004$ is the corresponding probability from $-\infty$ to $-3.416$. Because the p-value (0.072) is smaller than $\alpha = 0.1$, we can conclude at the 0.1 significance level that the overall liking for the product is similar between the two cities in terms of the similarity limit $\Delta_0 = 0.5$.
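The calculation in Example 2.1 can be sketched in Python using only the standard library. This is an illustrative sketch, not the author's program: the central t distribution function is approximated by Simpson's rule (an assumption made here to avoid external packages), and the function names are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt


def t_cdf(x, nu, steps=1000):
    """Central t distribution function, by Simpson's rule on [0, |x|] plus symmetry."""
    log_c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)

    def pdf(u):
        return exp(log_c - (nu + 1) / 2 * log(1 + u * u / nu))

    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def anderson_hauck(mean1, mean2, s2, n, delta0):
    """Statistic (2.1), estimate (2.3) and p-value (2.4) for a completely
    randomized design with n subjects per group and pooled variance s2."""
    nu = 2 * (n - 1)
    se = sqrt(s2) * sqrt(2 / n)
    t_ah = (mean1 - mean2) / se
    d_hat = delta0 / se
    p = t_cdf(abs(t_ah) - d_hat, nu) - t_cdf(-abs(t_ah) - d_hat, nu)
    return t_ah, d_hat, p


# Data of Example 2.1
t_ah, d_hat, p = anderson_hauck(7.1, 6.9, 2.1, 100, 0.5)
print(round(t_ah, 3), round(d_hat, 2), round(p, 3))
```

Similarity is concluded when the returned p-value is below the chosen $\alpha$.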
2.2. Testing power and sample size
The testing power ($1-\beta$), or equivalently the Type II error $\beta$, of Anderson and Hauck’s non-central t test is a complicated function of $\alpha$, $n$, $\Delta_0$, $s^2$ and the true difference $\Delta_1$. For simplicity, we consider the power only for $\Delta_1 = 0$, i.e., the power when the true means of the two products being compared are the same. The Type II error, $\beta$, can be solved numerically from (2.5).
$$
F_\nu(C - \hat{\delta}) - F_\nu(-C - \hat{\delta}) - \alpha = 0 \tag{2.5}
$$
where $C$ is the $1 - \beta/2$ percentage point of the central t distribution with $\nu$ degrees of freedom and $\hat{\delta}^2 = n\Delta_0^2/(2s^2)$.
The sample size, $n$, needed for specified $\alpha$, $\beta$, $\Delta_0$, $s^2$ and $\Delta_1 = 0$ can also be estimated from (2.5). A computer program is needed to estimate testing power and sample size. An S-PLUS program is available from the author on request.
Table 1
Maximum number of correct responses for similarity testing using the 2-AFC and Duo-trio methods
       $\alpha = 0.05$, $p_{d0}$ =            $\alpha = 0.1$, $p_{d0}$ =
    n   0.1   0.2   0.3   0.4   0.5       0.1   0.2   0.3   0.4   0.5
    5     0     0     0     1     1         0     1     1     1     1
    6     0     1     1     1     2         1     1     1     2     2
    7     1     1     1     2     2         1     2     2     2     3
    8     1     2     2     2     3         2     2     2     3     3
    9     2     2     2     3     4         2     3     3     4     4
   10     2     2     3     4     4         2     3     4     4     5
   11     2     3     3     4     5         3     4     4     5     5
   12     3     3     4     5     5         3     4     5     5     6
   13     3     4     5     5     6         4     5     5     6     7
   14     4     4     5     6     7         4     5     6     7     7
   15     4     5     6     6     7         5     6     6     7     8
   16     5     5     6     7     8         5     6     7     8     9
   17     5     6     7     8     9         6     7     8     8     9
   18     5     6     7     8     9         6     7     8     9    10
   19     6     7     8     9    10         7     8     9    10    11
   20     6     7     8    10    11         7     8     9    10    11
   21     7     8     9    10    11         8     9    10    11    12
   22     7     8    10    11    12         8     9    10    12    13
   23     8     9    10    11    13         9    10    11    12    14
   24     8     9    11    12    13         9    10    12    13    14
   25     9    10    11    13    14        10    11    12    14    15
   26     9    10    12    13    15        10    11    13    14    16
   27    10    11    12    14    15        11    12    13    15    16
   28    10    12    13    15    16        11    12    14    15    17
   29    11    12    14    15    17        12    13    15    16    18
   30    11    13    14    16    17        12    14    15    17    18
   35    13    15    17    19    21        14    16    18    20    22
   40    16    18    20    22    24        17    19    21    23    25
   45    18    21    23    25    28        19    22    24    27    29
   50    21    23    26    29    31        22    25    27    30    33
   60    26    29    32    35    38        27    30    33    36    40
   70    31    34    38    42    45        32    36    39    43    47
   80    36    40    44    48    53        37    41    46    50    54
   90    41    45    50    55    60        42    47    52    56    61
  100    46    51    56    61    67        48    53    58    63    68

Example 2.2. For the setting of Example 2.1, taking $\alpha = 0.1$, $\Delta_0 = 0.5$, error variance $s^2 = 1.45$ and $n = 100$ (hence $\nu = 2 \times 100 - 2 = 198$ for a completely randomized design), an S-PLUS program based on (2.5) gives a testing power of about 0.9.
We can verify the result as follows: $C = t_{\nu, 1-\beta/2} = t_{198, 0.95} = 1.653$ and $\hat{\delta} = \sqrt{100 \times 0.5^2/(2 \times 1.45)} = 2.936$. According to (2.5), $F_{198}(1.653 - 2.936) - F_{198}(-1.653 - 2.936) - 0.1 = F_{198}(-1.283) - F_{198}(-4.589) - 0.1 \approx 0$. This means that for the specified $\alpha$, $\Delta_0$, $s^2$ and $n$ values, $\beta = 0.1$ (i.e., $1 - \beta = 0.9$) is an approximate solution of (2.5).
For $\alpha = 0.1$, $\Delta_0 = 0.5$ and $s^2 = 1.45$, the sample size needed to reach a testing power of 0.8 is about $n = 77$ for a completely randomized design, again using an S-PLUS program based on (2.5).
We can also verify this result: $C = t_{\nu, 1-\beta/2} = t_{152, 0.9} = 1.287$ and $\hat{\delta} = \sqrt{77 \times 0.5^2/(2 \times 1.45)} = 2.576$. According to (2.5), $F_{152}(1.287 - 2.576) - F_{152}(-1.287 - 2.576) - 0.1 = F_{152}(-1.289) - F_{152}(-3.863) - 0.1 \approx 0$. This means that for the specified $\alpha$, $\Delta_0$, $s^2$ and $\beta$ values, $n = 77$ is an approximate solution of (2.5).
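Equation (2.5) of Section 2.2 can be solved numerically without special software. The following Python sketch is an assumption-laden illustration, not the S-PLUS program mentioned in the text: it approximates the central t distribution function by Simpson's rule, finds $C$ by bisection, and converts it to power through $\beta = 2(1 - F_\nu(C))$. Function names are hypothetical, and the error variance $s^2 = 1.45$ is the value used in Example 2.2.

```python
from math import exp, lgamma, log, pi, sqrt


def t_cdf(x, nu, steps=1000):
    """Central t distribution function, by Simpson's rule on [0, |x|] plus symmetry."""
    log_c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)

    def pdf(u):
        return exp(log_c - (nu + 1) / 2 * log(1 + u * u / nu))

    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def ah_power(n, alpha, delta0, s2):
    """Power at a true difference of zero, from Eq. (2.5), for a completely
    randomized design with n subjects per group."""
    nu = 2 * (n - 1)
    d_hat = sqrt(n * delta0**2 / (2 * s2))

    def lhs(c):
        return t_cdf(c - d_hat, nu) - t_cdf(-c - d_hat, nu) - alpha

    lo, hi = 0.0, d_hat + 10.0  # lhs(lo) = -alpha < 0 and lhs(hi) > 0
    for _ in range(50):  # bisection; lhs is increasing in c
        mid = (lo + hi) / 2
        if lhs(mid) < 0:
            lo = mid
        else:
            hi = mid
    c = (lo + hi) / 2
    beta = 2 * (1 - t_cdf(c, nu))  # C is the 1 - beta/2 percentage point
    return 1 - beta


print(round(ah_power(100, 0.1, 0.5, 1.45), 2))
print(round(ah_power(77, 0.1, 0.5, 1.45), 2))
```

A sample size can be found from the same function by increasing $n$ until the returned power reaches the target.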
3. Similarity testing using forced choice methods
3.1. One-sided interval hypotheses testing
The forced choice methods conventionally used for discrimination testing can also be used for similarity testing, but a different statistical testing model is needed. First, we need to specify an allowed or ignorable difference in terms of the proportion or probability of "discriminators", $p_{d0}$. The probability of correct responses, $p_{c0}$, corresponding to $p_{d0}$ is then calculated as $p_{c0} = p_{d0} + p_0(1 - p_{d0})$, where $p_0$ is the guessing probability: $p_0 = 1/2$ for the 2-AFC and Duo-trio tests and $p_0 = 1/3$ for the 3-AFC and Triangular tests.
The null and alternative hypotheses of the similarity test are
$$
H_0: p_c \ge p_{c0} \qquad H_1: p_0 \le p_c < p_{c0}
$$
This is a one-sided test. The test statistic $c$ is the number of correct responses in a similarity test with sample size $n$. The critical number $c_0$ is the maximum value that satisfies (3.1).
$$
\sum_{x=0}^{c_0} \binom{n}{x} p_{c0}^{x} (1 - p_{c0})^{n-x} < \alpha \tag{3.1}
$$
If the observed number of correct responses $c$ is smaller than or equal to the critical number $c_0$, the null hypothesis is rejected and the alternative hypothesis is accepted at significance level $\alpha$, i.e., similarity can be concluded.
Tables 1 and 2 give the critical numbers for similarity testing using the 2-AFC, Duo-trio, 3-AFC and Triangular methods for $\alpha = 0.05$ and 0.1, $p_{d0} = 0.1$ to 0.5 in steps of 0.1, and sample sizes $n = 5$ to 100.
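The critical number defined by (3.1) can be computed directly from the binomial distribution. The Python sketch below is a minimal illustration with hypothetical function names; it returns `None` when not even zero correct responses satisfies (3.1), a convention assumed here rather than taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def similarity_critical_number(n, pd0, p_guess, alpha):
    """Largest c0 with P(X <= c0 | pc0) < alpha, as in Eq. (3.1)."""
    pc0 = pd0 + p_guess * (1 - pd0)
    c0 = None
    for c in range(n + 1):
        if binom_cdf(c, n, pc0) < alpha:
            c0 = c
        else:
            break  # the cumulative probability only grows with c
    return c0


# 3-AFC, n = 100, pd0 = 0.2, alpha = 0.05
print(similarity_critical_number(100, 0.2, 1 / 3, 0.05))
```

Similarity is concluded when the observed number of correct responses does not exceed the returned value.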
Table 2
Maximum number of correct responses for similarity testing using the 3-AFC and Triangular methods
       $\alpha = 0.05$, $p_{d0}$ =            $\alpha = 0.1$, $p_{d0}$ =
    n   0.1   0.2   0.3   0.4   0.5       0.1   0.2   0.3   0.4   0.5
    5     0     0     0     0     1         0     0     0     1     1
    6     0     0     0     1     1         0     0     1     1     1
    7     0     0     1     1     2         0     1     1     2     2
    8     0     0     1     2     2         0     1     1     2     3
    9     0     1     1     2     3         1     1     2     3     3
   10     1     1     2     2     3         1     2     2     3     4
   11     1     1     2     3     4         1     2     3     4     4
   12     1     2     3     3     4         2     2     3     4     5
   13     1     2     3     4     5         2     3     4     5     5
   14     2     3     3     4     5         2     3     4     5     6
   15     2     3     4     5     6         3     4     5     6     7
   16     2     3     4     5     7         3     4     5     6     7
   17     3     4     5     6     7         3     4     5     7     8
   18     3     4     5     6     8         4     5     6     7     8
   19     3     4     6     7     8         4     5     6     8     9
   20     3     5     6     7     9         4     5     7     8    10
   21     4     5     6     8     9         5     6     7     9    10
   22     4     5     7     8    10         5     6     8     9    11
   23     4     6     7     9    11         5     7     8    10    11
   24     5     6     8     9    11         6     7     9    10    12
   25     5     7     8    10    12         6     7     9    11    13
   26     5     7     9    10    12         6     8    10    11    13
   27     6     7     9    11    13         7     8    10    12    14
   28     6     8    10    12    13         7     9    11    12    14
   29     6     8    10    12    14         7     9    11    13    15
   30     7     9    11    13    15         8    10    11    14    16
   35     8    11    13    15    18         9    12    14    16    19
   40    10    13    15    18    21        11    14    16    19    22
   45    12    15    17    21    24        13    16    19    22    25
   50    13    17    20    23    27        15    18    21    25    28
   60    17    21    25    29    33        18    22    26    30    34
   70    20    25    29    34    39        22    26    31    36    41
   80    24    29    34    40    45        25    31    36    41    47
   90    27    33    39    45    52        29    35    41    47    53
  100    31    37    44    51    58        33    39    46    53    60

Example 3.1. One hundred panelists take part in a similarity test using the 3-AFC method for the sweetness of two product brands. The allowed proportion of "discriminators" is selected as $p_{d0} = 0.2$ and the significance level is $\alpha = 0.05$. The observed number of correct responses in the test is 35.
From Table 2 we find $c_0 = 37$, which is the maximum value for which $\sum_{x=0}^{37} \binom{100}{x} 0.4667^{x} (1 - 0.4667)^{100-x} < 0.05$, where $p_{c0} = 0.2 + 1/3 \times (1 - 0.2) = 0.4667$, according to (3.1). Because the observed number of correct responses (35) is smaller than the critical value (37), we can conclude that the two brands are similar in sweetness. In other words, we can claim that there is no detectable difference in sweetness between the two brands at significance level $\alpha = 0.05$ in terms of $p_{d0} = 0.2$.
3.2. Testing power and sample size
The power of the similarity test is the probability of concluding similarity when the true proportion of "discriminators" is smaller than the specified allowed or ignorable proportion, i.e., $p_{d1} < p_{d0}$; in other words, when the corresponding true probability of correct responses $p_{c1}$ is smaller than $p_{c0}$. This probability is given by (3.2).
$$
\text{Power} = 1 - \beta = \sum_{x=0}^{c_0} \binom{n}{x} p_{c1}^{x} (1 - p_{c1})^{n-x} \tag{3.2}
$$
where $p_{c1} = p_{d1} + p_0(1 - p_{d1})$.
Testing power depends on $\alpha$, $p_{d0}$, $p_{d1}$, $n$ and $p_0$. For a specified forced choice method, the larger the values of $\alpha$, $p_{d0}$ and $n$, the larger the testing power; the larger $p_{d1}$, the smaller the testing power. If $\alpha$, $p_{d0}$, $n$ and $p_0$ are fixed, the maximum testing power is reached at $p_{d1} = 0$ and the minimum at $p_{d1} = p_{d0}$.
For specified $\alpha$, $\beta$, $p_{d0}$ and $p_{d1}$, the sample size $n$ can be calculated numerically from (3.1) and (3.2). Tables 3 and 4 give the sample sizes needed to reach a power of about 0.8 in similarity testing using the 2-AFC, Duo-trio, 3-AFC and Triangular methods.
Example 3.2. For the test in Example 3.1, we can now estimate the testing power for an assumed true proportion of "discriminators". If the assumed true proportion is $p_{d1} = 0.05$, i.e., $p_{c1} = p_{d1} + p_0(1 - p_{d1}) = 0.05 + \frac{1}{3} \times (1 - 0.05) = 0.367$, then according to (3.2) the testing power is
$$
\text{Power} = 1 - \beta = \sum_{x=0}^{37} \binom{100}{x} 0.367^{x} (1 - 0.367)^{100-x} = 0.57
$$
However, if the assumed true proportion of "discriminators" is $p_{d1} = 0.1$, i.e., $p_{c1} = p_{d1} + p_0(1 - p_{d1}) = 0.1 + \frac{1}{3} \times (1 - 0.1) = 0.4$, the testing power is only
$$
\text{Power} = 1 - \beta = \sum_{x=0}^{37} \binom{100}{x} 0.4^{x} (1 - 0.4)^{100-x} = 0.31
$$
To reach a testing power of 0.8 using the 3-AFC method with $p_{d0} = 0.2$, $p_{d1} = 0.05$ and $\alpha = 0.05$, Table 4 shows that at least 160 panelists are needed.
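The power values in Example 3.2 follow from (3.2) as a cumulative binomial probability evaluated at the critical number. A minimal Python sketch, with hypothetical function names and the critical number $c_0 = 37$ taken from Example 3.1 rather than recomputed:

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def similarity_power(n, c0, pd1, p_guess):
    """Eq. (3.2): probability of observing at most c0 correct responses
    when the true proportion of discriminators is pd1."""
    pc1 = pd1 + p_guess * (1 - pd1)
    return binom_cdf(c0, n, pc1)


# 3-AFC, n = 100, c0 = 37
for pd1 in (0.05, 0.10):
    print(pd1, round(similarity_power(100, 37, pd1, 1 / 3), 2))
```

Combining this function with a search over $n$, recomputing $c_0$ from (3.1) at each step, is one way to obtain sample sizes of the kind tabulated in Tables 3 and 4.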
3.3. Comparison between one-sided interval hypotheses testing and one-sided confidence interval for forced choice methods
In the sensory and consumer field, the confidence interval method is often used to evaluate whether the proportion of "discriminators", $p_d$, is less than the specified allowed proportion, $p_{d0}$, or equivalently, whether the proportion of correct responses, $p_c$, is less than the allowed proportion, $p_{c0}$ (ASTM, 1997; ASTM, 2001). The upper confidence limit is usually calculated as $\hat{p} + z_{1-\alpha}\sqrt{\hat{p}(1-\hat{p})/n}$, where $\hat{p} = c/n$, and is compared with $p_{c0}$.
Table 3
Sample sizes needed to reach a power of 0.8 in similarity testing using the 2-AFC and Duo-trio methods
       $\alpha = 0.05$, $p_{d0}$ =            $\alpha = 0.10$, $p_{d0}$ =
$p_{d1}$  0.1   0.2   0.3   0.4   0.5       0.1   0.2   0.3   0.4   0.5
 0.00   646   160    74    43    25       469   112    57    32    19
 0.05  2464   277   103    52    31      1820   207    72    40    24
 0.10     -   624   160    69    36         -   459   113    57    31
 0.15     -  2443   267   100    51         -  1769   206    73    38
 0.20     -     -   602   151    65         -     -   438   115    48
 0.25     -     -  2312   267    89         -     -  1740   187    66
 0.30     -     -     -   551   141         -     -     -   410   102
 0.35     -     -     -  2150   236         -     -     -  1585   175
 0.40     -     -     -     -   511         -     -     -     -   380
 0.45     -     -     -     -  1957         -     -     -     -  1435

Table 4
Sample sizes needed to reach a power of 0.8 in similarity testing using the 3-AFC and Triangular methods
       $\alpha = 0.05$, $p_{d0}$ =            $\alpha = 0.10$, $p_{d0}$ =
$p_{d1}$  0.1   0.2   0.3   0.4   0.5       0.1   0.2   0.3   0.4   0.5
 0.00   340    89    41    26    15       256    64    34    18    13
 0.05  1355   160    61    31    19       990   122    43    23    14
 0.10     -   357    89    44    22         -   262    73    30    17
 0.15     -  1407   166    65    31         -  1039   122    41    24
 0.20     -     -   367    93    42         -     -   269    67    32
 0.25     -     -  1425   160    60         -     -  1058   112    45
 0.30     -     -     -   352    89         -     -     -   266    66
 0.35     -     -     -  1368   154         -     -     -  1027   118
 0.40     -     -     -     -   333         -     -     -     -   250
 0.45     -     -     -     -  1304         -     -     -     -   927

Although interval hypotheses testing is connected with the confidence interval, the two approaches are not exactly the same in their decision rules and numerical results. The decision rules for the two approaches are (3.3) and (3.4), respectively.
$$
c \le c_0 \tag{3.3}
$$
$$
\hat{p} + z_{1-\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p_{c0} \tag{3.4}
$$
The decision rule in (3.3) is equivalent to $\hat{p} \le p_{c0} - k$, i.e.,
$$
\hat{p} + k \le p_{c0} \tag{3.5}
$$
where $k = p_{c0} - c_0/n$.
Comparing the two decision rules in (3.5) and (3.4), we note that $k$ in (3.5) is a constant obtained from an exact binomial distribution, while $z_{1-\alpha}\sqrt{\hat{p}(1-\hat{p})/n}$ in (3.4) is a random variable estimated on the basis of an approximate normal distribution.
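The two decision rules can give different answers near the critical number. The Python sketch below is an illustration under stated assumptions: the exact rule rejects when $c \le c_0$ as described in Section 3.1, the critical number $c_0 = 37$ is taken from Example 3.1 (3-AFC, $n = 100$, $p_{d0} = 0.2$, $\alpha = 0.05$), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def exact_rule(c, c0):
    """Interval hypotheses test: conclude similarity when c <= c0."""
    return c <= c0


def confidence_limit_rule(c, n, pc0, alpha):
    """Confidence interval method: conclude similarity when the upper
    limit p_hat + z * sqrt(p_hat * (1 - p_hat) / n) is below pc0."""
    p_hat = c / n
    z = NormalDist().inv_cdf(1 - alpha)
    upper = p_hat + z * sqrt(p_hat * (1 - p_hat) / n)
    return upper < pc0


n, alpha, c0 = 100, 0.05, 37
pc0 = 0.2 + (1 / 3) * (1 - 0.2)
for c in (37, 38, 39):
    print(c, exact_rule(c, c0), confidence_limit_rule(c, n, pc0, alpha))
```

In this setting the two rules agree for 37 and 39 correct responses but disagree for 38, where only the normal-approximation limit falls below $p_{c0}$.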
4. Similarity testing using the A–Not A and the Same–Different methods
In this section, we discuss similarity testing using the monadic A–Not A and Same–Different methods, which are discrimination methods with response bias. The methods in this design involve a comparison between two independent proportions. Similarity testing for dependent proportions in paired-design A–Not A and Same–Different methods is not discussed in this article. For techniques of similarity assessment for dependent proportions in paired designs, see, e.g., Lu and Bean (1995), Nam (1997), Tango (1998), Liu, Hsueh, Hsueh, and Chen (2002) and Tang, Tang, and Chan (2003).
4.1. Dunnett and Gent’s chi-squared test
Dunnett and Gent (1977) suggested a chi-squared test for similarity based on the data in a 2 × 2 table. Let $p_A$ and $p_N$ denote the probabilities of response "A" for sample A and for sample Not A, respectively. The null and alternative hypotheses are (4.1) and (4.2).
$$
H_0: p_A - p_N = \Delta_0 \tag{4.1}
$$
$$
H_1: 0 \le p_A - p_N < \Delta_0 \tag{4.2}
$$
where $\Delta_0$ is an allowable non-zero value defining equivalence or similarity.
It is necessary to calculate the expected proportions of response "A" for sample A and sample Not A, assuming a non-zero value for the true difference of the proportions, $p_A - p_N = \Delta_0$, under the null hypothesis. The expected proportions, denoted here by $\tilde{p}_A$ and $\tilde{p}_N$ to distinguish them from the observed proportions $\hat{p}_A = x/n_A$ and $\hat{p}_N = y/n_N$, are estimated from (4.3) and (4.4).
$$
\tilde{p}_A = \frac{x + y + n_N\Delta_0}{n_A + n_N} \tag{4.3}
$$
$$
\tilde{p}_N = \frac{x + y - n_A\Delta_0}{n_A + n_N} \tag{4.4}
$$
where $x$ and $y$ are the observed numbers of response "A" for sample A and sample Not A, respectively, and $n_A$ and $n_N$ are the sample sizes for sample A and sample Not A. The expected number of response "A" for sample A is then $x_0 = n_A\tilde{p}_A$.
Under the null hypothesis in (4.1), the test statistic is (4.5).
$$
X^2 = (x - x_0)^2 \left[\frac{1}{x_0} + \frac{1}{m - x_0} + \frac{1}{n_A - x_0} + \frac{1}{n_N - m + x_0}\right] \tag{4.5}
$$
where $m = x + y$. With continuity correction, (4.5) becomes (4.6).
$$
X^2 = (|x - x_0| - 0.5)^2 \left[\frac{1}{x_0} + \frac{1}{m - x_0} + \frac{1}{n_A - x_0} + \frac{1}{n_N - m + x_0}\right] \tag{4.6}
$$
The test statistic, $X^2$, follows the chi-square distribution with one degree of freedom. Because it is assumed that the proportion of response "A" for sample A is not smaller than that for sample Not A, the test is one-sided. The p-value is obtained by dividing the tail area of the chi-square distribution by 2.
An alternative test statistic is (4.7), which approximately follows the standard normal distribution under the null hypothesis. We can reject the null hypothesis in (4.1) and accept the alternative hypothesis in (4.2) at significance level $\alpha$ if the value of the statistic is smaller than the $\alpha$ quantile of the standard normal distribution, i.e., $Z < z_\alpha$, where $z_{0.05} = -1.64$ and $z_{0.1} = -1.28$. The p-value is the probability that a standard normal variable is smaller than the observed value of $Z$.
$$
Z = \frac{\hat{p}_A - \hat{p}_N - \Delta_0}{\sqrt{\hat{V}(\hat{p}_A - \hat{p}_N)}} \tag{4.7}
$$
where $\hat{V}(\hat{p}_A - \hat{p}_N)$ is the estimated variance of $\hat{p}_A - \hat{p}_N$ under the null hypothesis. With continuity correction, (4.7) becomes (4.8).
$$
Z = \frac{\hat{p}_A - \hat{p}_N - \Delta_0 + n_0}{\sqrt{\hat{V}(\hat{p}_A - \hat{p}_N)}} \tag{4.8}
$$
where $n_0 = (1/n_A + 1/n_N)/2$.
There are different methods for estimating the variance. One method is to use the expected proportions, $\tilde{p}_A$ and $\tilde{p}_N$ in (4.3) and (4.4), rather than the observed proportions, $\hat{p}_A$ and $\hat{p}_N$ (see, e.g., Roadary, Com-Nougue, & Tournade, 1989). The estimated variance using the expected proportions is (4.9).
$$
\hat{V}(\hat{p}_A - \hat{p}_N) = \frac{\tilde{p}_A(1 - \tilde{p}_A)}{n_A} + \frac{\tilde{p}_N(1 - \tilde{p}_N)}{n_N} \tag{4.9}
$$
It should be noted that if the difference between the two observed proportions is larger than the allowable difference, i.e., $\hat{p}_A - \hat{p}_N > \Delta_0$, the similarity test should stop, because similarity cannot be concluded at any meaningful significance level $\alpha$ in this situation.
It can be shown algebraically that in difference testing using the A–Not A method, the chi-square test with one degree of freedom is exactly equivalent to a Z test for comparison of two independent proportions based on a normal approximation, provided that a pooled proportion is used to estimate the common variance under the null hypothesis (Snedecor & Cochran, 1989). However, these two tests are no longer exactly equivalent in similarity testing because a different variance estimator is used in the Z test statistic.
Example 4.1. To check whether a product with substituted ingredients (sample Not A) has sensory characteristics similar to those of the current product (sample A), a similarity test for the two products was conducted using a monadic A–Not A method. Two hundred panelists received sample A and 200 received sample Not A, i.e., $n_A = n_N = 200$. The allowable limit defining similarity is selected as $\Delta_0 = 0.1$. This means that we regard the two products as similar if the difference between the proportions of response "A" for sample A and sample Not A is not larger than 0.1.
The observed numbers of response "A" for sample A and sample Not A are $x = 45$ and $y = 39$, respectively. Hence $m = 45 + 39 = 84$. According to (4.3), the expected proportion of response "A" for sample A is $\tilde{p}_A = \frac{x + y + n_N\Delta_0}{n_A + n_N} = \frac{45 + 39 + 200 \times 0.1}{200 + 200} = 0.26$. Hence the expected number is $x_0 = 200 \times 0.26 = 52$. The value of the test statistic in (4.5) is then
$$
X^2 = (45 - 52)^2 \left[\frac{1}{52} + \frac{1}{84 - 52} + \frac{1}{200 - 52} + \frac{1}{200 - 84 + 52}\right] = 3.096
$$
The value 3.096 is the 0.922 quantile of the chi-square distribution with one degree of freedom, so the tail area is 0.078. The p-value of the one-sided test is then 0.078/2 = 0.039. We can conclude at the 0.05 significance level that the two products are similar in terms of the 0.1 allowable limit defining equivalence.
If the normal approximation statistic in (4.7) is used, we first calculate the expected proportions under the null hypothesis.
$$
\tilde{p}_A = \frac{x + y + n_N\Delta_0}{n_A + n_N} = 0.26
$$
$$
\tilde{p}_N = \frac{x + y - n_A\Delta_0}{n_A + n_N} = \frac{45 + 39 - 200 \times 0.1}{200 + 200} = 0.16
$$
According to (4.9), the estimated variance of $\hat{p}_A - \hat{p}_N$ under the null hypothesis is then
$$
\hat{V}(\hat{p}_A - \hat{p}_N) = \frac{0.26 \times (1 - 0.26)}{200} + \frac{0.16 \times (1 - 0.16)}{200} = 0.00163
$$
The value of the test statistic (4.7) is
$$
Z = \frac{45/200 - 39/200 - 0.1}{\sqrt{0.00163}} = -1.73
$$
with associated p-value 0.042. This shows good agreement between the chi-square approach and the normal approximation approach.
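Both calculations in Example 4.1 can be reproduced with the Python standard library. The sketch below is illustrative only: function and variable names are hypothetical, the uncorrected statistics (4.5) and (4.7) are used, and the chi-square tail area for one degree of freedom is obtained from the complementary error function.

```python
from math import erfc, sqrt
from statistics import NormalDist


def dunnett_gent_chi2(x, y, n_a, n_n, delta0):
    """Statistic (4.5) and its one-sided p-value (half the chi-square tail)."""
    m = x + y
    p_a_exp = (m + n_n * delta0) / (n_a + n_n)  # expected proportion, Eq. (4.3)
    x0 = n_a * p_a_exp
    x2 = (x - x0) ** 2 * (
        1 / x0 + 1 / (m - x0) + 1 / (n_a - x0) + 1 / (n_n - m + x0)
    )
    tail = erfc(sqrt(x2 / 2))  # P(chi-square with 1 df > x2)
    return x2, tail / 2


def similarity_z(x, y, n_a, n_n, delta0):
    """Statistic (4.7) with variance (4.9), and its lower-tail p-value."""
    m = x + y
    p_a_exp = (m + n_n * delta0) / (n_a + n_n)  # Eq. (4.3)
    p_n_exp = (m - n_a * delta0) / (n_a + n_n)  # Eq. (4.4)
    var = p_a_exp * (1 - p_a_exp) / n_a + p_n_exp * (1 - p_n_exp) / n_n
    z = (x / n_a - y / n_n - delta0) / sqrt(var)
    return z, NormalDist().cdf(z)


# Data of Example 4.1
print(dunnett_gent_chi2(45, 39, 200, 200, 0.1))
print(similarity_z(45, 39, 200, 200, 0.1))
```

Halving the chi-square tail is meaningful only when the observed difference $x/n_A - y/n_N$ does not exceed $\Delta_0$; otherwise the test stops, as noted in Section 4.1.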
4.2. Testing power and sample size
The power for similarity testing using the A–Not A
and the Same–Different methods is the probability of
concluding similarity when the true difference ðD1Þ of theproportions of response ‘‘A’’ for sample A and Not A is
smaller than a specified similarity limit ðD0Þ under thealternative hypothesis. It is
Power ¼ 1� b ¼ Pp̂A � p̂N � D0ffiffiffiffiffi
V0p
< zajH1
!ð4:10Þ
where V0 denotes the variance of p̂A � p̂N under the nullhypothesis. Eq. (4.10) is equivalent to (4.11).
1� b ¼ Pp̂A � p̂N � D1ffiffiffiffiffi
V1p
<
zaffiffiffiffiffiV0
pþ ðD0 � D1Þffiffiffiffiffi
V1p
����H1
!ð4:11Þ
where $V_1$ denotes the variance of $\hat{p}_A - \hat{p}_N$ under the alternative hypothesis. Because $(\hat{p}_A - \hat{p}_N - \Delta_1)/\sqrt{V_1}$ is an approximate standard normal statistic under the alternative hypothesis, the testing power can be calculated from (4.12).

$$\text{Power} = 1 - \beta = P\left( Z < \frac{z_\alpha \sqrt{V_0} + (\Delta_0 - \Delta_1)}{\sqrt{V_1}} \right) \tag{4.12}$$
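The variances of $\hat{p}_A - \hat{p}_N$ under the null and the alternative hypotheses are (4.13) and (4.14), respectively.

$$V_0 = \frac{p_N (1 - p_N)}{n_N} + \frac{(p_N + \Delta_0)(1 - p_N - \Delta_0)}{n_A} \tag{4.13}$$

$$V_1 = \frac{p_N (1 - p_N)}{n_N} + \frac{(p_N + \Delta_1)(1 - p_N - \Delta_1)}{n_A} \tag{4.14}$$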
From (4.12), we can see that in order to calculate a testing power, the values of six characteristics, $\alpha$, $\Delta_0$, $\Delta_1$, $p_N$, $n_A$ and $n_N$, should be given or assumed. The larger the values of $\alpha$, $\Delta_0$, $n_A$ and $n_N$ are, the larger the testing power is. On the other hand, the smaller the values of $\Delta_1$ and $p_N$ are, the larger the testing power is. A small $p_N$ means that panelists have a small probability of responding "A" to the Not A sample. A small $\Delta_1$ value means that the difference between the two true probabilities $p_A$ and $p_N$ under the alternative hypothesis is small. Testing power is the complement of the Type II error $\beta$. The Type II error $\beta$ is the probability of failing to reject the null hypothesis of inequivalence when the two true probabilities $p_N$ and $p_A$ are in fact similar.

From (4.12), the sample size formula (4.15) can be derived.
$$n_N = \left[ \frac{z_{1-\beta} \sqrt{V'_1} + z_{1-\alpha} \sqrt{V'_0}}{\Delta_0 - \Delta_1} \right]^2 \tag{4.15}$$

where

$$V'_0 = p_N (1 - p_N) + (p_N + \Delta_0)(1 - p_N - \Delta_0)/h$$

$$V'_1 = p_N (1 - p_N) + (p_N + \Delta_1)(1 - p_N - \Delta_1)/h$$

$$h = n_A / n_N$$
The ratio of the sample sizes of sample A and sample Not A, i.e., $h$, should be predetermined. The same sample size for sample A and sample Not A, i.e., $h = 1$, is often adopted.
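The step from (4.12) to (4.15) can be sketched as follows. The derivation assumes that the target power $1 - \beta$ is reached when the argument of the normal distribution function in (4.12) equals $z_{1-\beta}$, and it uses $n_A = h\,n_N$, so that $V_0 = V'_0 / n_N$ and $V_1 = V'_1 / n_N$, together with $z_\alpha = -z_{1-\alpha}$. It is offered as a reconstruction under these assumptions.

$$
\begin{aligned}
z_\alpha \sqrt{V_0} + (\Delta_0 - \Delta_1) &= z_{1-\beta} \sqrt{V_1} \\
\Delta_0 - \Delta_1 &= \frac{z_{1-\beta} \sqrt{V'_1} + z_{1-\alpha} \sqrt{V'_0}}{\sqrt{n_N}} \\
n_N &= \left[ \frac{z_{1-\beta} \sqrt{V'_1} + z_{1-\alpha} \sqrt{V'_0}}{\Delta_0 - \Delta_1} \right]^2
\end{aligned}
$$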
Example 4.2. In Example 4.1, $n_A = n_N = 200$. If $\alpha = 0.1$ and $\Delta_0 = 0.2$ are selected, and $p_N = 0.2$ and $\Delta_1 = 0.1$ are assumed, the testing power can be calculated. The variances of $\hat{p}_A - \hat{p}_N$ under the null and the alternative hypotheses are

$$V_0 = \frac{0.2 \times (1 - 0.2)}{200} + \frac{(0.2 + 0.2)(1 - 0.2 - 0.2)}{200} = 0.002$$

$$V_1 = \frac{0.2 \times (1 - 0.2)}{200} + \frac{(0.2 + 0.1)(1 - 0.2 - 0.1)}{200} = 0.00185$$
According to (4.12), the power should be

$$1 - \beta = P\left( Z < \frac{-1.28 \times \sqrt{0.002} + (0.2 - 0.1)}{\sqrt{0.00185}} \right) = P(Z < 0.992) = 0.84$$
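The power calculation of Example 4.2 can be sketched in code. The function below is a minimal illustration of (4.12) to (4.14), assuming Python 3.8 or later for `statistics.NormalDist`; the name `similarity_power` and its parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def similarity_power(alpha, delta0, delta1, p_n, n_a, n_n):
    """Power of the A-Not A similarity test according to (4.12)."""
    # Variances of the difference of proportions under H0 (4.13) and H1 (4.14).
    base = p_n * (1 - p_n) / n_n
    v0 = base + (p_n + delta0) * (1 - p_n - delta0) / n_a
    v1 = base + (p_n + delta1) * (1 - p_n - delta1) / n_a
    z_alpha = NormalDist().inv_cdf(alpha)  # negative for alpha < 0.5
    bound = (z_alpha * math.sqrt(v0) + (delta0 - delta1)) / math.sqrt(v1)
    return NormalDist().cdf(bound)


power = similarity_power(alpha=0.1, delta0=0.2, delta1=0.1, p_n=0.2,
                         n_a=200, n_n=200)
print(f"power = {power:.2f}")
```

Using the exact normal quantile instead of the rounded value 1.28 changes the result only in the third decimal place.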
For the same situation, $\alpha = 0.1$, $\Delta_0 = 0.2$, $p_N = 0.2$ and $\Delta_1 = 0.1$, the sample size needed to reach a power of 0.8 can be calculated from (4.15). Because

$$V'_0 = 0.2 \times (1 - 0.2) + (0.2 + 0.2)(1 - 0.2 - 0.2)/1 = 0.4$$

$$V'_1 = 0.2 \times (1 - 0.2) + (0.2 + 0.1)(1 - 0.2 - 0.1)/1 = 0.37$$
the sample size for sample A and sample Not A should be

$$n_A = n_N = \left[ \frac{0.84 \times \sqrt{0.37} + 1.28 \times \sqrt{0.4}}{0.2 - 0.1} \right]^2 \approx 175$$
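The sample size formula (4.15) can be sketched in the same way. The function below is illustrative: its name `similarity_sample_size` and its parameters are not from the article, and rounding up to the next integer is an assumption about how the result is meant to be used.

```python
import math
from statistics import NormalDist


def similarity_sample_size(alpha, beta, delta0, delta1, p_n, h=1.0):
    """Sample size n_N for sample Not A according to (4.15); n_A = h * n_N."""
    z = NormalDist().inv_cdf
    # Variance components that do not depend on the sample size.
    v0_prime = p_n * (1 - p_n) + (p_n + delta0) * (1 - p_n - delta0) / h
    v1_prime = p_n * (1 - p_n) + (p_n + delta1) * (1 - p_n - delta1) / h
    numerator = z(1 - beta) * math.sqrt(v1_prime) + z(1 - alpha) * math.sqrt(v0_prime)
    n_n = (numerator / (delta0 - delta1)) ** 2
    return math.ceil(n_n)  # round up to a whole number of presentations


n = similarity_sample_size(alpha=0.1, beta=0.2, delta0=0.2, delta1=0.1, p_n=0.2)
print(n)
```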
5. Similarity testing for preference proportions
In order to do the tests, we first estimate the proportions of preferences, $\hat{p}_a$ and $\hat{p}_b$, as well as the covariance of $\hat{p}_a$ and $\hat{p}_b$, where $\hat{p}_a$ and $\hat{p}_b$ are the estimates of the proportions preferring product A and B, with $\hat{p}_a + \hat{p}_b \le 1$. There are different models for estimating preference proportions and their variances and covariance; see, e.g., Ferris (1958), Bliss (1960), Gridgeman (1960), Horsnell (1969, 1977), Wierenga (1974), Hutchinson (1979) and Ennis & Bi (1999). A detailed discussion of the estimation problem for preference proportions is beyond the scope of this article. One point that should be noted is that estimating consumer preference proportions in a preference test is quite different from, and more complicated than, estimating the proportion of "discriminators" in a discrimination test, because two independent parameters rather than one are involved in a preference test. There does not seem to be a simple non-replicated procedure for estimating preference proportions. A well-known procedure is Ferris's k-visit method, which uses maximum likelihood estimation (Ferris, 1958). In this method, a "no preference" option is allowed. For the 2-visit method, each consumer panelist is either visited twice or asked to judge the same pair of products A and B twice. The $N$ panelists can then be classified into nine different categories according to their responses in the two judgements. Based on these data, the preference proportions and their covariance matrix can be estimated.
5.1. Schuirmann’s two one-sided test
The objective is to test whether the difference between the proportions preferring product A and B, $|p_a - p_b|$, is smaller than a specified allowed value, $\Delta_0$. This test involves two sets of one-sided hypotheses:

$$H_{01}: p_a - p_b \le -\Delta_0 \quad \text{versus} \quad H_{11}: p_a - p_b > -\Delta_0 \tag{5.1}$$

and

$$H_{02}: p_a - p_b \ge \Delta_0 \quad \text{versus} \quad H_{12}: p_a - p_b < \Delta_0 \tag{5.2}$$
The first set of hypotheses in (5.1) tests for non-inferiority of product A to product B. The second set of hypotheses in (5.2) tests for non-superiority of product A to product B. We can declare the two products similar in preference if and only if both $H_{01}$ and $H_{02}$ are rejected at a significance level $\alpha$.

The test statistics are (5.3) and (5.4), which approximately follow a standard normal distribution.
$$Z_1 = \frac{(\hat{p}_a - \hat{p}_b) + \Delta_0}{\hat{\sigma}} \tag{5.3}$$

$$Z_2 = \frac{(\hat{p}_a - \hat{p}_b) - \Delta_0}{\hat{\sigma}} \tag{5.4}$$

where $\hat{\sigma} = \sqrt{V(\hat{p}_a) + V(\hat{p}_b) - 2\,\mathrm{Cov}(\hat{p}_a, \hat{p}_b)}$. We can conclude that the two proportions of preferences for products A and B in a specified consumer population are equivalent if $Z_1 > z_{1-\alpha}$ and $Z_2 < z_\alpha$.
Example 5.1. We use the results in Ferris (1958). In this example, a "no preference" option is allowed in the preference testing. The estimated values are $\hat{p}_a = 0.4968$, $\hat{p}_b = 0.3702$, $V(\hat{p}_a) = 0.000296$, $V(\hat{p}_b) = 0.000277$ and $\mathrm{Cov}(\hat{p}_a, \hat{p}_b) = -0.000198$. If $\Delta_0 = 0.2$ and $\alpha = 0.05$ are specified, the values of the test statistics in (5.3) and (5.4) can be obtained.

$$\hat{\sigma} = \sqrt{0.000296 + 0.000277 + 2 \times 0.000198} = 0.0311$$

$$Z_1 = \frac{(0.4968 - 0.3702) + 0.2}{0.0311} = 10.5$$

and

$$Z_2 = \frac{(0.4968 - 0.3702) - 0.2}{0.0311} = -2.36$$

Because $Z_1 > z_{0.95} = 1.64$ and $Z_2 < z_{0.05} = -1.64$, we can conclude that the two products are similar in preference.
If we specified $\Delta_0 = 0.1$ rather than $\Delta_0 = 0.2$, then

$$Z_1 = \frac{(0.4968 - 0.3702) + 0.1}{0.0311} = 7.29 \quad \text{and} \quad Z_2 = \frac{(0.4968 - 0.3702) - 0.1}{0.0311} = 0.86$$

Because $Z_1 > z_{0.95} = 1.64$ but $Z_2 > z_{0.05} = -1.64$, we cannot conclude that the two products are similar in preference. We can, however, claim that product A is non-inferior to product B at a 0.05 significance level. Further information is needed to determine whether product A is similar or superior to product B.
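The two one-sided tests of Example 5.1 can be sketched in code as follows. This is an illustration of (5.3) and (5.4) only: the function name `tost_preference` and its arguments are hypothetical, the estimates are those quoted from Ferris (1958) in the example, and the critical value is taken from `statistics.NormalDist` (about 1.645) rather than the rounded 1.64.

```python
import math
from statistics import NormalDist


def tost_preference(p_a, p_b, var_a, var_b, cov_ab, delta0, alpha=0.05):
    """Return (Z1, Z2, similar) for the two one-sided tests (5.3) and (5.4)."""
    # Standard error of the difference of the two preference proportions.
    sigma = math.sqrt(var_a + var_b - 2 * cov_ab)
    diff = p_a - p_b
    z1 = (diff + delta0) / sigma  # tests non-inferiority of A to B
    z2 = (diff - delta0) / sigma  # tests non-superiority of A to B
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Similarity is concluded only if both null hypotheses are rejected.
    similar = (z1 > z_crit) and (z2 < -z_crit)
    return z1, z2, similar


estimates = dict(p_a=0.4968, p_b=0.3702, var_a=0.000296, var_b=0.000277,
                 cov_ab=-0.000198)
for limit in (0.2, 0.1):
    z1, z2, similar = tost_preference(delta0=limit, **estimates)
    print(f"delta0 = {limit}: Z1 = {z1:.2f}, Z2 = {z2:.2f}, similar = {similar}")
```

With these inputs, similarity is concluded for the limit 0.2 but not for the limit 0.1, in line with the example.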
5.2. Testing power and sample size
Let $\hat{p} = \hat{p}_a - \hat{p}_b$. In the interval hypotheses testing for preference proportions based on Eqs. (5.3) and (5.4), the null hypothesis of difference will be rejected and the alternative hypothesis of similarity will be accepted at the $\alpha$ level of significance if $(\hat{p} + \Delta_0)/\hat{\sigma} > z_{1-\alpha}$ and $(\hat{p} - \Delta_0)/\hat{\sigma} < z_\alpha$. In other words, the rejection region is

$$-\Delta_0 + z_{1-\alpha} \hat{\sigma} < \hat{p} < \Delta_0 + z_\alpha \hat{\sigma} \tag{5.5}$$
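The testing power is the probability of correctly concluding similarity, that is, the probability that $\hat{p}$ falls into the rejection region. If the true difference of the two preference proportions is $p_0$, the testing power is then

$$\text{Power} = 1 - \beta = \Pr\{ -\Delta_0 + z_{1-\alpha} \hat{\sigma} - p_0 < \hat{p} - p_0 < \Delta_0 + z_\alpha \hat{\sigma} - p_0 \},$$

i.e.,

$$\text{Power} = 1 - \beta = \Pr\left\{ \frac{-\Delta_0 + z_{1-\alpha} \hat{\sigma} - p_0}{\hat{\sigma}} < \frac{\hat{p} - p_0}{\hat{\sigma}} < \frac{\Delta_0 + z_\alpha \hat{\sigma} - p_0}{\hat{\sigma}} \right\} \tag{5.6}$$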
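Because $(\hat{p} - p_0)/\hat{\sigma}$ asymptotically follows the standard normal distribution, the testing power is given by (5.7).

$$\text{Power} = 1 - \beta = \Phi(a) - \Phi(b) \tag{5.7}$$

where $a = \dfrac{\Delta_0 + z_\alpha \hat{\sigma} - p_0}{\hat{\sigma}}$, $b = \dfrac{-\Delta_0 + z_{1-\alpha} \hat{\sigma} - p_0}{\hat{\sigma}}$, and $\beta$ is the probability of failing to reject the null hypothesis of difference when it is false.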
Note that the magnitude of $\hat{\sigma}$ depends on the method used to estimate the preference proportions. In each of the methods, $\hat{\sigma}$ contains a component of sample size $N$. Hence $\hat{\sigma} = \hat{\sigma}_0 / \sqrt{N}$, where $\hat{\sigma}_0$ is a component of $\hat{\sigma}$ independent of sample size.
The testing power for non-inferiority is given by (5.8) and for non-superiority by (5.9).

$$\text{Power} = 1 - \beta = 1 - \Phi(b) \tag{5.8}$$

$$\text{Power} = 1 - \beta = \Phi(a) \tag{5.9}$$
For specified $\Delta_0$, $\alpha$, $\beta$ and assumed $p_0$ and $\hat{\sigma}_0$, we can estimate the effective sample size, $N$, from (5.10), which is derived from (5.6).

$$N \approx \frac{\hat{\sigma}_0^2 (z_{1-\beta} + z_{1-\alpha})^2}{(\Delta_0 - p_0)^2} \tag{5.10}$$
Example 5.2. For $\Delta_0 = 0.1$, $\alpha = 0.05$, $\hat{\sigma}_0 = 0.2$ and $N = 100$, if the true difference of preference proportions is assumed to be 0.05, then according to (5.7) the power of the similarity testing for preference can be calculated as follows:

$$a = \frac{0.1 - 1.645 \times 0.2/\sqrt{100} - 0.05}{0.2/\sqrt{100}} = 0.855$$

$$b = \frac{-0.1 + 1.645 \times 0.2/\sqrt{100} - 0.05}{0.2/\sqrt{100}} = -5.855$$
Hence $\text{Power} = 1 - \beta = \Phi(0.855) - \Phi(-5.855) = 0.80$.

On the other hand, for $\Delta_0 = 0.1$, $\alpha = 0.05$, $\hat{\sigma}_0 = 0.2$ and an assumed true difference of 0.05, we want to know the sample size, i.e., the number of panelists, needed to reach a power of 0.8 for a preference testing. From (5.10), we can estimate

$$N \approx \frac{0.2^2 \times (0.842 + 1.645)^2}{(0.1 - 0.05)^2} \approx 99$$
The value of r̂0 should be obtained from some prior
information or from a small pilot experiment.
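The power and sample size calculations of this section can be sketched together. The code below is a minimal illustration of (5.7) and (5.10), assuming $\hat{\sigma} = \hat{\sigma}_0/\sqrt{N}$ as stated above; the function names `preference_power` and `preference_sample_size` are hypothetical, and clipping the power at zero when the rejection region is empty is an added safeguard rather than part of the formulas.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def preference_power(delta0, alpha, sigma0, n, p0):
    """Power of the two one-sided tests for preference according to (5.7)."""
    sigma = sigma0 / math.sqrt(n)
    z = _STD_NORMAL.inv_cdf(1 - alpha)
    a = (delta0 - z * sigma - p0) / sigma   # upper bound, uses z_alpha = -z
    b = (-delta0 + z * sigma - p0) / sigma  # lower bound
    # If a < b the rejection region (5.5) is empty and the power is zero.
    return max(0.0, _STD_NORMAL.cdf(a) - _STD_NORMAL.cdf(b))


def preference_sample_size(delta0, alpha, beta, sigma0, p0):
    """Approximate number of panelists according to (5.10), rounded up."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = _STD_NORMAL.inv_cdf(1 - beta)
    n = sigma0 ** 2 * (z_beta + z_alpha) ** 2 / (delta0 - p0) ** 2
    return math.ceil(n)


print(round(preference_power(delta0=0.1, alpha=0.05, sigma0=0.2, n=100, p0=0.05), 2))
print(preference_sample_size(delta0=0.1, alpha=0.05, beta=0.2, sigma0=0.2, p0=0.05))
```

The two calls use the settings of Example 5.2, so their results can be compared directly with the power of 0.80 and the sample size of 99 obtained there.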
6. Conclusions
Similarity testing is an important methodology in sensory and consumer research. The "power approach", which is prevalent in the sensory and consumer field, is inadequate because it contorts the logic of hypothesis testing. One weakness of this approach is that, for a large sample size and a small measurement error, it is unlikely to lead to a conclusion of similarity even when the difference is slight and the products are effectively equivalent. As an alternative, interval hypotheses testing, which was originally developed for the assessment of bioequivalence in the clinical and pharmaceutical fields, is a valid approach for similarity testing. Based on this approach, some procedures for similarity testing are discussed and proposed for different situations in sensory and consumer research. The procedures include similarity testing for hedonic or intensity rating means, similarity testing using conventional discrimination methods and similarity testing for preference proportions. Tables of the maximum number of correct responses for the similarity tests using the 2-AFC, Duo-trio, 3-AFC and Triangular methods are provided in the paper. Tables of the sample sizes needed to reach a power of 0.8 in the similarity tests using the forced choice methods are also provided.
Acknowledgements
The author would like to thank the Editor and two
anonymous referees for their constructive comments on
the earlier version of the article.
References
Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. British Medical Journal, 311, 485.
Anderson, S., & Hauck, W. W. (1983). A new procedure for testing equivalence in comparative bioavailability and other clinical trials. Communications in Statistics, A.12, 2663–2692.
ASTM (1997). Standard test method for sensory analysis: Triangle test. ASTM E1885-97.
ASTM (2001). Standard test method for directional difference test. ASTM E2164-01.
Barker, L., Rolka, H., Rolka, D., & Brown, C. (2001). Equivalence testing for binomial random variables: Which test to use? The American Statistician, 55(4), 279–287.
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection–union tests and equivalence confidence sets (with Discussion). Statistical Science, 11, 283–319.
Blackwelder, W. C. (1982). "Proving the null hypothesis" in clinical trials. Controlled Clinical Trials, 3, 345–353.
Bliss, C. I. (1960). Some statistical aspects of preference and related tests. Applied Statistics, 9, 8–19.
Carr, B. T. (1995). Confidence intervals in the analysis of sensory discrimination tests: The integration of similarity and difference testing. In Proceedings of 4th AgoStat Dijon, 7–8 December 1995, pp. 23–31.
Chow, S. C., & Lui, J. P. (1992). Design and analysis of bioavailability and bioequivalence studies. New York: Marcel Dekker.
Dunnett, C. W., & Gent, M. (1977). Significance testing to establish equivalence between treatments, with special reference to data in the form of 2 × 2 tables. Biometrics, 33, 593–602.
Ennis, D. M., & Bi, J. (1999). The Dirichlet-multinomial model: Accounting for intertrial variation in replicated ratings. Journal of Sensory Studies, 14, 321–345.
FDA (1992). Bioavailability and Bioequivalence Requirements. In US code of federal regulations (Vol. 21), Washington, DC: US Government Printing Office.
Ferris, G. E. (1958). The k-visit method of consumer testing. Biometrics, 14, 39–49.
Gacula, M. C., Jr. (1991). Claim substantiation for sensory equivalence and superiority. In H. T. Lawless & B. P. Klein (Eds.), Sensory science theory and applications in foods. New York: Marcel Dekker.
Gridgeman, N. T. (1960). Statistics and taste testing. Applied Statistics, 9, 103–112.
Hauck, W. W., & Anderson, S. (1984). A new statistical procedure for testing equivalence in two-group comparative bioavailability trials. Journal of Pharmacokinetics and Biopharmaceutics, 12, 72–78.
Hauschke, D., Steinijans, V. W., & Diletti, E. (1990). A distribution-free procedure for the statistical analyses of bioequivalence studies. International Journal of Clinical Pharmacology and Therapeutics and Toxicology, 28, 72–78.
Horsnell, G. (1969). A theory of consumer behavior derived from repeat paired preference testing (with Discussion). Journal of the Royal Statistical Society Series A, 132, 164–192.
Horsnell, G. (1977). Paired comparison product testing when individual preferences are stochastic: An alternative model. Applied Statistics, 26, 162–172.
Hsu, J. C., Hwang, J. T. G., Lui, H.-K., & Ruberg, S. J. (1994). Confidence intervals associated with tests for bioequivalence. Biometrika, 81, 103–114.
Hutchinson, T. P. (1979). A comment on replicated paired comparisons. Applied Statistics, 28, 163–169.
Liu, J.-P., Hsueh, H.-M., Hsueh, E., & Chen, J. J. (2002). Tests for equivalence or non-inferiority for paired binary data. Statistics in Medicine, 21, 231–245.
Lu, Y., & Bean, J. A. (1995). On the sample size for one-sided equivalence of sensitivities based upon McNemar’s test. Statistics in Medicine, 14, 1831–1839.
MacRae, A. W. (1995). Confidence intervals for the triangle test can give reassurance that products are similar. Food Quality and Preference, 6, 61–67.
Mandallaz, D., & Mau, J. (1981). Comparison of different methods for decision-making in bioequivalence assessment. Biometrics, 37, 213–222.
Meilgaard, M., Civille, G. V., & Carr, B. T. (1991). Sensory evaluation techniques (2nd ed.). Boca Raton: CRC Press.
Metzler, C. M., & Haung, D. C. (1983). Statistical methods for bioavailability and bioequivalence. Clinical Research and Practices & Drug Regulatory Affairs, 1, 109–132.
Nam, J.-M. (1997). Establishing equivalence of two treatments and sample size requirements in matched–paired design. Biometrics, 53, 1422–1430.
O’Quigley, J., & Baudoin, C. (1988). General approaches to the problem of bioequivalence. The Statistician, 37, 51–58.
Patel, H. I., & Gupta, G. D. (1984). A problem of equivalence in clinical trials. Biometrical Journal, 26, 471–474.
Rashid, M. M. (2003). Rank-based tests for non-inferiority and equivalence hypotheses in multi-centre clinical trials using mixed models. Statistics in Medicine, 22, 291–311.
Roadary, C., Com-Nougue, C., & Tournade, M.-F. (1989). How to establish equivalence between treatments: a one sided clinical trial in paediatric oncology. Statistics in Medicine, 8, 593–598.
Rocke, D. M. (1984). On testing for bioequivalence. Biometrics, 40, 225–230.
Rodda, B. E., & Davis, R. L. (1980). Determining the probability of an important difference in bioavailability. Clinical Pharmacology & Therapeutics, 28, 247–252.
Schuirmann, D. J. (1981). On hypothesis testing to determine if the mean of a normal distribution is contained in a known interval. Biometrics, 37, 617.
Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680.
Selwyn, M. R., Dempster, A. P., & Hall, N. R. (1981). A Bayesian approach to bioequivalence for the 2 × 2 changeover design. Biometrics, 37, 11–21.
Senn, S. J. (1997). Statistical issues in drug development. Chichester: Wiley.
Snedecor, G. W., & Cochran, G. C. (1989). Statistical methods (8th ed.). Ames: Iowa State University Press.
Steinijans, V. W., & Diletti, E. (1983). Statistical analysis of bioavailability studies: Parametric and nonparametric confidence intervals. European Journal of Clinical Pharmacology, 24, 127–136.
Tang, N.-S., Tang, M.-L., & Chan, I. S. F. (2003). On tests of equivalence via non-unity relative risk for match–pair design. Statistics in Medicine, 22, 1217–1233.
Tango, T. (1998). Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine, 17, 891–908.
Westlake, W. J. (1972). Use of confidence intervals in analysis of comparative bioavailability trials. Journal of Pharmaceutical Science, 61, 1340–1341.
Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics, 32, 741–744.
Westlake, W. J. (1979). Statistical aspects of comparative bioavailability trials. Biometrics, 35, 273–280.
Wierenga, B. (1974). Paired comparison product testing when individual preference are stochastic. Applied Statistics, 23, 384–396.