On robust testing for normality in chemometrics


Stehlík, M.a,∗, Střelec, L.b, Thulin, M.c

aDepartment of Applied Statistics, Johannes Kepler University in Linz, Austria
bDepartment of Statistics and Operational Analysis, Mendel University in Brno, Czech Republic

cDepartment of Mathematics, Uppsala University, Sweden

Abstract

The assumption that the data has been generated by a normal distribution underlies many statistical methods used in chemometrics. While such methods can be quite robust to small deviations from normality, for instance caused by a small number of outliers, common tests for normality are not and will often needlessly reject normality. It is therefore better to use tests from the little-known class of robust tests for normality. We illustrate the need for robust normality testing in chemometrics with several examples, review a class of robustified omnibus Jarque-Bera tests and propose a new class of robustified directed Lin-Mudholkar tests. The robustness and power of several tests for normality is compared in a large simulation study. The new tests are robust and have high power in comparison with both classic tests and other robust tests. A new graphical method for assessing normality is also introduced.

Keywords: trimming, Lehmann-Bickel functional, model diagnostics, Monte Carlo simulations, power comparison, robust tests for normality.

1. Introduction

Classic parametric statistical significance tests, such as analysis of variance and least squares regression, are widely used by researchers in many disciplines of chemistry. For classic parametric tests to produce accurate results, the assumptions underlying them (e.g. normality and homoscedasticity) must be satisfied. These assumptions are rarely met when analyzing real data. The use of classic parametric methods with violated assumptions can result in inaccurate computations of p-values, effect sizes, and confidence intervals. This may lead to substantial errors in the interpretation of data.

For this reason, model diagnostics in general and testing for normality in particular are very important issues in chemometrics. But it is often the case that it is not necessary for the underlying distribution to be exactly normal for a statistical method to be valid. Except for situations where the sample size is extremely small, the question that really is of interest is whether the distribution is

∗Corresponding author. Department of Applied Statistics, Johannes Kepler University in Linz. Altenberger Strasse 69, 4040 Linz a. D., Austria. Email: [email protected], Phone: +43 732 2468 6806, Fax: +43 732 2468 9846

Email addresses: [email protected] (Stehlík, M.), [email protected] (Střelec, L.), [email protected] (Thulin, M.)

Preprint submitted to Chemometrics and Intelligent Laboratory Systems October 24, 2013

approximately normal. For an overview and discussion of robust chemometrical statistical methods, both parametric and non-parametric, see refs. [30] and [14] and the references therein.

A test for normality that is less sensitive to small deviations from normality, particularly in the form of a few “bad” observations, is called robust. A drawback of virtually all common tests for normality is that they lack robustness and are far too sensitive to outliers, rejecting normality even when methods that require normality would be applicable. In recent years, several studies on robustification of tests for normality, aiming to correct such drawbacks, have appeared in the literature; see refs. [12, 7, 11, 33].

While books on, e.g., analytical chemistry generally contain a section devoted to normality testing (see e.g. refs. [25, 5]), there are still several open problems related to robust testing for normality, both in theory and in practice. The aim of this paper is to contribute to this discussion. The next section illustrates the necessity of robust testing for normality in chemometrics. Therein we discuss, using real data examples, the importance of robust testing for normality in the measurement of mycolic acid, in tropospheric methane modeling and in modeling methane emissions from a sedge-grass marsh. In Section 3 we review the recently introduced class of (omnibus) robustified Jarque-Bera tests and introduce a class of (directed) robustified Lin-Mudholkar tests. In Section 4 we conduct an extensive comparison of the robustness and power of several tests for normality. Section 5 addresses robust graphical methods for assessing normality. The paper concludes with a discussion, in which practical guidelines for robust testing of normality are given and the merits of omnibus and directed tests are compared. To maintain the continuity of explanation, proofs and technicalities are deferred to an appendix.

We emphasize that in this paper we assume that the possible contamination of the sample is due to outliers. Thus we use two techniques for outlier filtering: trimming and the functional approach introduced by Bickel and Lehmann (see ref. [4]). For normality testing when the whole distribution may be contaminated, i.e. when any quantile level of the distribution may be contaminated, see e.g. ref. [1].

2. Motivation of robust testing for normality in chemometrics

In this section we illustrate some problems that arise when classic tests for normality, such as the Shapiro-Wilk and Jarque-Bera tests, are used in chemometrics. This shows the need for robust testing for normality in chemometrics.

2.1. Mycolic acid: a data set contaminated with a single outlier

Mycolic acids are the major components of the cell walls of Mycobacterium tuberculosis and their biochemical properties are paramount to the pathogenesis and survival of these bacteria. For this reason, attempts have been made to create drugs that inhibit mycolic acid synthesis. A data set consisting of 26 measurements of mycolic acids in M. tuberculosis is visualized in Figure 1. The sample is contaminated by a single outlier. For this data set, the popular Shapiro-Wilk test for normality gives a p-value of 0.006. However, if the outlier is removed the p-value is 0.41. Clearly a single outlier can have a huge effect on the result of the Shapiro-Wilk test. In many situations, it would be of greater interest to have a test that looks at the overall shape of the empirical distribution rather than a test that rejects normality because of a small deviation in the form of an outlier. An example of such a test is the RLMγ test presented in Section 3, which has higher power than the Shapiro-Wilk test in many settings, but yields a p-value of 0.98 for the 26 mycolic acid measurements.

Figure 1: Mycolic acid measurements with an outlier (panels: scatterplot, histogram, QQ-plot)

2.2. Methane modeling in the troposphere and on the Earth

Statistical modeling plays a central role in studies of methane emission and methane absorption, both in the atmosphere and on the ground. For examples, see ref. [40] for methane in the troposphere, ref. [21] regarding methane emissions from natural wetlands, or ref. [19] for modeling of the methane emission from a sedge-grass marsh in South Bohemia. Such modeling is of tremendous complexity and typically requires several distributional assumptions when statistical learning is desired.

It is understandable that outliers from any reasonable distribution are expected, and thus the need for robust testing arises. Even for constructing optimal sampling plans we need distributional assumptions, e.g. on the error structure (see ref. [29]). Thus robust testing for normality can be a very useful tool for model diagnostics in this setup.

For illustrative purposes, we consider four methane data sets. The first is the flux rate of methane k1, taken from Table 1 in ref. [40]. For this data, both the popular Shapiro-Wilk and Jarque-Bera tests for normality reject normality (the p-values are < 0.001 and 0.002, respectively), as does the new robust RLMγ test (its p-value being < 0.001). The non-normality is likely not due to outliers, but to the dependence structure, which was modeled in ref. [29].

The remaining data sets are from a study of methane emission from a sedge-grass marsh in South Bohemia. The residuals Z, Z−, Z+ of methane emissions are taken from the infinite moving average model (8) in ref. [19], where only time is taken as a regressor. The Shapiro-Wilk, Jarque-Bera and RLMγ tests reject normality for all three sets of residuals, with p-values < 0.001. The reason for this non-normality of the data is not outliers, but a heavy-tailed pattern, described in ref. [19].

2.3. Asking the right question

Summarizing our observations from the previous examples, we conclude that exact normality is rarely encountered in experiments. For virtually all statistical methods, however, the inference will also be valid for approximately normal random variables. The question that we should ask is therefore not “are these random variables normal?” but “are these random variables normal enough?” What can be considered “normal enough” depends on the statistical method and the sample size.

Consider, for instance, Student’s t-test. Assume that we have a sample from a random variable X. As it turns out, the non-normality of X can be quantified using the concepts of asymmetry and peakedness. The asymmetry of X is usually measured by its skewness γ = E(X − µ)^3/σ^3. If X is symmetric about its mean, γ = 0. If X “leans to the right” then γ > 0, and we say that X is right-skew. Similarly, X is left-skew if γ < 0. The peakedness is measured by the (excess) kurtosis κ = E(X − µ)^4/σ^4 − 3. If X is normal, κ = 0, whereas short-tailed distributions tend to have κ < 0 and heavy-tailed distributions tend to have κ > 0.

To understand how skewness and kurtosis can be used to measure non-normality in the context of Student’s t-test, we need some tools from theoretical statistics. If Tn is the test statistic of Student’s t-test, we can, using a so-called Edgeworth expansion (ref. [16]), obtain the following approximation of the null distribution of Tn:

P(Tn ≤ x) ≈ Φ(x) + n^(−1/2) (1/6) γ (2x^2 + 1) ϕ(x) − n^(−1) x [ (1/12) κ (x^2 − 3) − (1/18) γ^2 (x^4 + 2x^2 − 3) − (1/4)(x^2 + 3) ] ϕ(x),

where Φ(·) is the cumulative distribution function of the standard normal distribution and ϕ(·) is its density function. We can therefore see how skewness and kurtosis affect the null distribution of the test statistic.

When X truly is normal, γ = κ = 0 and we see that P(Tn ≤ x) ≈ Φ(x). If X is non-normal and γ or κ are nonzero, however, P(Tn ≤ x) is perturbed by skewness or kurtosis and the size and p-values of the test will no longer behave as desired. The performance of Tn is therefore sensitive to deviations from normality in the form of skewness and kurtosis.

If we wish to investigate whether X is “normal enough” it makes sense to measure non-normality in terms of skewness and kurtosis, as these quantities directly determine how good an approximation for the null distribution of the test statistic is. This is equally true for many other statistical procedures: in general, methods based on normality assumptions work very well for distributions with low γ and κ even if there are a small number of outliers. It thus seems desirable to have tests for normality that are based on estimates of skewness and kurtosis. On a side note, we mention that the influence of different shapes of distributions with the same first four moments on robustness has been discussed for sequential t-tests by Nurnberg and Rasch (see ref. [28]).
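The Edgeworth expansion above is straightforward to evaluate numerically. The following is a minimal sketch of our own (the function name edgeworth_t_cdf is illustrative, not from the paper) implementing the approximation directly:

```python
# Illustrative sketch: evaluate the Edgeworth approximation of the null
# distribution of Student's t statistic given in the text above.
from math import sqrt, pi, exp, erf

def std_normal_pdf(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def std_normal_cdf(x):
    # Phi(x) via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def edgeworth_t_cdf(x, n, skew, kurt):
    """Approximate P(T_n <= x) for sample size n, population skewness
    'skew' (gamma) and excess kurtosis 'kurt' (kappa)."""
    phi, Phi = std_normal_pdf(x), std_normal_cdf(x)
    term1 = n ** -0.5 * (1 / 6) * skew * (2 * x ** 2 + 1) * phi
    term2 = (1 / n) * x * ((1 / 12) * kurt * (x ** 2 - 3)
                           - (1 / 18) * skew ** 2 * (x ** 4 + 2 * x ** 2 - 3)
                           - (1 / 4) * (x ** 2 + 3)) * phi
    return Phi + term1 - term2

# Under normality (gamma = kappa = 0) both correction terms vanish at x = 0:
print(edgeworth_t_cdf(0.0, 50, 0.0, 0.0))  # 0.5
```

Perturbing skew or kurt away from zero shifts the approximate null distribution, which is exactly the sensitivity of Tn discussed above.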

3. Classes of robust tests for normality

3.1. Location functionals

The reason so many classical procedures are non-robust to outliers is often that the parameters of the model are expressed in terms of moments, and their classical estimators are expressed in terms of sample moments, which are very sensitive to outliers. Another approach to robustness is to concentrate on the parameters of interest suggested by the problem under study. It may well turn out that these parameters can be expressed as functions of the underlying distribution independently of a particular parametric model; that is, as descriptive measures. If these descriptive measures are judiciously chosen, their naturally induced estimators are robust to aberrations in the data. This approach was introduced by P. J. Bickel and E. L. Lehmann in a series of papers (see e.g. ref. [4]). The test statistics proposed in this paper rely on estimates of central moments, inspired by the Bickel-Lehmann approach. Next, we give a short technical description of the functions involved in these estimates.

To estimate central moments, we need a location estimator. We will use four different location estimators: the mean T(0) = (1/n) Σ_{i=1}^{n} X_i, the median T(1), the trimmed mean T(2)(s) = (1/(n − 2s)) Σ_{i=s+1}^{n−s} X_{i:n}, where X_{1:n} < X_{2:n} < ... < X_{n:n} are the order statistics, and the Hodges-Lehmann pseudo-median T(3) = median_{i≤j} (X_i + X_j)/2, i.e. the median of the set {(X_1 + X_1)/2, (X_1 + X_2)/2, (X_1 + X_3)/2, ..., (X_1 + X_n)/2, (X_2 + X_2)/2, (X_2 + X_3)/2, ..., (X_2 + X_n)/2, ..., (X_{n−1} + X_n)/2, (X_n + X_n)/2}. All four of these are location functionals in the sense of Bickel and Lehmann (see ref. [4]). Furthermore, we will also use the variance functional construction given by Bickel and Lehmann, defined in analogy with the location functionals.
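As an illustration, the four location estimators can be computed directly from their definitions. The following is a minimal Python sketch of our own (not the authors' code); note how only the mean is pulled towards the outlier:

```python
# Illustrative sketch: the four location estimators T(0)-T(3) described above.
def mean(xs):                      # T(0): the sample mean
    return sum(xs) / len(xs)

def median(xs):                    # T(1): the sample median
    s, n = sorted(xs), len(xs)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def trimmed_mean(xs, s):           # T(2)(s): drop the s smallest and s largest
    order = sorted(xs)
    kept = order[s:len(order) - s]
    return sum(kept) / len(kept)

def pseudo_median(xs):             # T(3): Hodges-Lehmann, median of pairwise means
    n = len(xs)
    pairs = [(xs[i] + xs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(pairs)

data = [1.0, 2.0, 3.0, 4.0, 100.0]      # one gross outlier
print(mean(data))                        # 22.0: pulled far by the outlier
print(median(data))                      # 3.0
print(trimmed_mean(data, 1))             # 3.0 (mean of 2, 3, 4)
print(pseudo_median(data))               # 3.0
```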

To estimate the j:th central moment of the random variable X, we will use

M_j(r, T(Fn, s)) = (1/(n − 2r)) Σ_{m=1+r}^{n−r} φ_j( X_{m:n} − T(Fn, s) ),

where T(Fn, s) is a location functional applied to the sample X_1, X_2, ..., X_n, and φ_j is a tractable and continuous function with φ_0(x) = √(π/2) |x| and φ_j(x) = x^j for j ∈ {1, 2, 3, 4}.

In the following, we will typically use 5 % trimming, i.e. r = s = 0.05n, so that the smallest 5 % as well as the largest 5 % of the observations are trimmed. Naturally, this will not be enough if the sample is contaminated by more than 5 % outliers in either tail, and other trimming constants can certainly be used. We believe that 5 % trimming is enough and that it provides a reasonable degree of robustness: statistical procedures based on the normality assumption are unlikely to perform well on heavily contaminated samples, and using too much trimming can therefore be problematic. The choice of 5 % trimming thus provides a trade-off between robustness and power.
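A direct reading of the definition of M_j can be sketched as follows (illustrative code of our own; sorting the sample produces the order statistics X_{m:n}):

```python
# Illustrative sketch of the trimmed central moment estimator M_j(r, T).
from math import sqrt, pi

def M(j, xs, r, loc):
    """Trim the r smallest and r largest order statistics, then average
    phi_j(X_{m:n} - T) over the n - 2r kept observations."""
    order = sorted(xs)
    kept = order[r:len(order) - r]
    if j == 0:
        phi = lambda x: sqrt(pi / 2) * abs(x)   # phi_0
    else:
        phi = lambda x: x ** j                   # phi_j, j = 1..4
    return sum(phi(x - loc) for x in kept) / len(kept)

data = [0.0, 1.0, 2.0, 3.0, 4.0]
loc = sum(data) / len(data)          # the plain mean T(0), for illustration
print(M(2, data, 0, loc))            # 2.0: the (biased) sample variance
```

With r > 0 the extreme order statistics no longer enter the sum, which is what gives the trimmed tests in the next sections their robustness.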

3.2. Robustified Jarque-Bera tests

Let

γ = [ (1/n) Σ_{i=1}^{n} (x_i − x̄)^3 ] / [ (1/n) Σ_{i=1}^{n} (x_i − x̄)^2 ]^(3/2),
κ = [ (1/n) Σ_{i=1}^{n} (x_i − x̄)^4 ] / [ (1/n) Σ_{i=1}^{n} (x_i − x̄)^2 ]^2 − 3

be the sample skewness and sample kurtosis, respectively. The classical Jarque-Bera test (see ref. [18]) is then based on the statistic

JB = (n/6) γ^2 + (n/24) κ^2,

that is, on a weighted sum of the squared sample skewness and sample kurtosis. It is therefore in line with the discussion in Section 2.3 about what constitutes a reasonable test for normality.
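For concreteness, the JB statistic can be computed directly from the formulas above (an illustrative sketch of our own; library implementations additionally return a p-value based on the asymptotic χ² distribution with 2 degrees of freedom):

```python
# Illustrative computation of the Jarque-Bera statistic from the definitions
# of the sample skewness and kurtosis given above.
def jarque_bera_stat(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3
    return n / 6 * skew ** 2 + n / 24 * kurt ** 2

# A perfectly symmetric sample has zero skewness, so only kurtosis contributes:
print(jarque_bera_stat([-2.0, -1.0, 0.0, 1.0, 2.0]))  # ~ 0.352
```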

The general class RT of robustified Jarque-Bera tests was introduced in ref. [33]. It has the form

RT = (n/C1) [ M_{j1}^{α1}(r1, T(i1)(Fn, s1)) / M_{j2}^{α2}(r2, T(i2)(Fn, s2)) − K1 ]^2 + (n/C2) [ M_{j3}^{α3}(r3, T(i3)(Fn, s3)) / M_{j4}^{α4}(r4, T(i4)(Fn, s4)) − K2 ]^2.  (1)

Proposition 1. As can be seen from (1), there exists a vast number of RT class tests, which we can obtain by choosing different values of r_i, T(i)(Fn, s_i), etc. Several such tests have already been studied in the literature. These include i) the classical Jarque-Bera (JB) test, ii) the Urzua-Jarque-Bera (JBU) test, iii) the robust Jarque-Bera (RJB) test, iv) the Geary (a) test, v) the Uthoff (U) test, vi) the skewness √b1 test, vii) the kurtosis b2 test and viii) the SJ test. See Appendix 7 for more details about these RT class tests.


The choice of appropriate constants C1 and C2 is the hardest aspect of using tests belonging to the RT class. To obtain the constants C1 and C2 we need to find expressions for E(M^k_{n1,n2}) for a finite sample size. Such calculations are very tedious and therefore we obtained these constants from Monte Carlo simulations (see ref. [33], Tables 1 and 2, for constant trimming of selected mean-median, mean-trimmed mean and trim-trim tests; for other tests the constants are given in this paper). Notice that the critical constants (for small and mid-sized samples) under trimming (r > 0) are different from the critical constants without trimming (r = 0), since only the asymptotic distribution is normal in this case (see ref. [34]).

The classical Jarque-Bera test is not robust: it has a zero breakdown point (see ref. [7]). However, there are many robust tests in the more general RT class, since these are based on the robust Bickel-Lehmann construction of location (ref. [4]). The second approach to robust estimation is trimming of location (trimmed-mean tests) and of moments (trim-trim tests). For trimming of location we use the trimming constant s with s = 0.05n. For trimming of moments we use the trimming constant r with r = 0.05n. For a combination of trimming of location and moments we use the trimming constants r and s, where s determines the trimming in the location estimator T(Fn) and r determines the trimming in Mj(r, T).

Apart from tests with increased robustness, the RT class also contains tests with higher power than the classical Jarque-Bera test. In particular, the mean-median types of the introduced tests have much better power than the classical Jarque-Bera test, especially the tests which have medians in their numerators.

Focusing on the RT class has its own reasons: we would like to find a small (tractable) group of robustified JB-type tests which substantially improve the properties of the classical JB test (e.g. in terms of power against a certain class of alternatives), and which is still tractable for practitioners to use. For that reason we conducted a power comparison of 46 ordered variants of the RT_JB subclass. By ordered we mean tests that combine the most efficient mean with robust location estimators (median and trimmed mean) and robust trim-trim moment estimators. We compared the tests against a wide range of alternatives for several sample sizes.

Based on this pilot simulation study, we decided that the following RT class tests merited further investigation:

The mean-median MMRT1 test statistic (suitable for testing of normality against heavy- and light-tailed asymmetric alternatives), which has the following form:

MMRT1 = (n/18) [ M3(0, T(1)(Fn, 0)) / M2^(3/2)(0, T(0)(Fn, 0)) ]^2 + (n/24) [ M4(0, T(0)(Fn, 0)) / M2^2(0, T(0)(Fn, 0)) − 3 ]^2.

The mean-median MMRT2 test statistic (suitable for testing of normality against heavy- and light-tailed asymmetric alternatives and against bimodal alternatives), which has the form:

MMRT2 = (n/18) [ M3(0, T(1)(Fn, 0)) / M2^(3/2)(0, T(1)(Fn, 0)) ]^2 + (n/24) [ M4(0, T(0)(Fn, 0)) / M2^2(0, T(1)(Fn, 0)) − 3 ]^2.

The trim-trim TTRT1 test statistic with trimming s = r = 0.05n (suitable for testing of normality against short-tailed symmetric alternatives), which is more robust than the other tests and has the following form:

TTRT1 = (4n/5) [ M3(0.05n, T(2)(Fn, 0.05n)) / M2^(3/2)(0, T(0)(Fn, 0)) ]^2 + (27n/20) [ M4(0.05n, T(2)(Fn, 0.05n)) / M2^2(0, T(0)(Fn, 0)) − 0.85 ]^2.


The trim-trim TTRT2 test statistic with trimming s = r = 0.05n (note that this test is suitable for testing of normality against heavy-tailed symmetric alternatives), which has the following form:

TTRT2 = (16n/5) [ M3(0.05n, T(2)(Fn, 0.05n)) / M2^(3/2)(0.05n, T(2)(Fn, 0.05n)) ]^2 + (n/550) [ M4(0, T(0)(Fn, 0)) / M2^2(0.05n, T(2)(Fn, 0.05n)) − 7.73 ]^2.

3.3. Robustified Lin-Mudholkar tests

A well-known characterization of the normal distribution is that the sample mean X̄ and sample variance S² are independent if and only if the underlying population is normal. Similarly, X̄ and (1/n) Σ_{i=1}^{n} (X_i − X̄)^3 are independent if and only if X is normal [20, Sections 4.2 and 4.7].

Ref. [23] proposed a test for normality based on the independence of X̄ and S². They used a jackknife procedure to estimate the correlation coefficient ρ(X̄, S²). The test, denoted Z2, is directed against skew alternatives. Ref. [26] proposed a test Z3, directed against heavy-tailed alternatives, based on the independence of X̄ and (1/n) Σ_{i=1}^{n} (X_i − X̄)^3, constructed using the same jackknife procedure.

Refs. [36, 37] derived the corresponding bootstrap estimators of these correlations. With γ and κ as in the previous section, let

λ = [ (1/n) Σ_{i=1}^{n} (x_i − x̄)^6 ] / [ (1/n) Σ_{i=1}^{n} (x_i − x̄)^2 ]^3 − 15κ − 10γ^2 − 15.

The bootstrap estimators can be computed directly without resampling:

Z*2 = γ / √( κ + 3 − (n − 3)/(n − 1) ),  (2)

Z*3 = κ / √( λ + 9 (n/(n − 1)) (κ + γ^2) + 6n^2/((n − 1)(n − 2)) ).  (3)

In the simulation study below, Z*2 is denoted BLMγ and Z*3 is denoted BLMκ. Note that the estimators are smooth functions of the sample skewness and kurtosis (and the sixth sample cumulant). They are therefore of the form that was deemed desirable in Section 2.3. The Z*2 statistic is a weighted version of the classic skewness test statistic, where we also take kurtosis into account.

The class RLM of robustified Lin-Mudholkar tests Z2,R and Z3,R is obtained by replacing the moment estimators in (2) and (3) by the corresponding estimators Mj, using

γR = M3(r1, T1(Fn, s1)) / M_k^a(r2, T2(Fn, s2)),  κR = M4(r3, T3(Fn, s3)) / M_ℓ^b(r4, T4(Fn, s4)) − 3,

λR = M6(r5, T5(Fn, s5)) / M_m^c(r6, T6(Fn, s6)) − 15κR − 10γR^2 − 15,

where the constants a, b and c depend on whether k, ℓ and m are 0 or 2. This class of tests can become numerically unstable, for instance if M4/M2^2 − (n − 3)/(n − 1) < 0, in which case the robustified Z*2 statistic is imaginary. Since the excess kurtosis of X is bounded below by −2, we propose using kurtosis estimators of the type max(−2, M4/M2^2 − 3) or max(−2, M4/M0^4 − 3) to avoid numerical issues. The numerical stability issues are even worse for the robustified Z*3; in fact we were not able to find a numerically stable robust test in this class.

We conducted a large simulation study to compare robustified versions of the Z*2 test. Based on these results, we decided to include the following test statistic in the comparison in Section 4:

RLMγ = γR / √( κR + 3 − (n − 3)/(n − 1) )

with

γR = M3(0.05n, T(0)(Fn, 0)) / M2^(3/2)(0.05n, T(2)(Fn, 0))

and

κR = max( −2, M4(0.05n, T(3)(Fn, 0)) / M0^4(0.05n, T(0)(Fn, 0)) − 3 ).

Asymptotic properties and invariance of the RLM tests are discussed in the appendix.
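Putting the pieces together, RLMγ can be sketched as follows (our own illustrative reading of the definitions above: r = 0.05n is rounded down, and T(2)(Fn, 0), the trimmed mean with s = 0, reduces to the plain mean):

```python
# Illustrative sketch of the RLM_gamma statistic; not the authors' code.
from math import sqrt, pi

def rlm_gamma(xs):
    n = len(xs)
    r = int(0.05 * n)                                # trimming, rounded down

    def median(v):
        s, m = sorted(v), len(v)
        return s[m // 2] if m % 2 else (s[m // 2 - 1] + s[m // 2]) / 2

    mean = sum(xs) / n                               # T(0); also T(2)(Fn, 0)
    pseudo_med = median([(xs[i] + xs[j]) / 2         # T(3), Hodges-Lehmann
                         for i in range(n) for j in range(i, n)])

    def M(j, loc):                                   # trimmed moment M_j(r, .)
        kept = sorted(xs)[r:n - r]
        phi = (lambda x: sqrt(pi / 2) * abs(x)) if j == 0 else (lambda x: x ** j)
        return sum(phi(x - loc) for x in kept) / len(kept)

    gamma_r = M(3, mean) / M(2, mean) ** 1.5
    kappa_r = max(-2.0, M(4, pseudo_med) / M(0, mean) ** 4 - 3)
    return gamma_r / sqrt(kappa_r + 3 - (n - 3) / (n - 1))

print(rlm_gamma([float(i) for i in range(-10, 11)]))  # 0.0 for a symmetric sample
```

For a symmetric sample the trimmed third moment about the mean vanishes, so the statistic is exactly zero; a single gross outlier is simply trimmed away before the moments are computed.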

3.4. Adaptive procedures

A natural alternative way to perform a (hopefully) robust normality test is to first manually remove outliers and then use a standard test for normality. In the simulation study in the next section, we consider four such adaptive tests, based on the Jarque-Bera and Shapiro-Wilk tests. We apply the standard version of these tests to the data after either of two outlier removal procedures has been used. The first procedure is to trim the data: the smallest 5 % of the observations are removed along with the largest 5 %. We denote these tests JBtrimmed and SWtrimmed, respectively. The second procedure is to remove those points that are more than 1.5 box-lengths away from the box in a boxplot. We denote these tests JBboxplot and SWboxplot.
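The two outlier-removal steps are simple to implement; a sketch of our own is given below (the crude quartile rule is an assumption for illustration). The filtered sample would then be passed to a standard normality test, e.g. scipy.stats.shapiro:

```python
# Illustrative sketch of the two outlier-removal procedures described above.
def trim(xs, prop=0.05):
    """Remove the smallest and largest 100*prop % of the observations."""
    s = sorted(xs)
    k = int(prop * len(s))
    return s[k:len(s) - k]

def boxplot_filter(xs, whisker=1.5):
    """Keep points within 'whisker' box-lengths (IQRs) of the quartile box."""
    s = sorted(xs)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # crude quartiles, for illustration
    iqr = q3 - q1
    lo, hi = q1 - whisker * iqr, q3 + whisker * iqr
    return [x for x in xs if lo <= x <= hi]

data = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 50.0]   # one gross outlier
print(boxplot_filter(data))   # the outlier 50.0 is removed
```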

4. Illustrative examples and power comparisons

4.1. Tests used in the comparison

In this section we try to find robust tests for normality. We do so by comparing the performance of many different tests under a wide range of outlier models, to find a few tests with good robustness properties in many different situations. We are looking for tests that have low power against outlier-contaminated normal distributions but high power against other distributions.

There is no uniformly best test for normality that can be recommended for use in all situations. We therefore compare the tests against different types of alternatives/distributions, to find out which test is preferable in different situations. While it is rare in practice that the alternative distribution is known exactly, it is not rare that it is known that, for instance, skew distributions can arise or that skew distributions could affect the outcome of the statistical procedure (e.g. the t-test). In such cases certain tests for normality are preferable, as is evident below. If no such prior knowledge exists, choosing which test for normality to use is more difficult. We discuss further details of these issues in Section 6.1 of the paper.

To compare the robustness and power of several tests for normality, we conducted a large Monte Carlo simulation study, using 100,000 simulated samples for each alternative and α = 0.05.

Apart from the tests presented in the previous sections, i.e. the RT class tests, the BLMγ, BLMκ and RLMγ tests and the adaptive tests, we included several competing tests for normality.

8

These were included either because they are popular tests that have exhibited high power in previous simulation studies or because they have been proposed as robust tests for normality. The competing tests are the Jarque-Bera test (JB) introduced by ref. [18], the Jarque-Bera-Urzua test (JBU) introduced by ref. [38], the robust Jarque-Bera test (RJB) introduced by ref. [11], the Anderson-Darling test (AD) introduced by ref. [2], the D'Agostino test (DT) introduced by ref. [9], the Lilliefors test (LT) introduced by ref. [22], the medcouple test (MCLR) introduced by ref. [7], the directed SJ test (SJdir) introduced by ref. [12] and the Shapiro-Wilk test (SW) introduced by ref. [31].

4.2. Robustness

First, we compare the robustness of several tests for normality. We call a test robust if it is insensitive to small deviations from normality in the form of outliers. For such tests, the power against contaminated normal distributions should be close to the type I error rate α. We call a test non-robust if it has power much above α against mildly contaminated normal distributions.

We considered the mixture distribution (1 − p)N(µ1, σ1²) + pN(µ2, σ2²) as a model for outliers. A location-contaminated standard normal distribution, hereon termed LoConN(p; µ), is one in which each observation is, with probability 1 − p, drawn from a standard normal distribution and, with probability p, drawn from a normal distribution with mean µ and standard deviation σ = 1. The selected case is LoConN(0.05; 3), i.e. right contamination. Note that, by symmetry, the power of the tests is the same against alternatives with left contamination. Similarly, we also use central (scale) contamination CN(p; µ = 0; σ²) where σ² is close to zero; we chose CN(0.05; 0; 0.05). The results are presented in Tab. 1 below.
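A miniature version of this experiment can be sketched as follows (illustrative only: we use the JB statistic with its asymptotic χ² critical value 5.99 at α = 0.05 and far fewer replications than the 100,000 used in the paper):

```python
# Illustrative miniature of the robustness experiment: estimate the JB
# rejection rate under a pure normal and under LoConN(0.05; 3).
import random

def jb(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    skew = sum((x - m) ** 3 for x in xs) / n / m2 ** 1.5
    kurt = sum((x - m) ** 4 for x in xs) / n / m2 ** 2 - 3
    return n / 6 * skew ** 2 + n / 24 * kurt ** 2

def loconn_sample(n, p, mu):
    """LoConN(p; mu): each point is N(mu, 1) with probability p, else N(0, 1)."""
    return [random.gauss(mu if random.random() < p else 0.0, 1.0)
            for _ in range(n)]

def rejection_rate(p, mu, n=100, reps=2000, crit=5.99):
    return sum(jb(loconn_sample(n, p, mu)) > crit for _ in range(reps)) / reps

random.seed(1)
size = rejection_rate(0.0, 0.0)    # pure normal: should be near alpha = 0.05
power = rejection_rate(0.05, 3.0)  # LoConN(0.05; 3): JB rejects far more often
print(size, power)
```

Even with these rough settings the rejection rate under LoConN(0.05; 3) is far above the rate under the pure normal, reproducing the non-robustness of JB seen in Tab. 1.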

Table 1: Power of selected tests against contamined normal distributions for n = 20, 100

                 n = 20              n = 100
test             right    central    right    central
AD               0.178    0.051      0.538    0.081
DT               0.310    0.060      0.716    0.062
JB               0.318    0.064      0.744    0.080
JBU              0.325    0.064      0.742    0.082
LT               0.124    0.054      0.346    0.096
MCLR             0.052    0.042      0.076    0.041
RJB              0.308    0.066      0.748    0.107
SJdir            0.247    0.074      0.547    0.174
SW               0.235    0.052      0.685    0.066
MMRT1            0.277    0.053      0.686    0.065
MMRT2            0.270    0.050      0.679    0.062
TTRT1            0.040    0.058      0.190    0.060
TTRT2            0.281    0.065      0.633    0.087
BLMγ             0.259    0.053      0.716    0.055
RLMγ             0.060    0.042      0.099    0.039
BLMκ             0.209    0.073      0.604    0.091
JBboxplot        0.097    0.056      0.110    0.036
JBtrimmed        0.135    0.061      0.070    0.025
SWboxplot        0.033    0.022      0.055    0.022
SWtrimmed        0.093    0.051      0.292    0.262


The MCLR, TTRT1, RLMγ and adaptive tests are robust against non-central outliers. All tests are more or less robust against central outliers.

The adaptive tests all have some degree of robustness. JBtrimmed is robust against right contamination when n = 100 but not when n = 20. Moreover, like JBboxplot, it is too conservative against central contamination when n = 100. SWtrimmed is somewhat robust when n = 20 but not when n = 100. SWboxplot is too conservative against right contamination when n = 20 and too conservative against central contamination both when n = 20 and when n = 100.

We investigated the robustness of all tests mentioned above for several other outlier models, including models with a fixed number of outliers and models with scale outliers. The results do not differ qualitatively from those in Tab. 1 and are for the sake of brevity therefore not reported here. The MCLR, TTRT1 and RLMγ tests are robust against many different types of outliers, while the other tests generally are non-robust.

4.3. Power against symmetric heavy-tailed alternatives, symmetric short-tailed alternatives and asymmetric alternatives

In this section we present the power of selected normality tests against symmetric heavy-tailed alternatives, symmetric short-tailed alternatives and asymmetric alternatives. In Tab. 2 and 3 the powers against symmetric heavy-tailed alternatives like the Cauchy, Laplace, t and logistic distributions are shown for n = 20 and n = 100, respectively.

Table 2: Power against symmetric heavy-tailed alternatives for n = 20

test             Cauchy   Laplace   t3       t5       t7       logistic
AD               0.880    0.276     0.327    0.172    0.116    0.107
DT               0.844    0.290     0.371    0.223    0.160    0.142
JB               0.857    0.307     0.385    0.233    0.168    0.148
JBU              0.865    0.320     0.395    0.238    0.170    0.152
LT               0.843    0.232     0.266    0.139    0.093    0.083
MCLR             0.180    0.058     0.054    0.051    0.052    0.049
RJB              0.899    0.357     0.407    0.241    0.169    0.153
SJdir            0.914    0.391     0.403    0.231    0.160    0.144
SW               0.864    0.260     0.339    0.186    0.133    0.116
MMRT1            0.846    0.268     0.355    0.208    0.148    0.130
MMRT2            0.843    0.263     0.350    0.204    0.142    0.126
TTRT1            0.127    0.099     0.088    0.068    0.064    0.063
TTRT2            0.893    0.345     0.404    0.235    0.165    0.151
BLMγ             0.370    0.142     0.182    0.122    0.096    0.095
RLMγ             0.389    0.247     0.279    0.229    0.206    0.045
BLMκ             0.891    0.350     0.416    0.246    0.175    0.163
JBboxplot        0.259    0.109     0.104    0.081    0.070    0.071
JBtrimmed        0.646    0.204     0.197    0.119    0.094    0.093
SWboxplot        0.095    0.036     0.032    0.027    0.024    0.025
SWtrimmed        0.550    0.125     0.123    0.077    0.064    0.064

Table 3: Power against symmetric heavy-tailed alternatives for n = 100


test        Cauchy  Laplace  t3      t5      t7      logistic
AD          1.000   0.824    0.851   0.483   0.283   0.241
DT          1.000   0.725    0.867   0.587   0.394   0.332
JB          1.000   0.802    0.902   0.651   0.454   0.399
JBU         1.000   0.814    0.907   0.661   0.464   0.409
LT          1.000   0.699    0.725   0.329   0.173   0.154

MCLR        0.654   0.130    0.086   0.058   0.054   0.050
RJB         1.000   0.889    0.924   0.678   0.477   0.424
SJdir       1.000   0.942    0.930   0.659   0.445   0.419
SW          1.000   0.796    0.874   0.569   0.363   0.301

MMRT1       1.000   0.790    0.898   0.631   0.436   0.375
MMRT2       1.000   0.787    0.895   0.626   0.429   0.368
TTRT1       0.999   0.341    0.660   0.306   0.156   0.107
TTRT2       1.000   0.862    0.924   0.667   0.468   0.413
BLMγ        0.444   0.171    0.281   0.186   0.137   0.115
RLMγ        0.318   0.314    0.383   0.327   0.289   0.039
BLMκ        1.000   0.891    0.938   0.703   0.510   0.454

JBboxplot   0.301   0.050    0.053   0.041   0.038   0.038
JBtrimmed   0.825   0.080    0.066   0.031   0.028   0.027
SWboxplot   0.236   0.063    0.031   0.023   0.019   0.021
SWtrimmed   0.858   0.180    0.150   0.178   0.211   0.205

From the simulation study we can conclude that SJdir, RJB, TTRT2 and BLMκ are the best tests against symmetric heavy-tailed alternatives such as the Cauchy and t5 distributions. For the most heavy-tailed alternative, the Cauchy, SJdir is the best, while for a moderately heavy-tailed alternative such as t5, the BLMκ and RJB tests are the best. However, none of these tests has acceptable robustness properties. Among the robust tests, RLMγ is preferable against this class of alternatives.

The JBboxplot and JBtrimmed tests actually lose power as n increases from 20 to 100, indicating that they are unsuitable for use with larger sample sizes.
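The power entries in Tables 2 and 3 are Monte Carlo estimates of this kind; a minimal sketch (the simulation size and the specific test here are illustrative assumptions, not the study's exact settings):

```python
# Sketch: Monte Carlo power estimate against a heavy-tailed t alternative,
# mirroring the setup behind Tables 2 and 3.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power(test, sampler, n, n_sim=2000, alpha=0.05):
    """Empirical power of `test` against the distribution drawn by `sampler`."""
    return float(np.mean([test(sampler(n)).pvalue < alpha
                          for _ in range(n_sim)]))

# Power of the Jarque-Bera test against t3 data at n = 20 (cf. the JB row
# of Table 2, which reports 0.385).
print(power(stats.jarque_bera, lambda n: rng.standard_t(df=3, size=n), n=20))
```

Any of the tabulated tests can be plugged in for `test`, and any alternative distribution for `sampler`.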

Tab. 4 presents power against symmetric short-tailed alternatives such as the beta and uniform distributions, for n = 20 and n = 100.

Table 4: Power against symmetric short-tailed alternatives for n = 20 and n = 100


                          n = 20                                      n = 100
test        beta(0.5,0.5)  beta(2,2)  beta(4,4)  uniform   beta(0.5,0.5)  beta(2,2)  beta(4,4)  uniform
AD          0.619          0.054      0.042      0.172     1.000          0.326      0.088      0.954
DT          0.479          0.030      0.019      0.132     1.000          0.639      0.139      0.997
JB          0.006          0.004      0.011      0.004     1.000          0.050      0.004      0.756
JBU         0.003          0.003      0.010      0.002     1.000          0.016      0.002      0.571
LT          0.326          0.051      0.042      0.101     0.994          0.142      0.066      0.580

MCLR        0.461          0.070      0.056      0.137     0.928          0.095      0.064      0.312
RJB         0.004          0.005      0.012      0.003     0.628          0.000      0.001      0.008
SJdir       0.002          0.007      0.018      0.002     0.000          0.000      0.002      0.000
SW          0.726          0.053      0.036      0.198     1.000          0.458      0.091      0.997

MMRT1       0.186          0.025      0.024      0.059     1.000          0.142      0.031      0.816
MMRT2       0.275          0.040      0.031      0.099     1.000          0.200      0.049      0.860
TTRT1       0.132          0.046      0.045      0.064     0.986          0.320      0.119      0.827
TTRT2       0.015          0.006      0.015      0.006     0.743          0.026      0.014      0.117
BLMγ        0.092          0.028      0.031      0.042     0.099          0.018      0.022      0.038
RLMγ        0.139          0.095      0.0110     0.103     0.160          0.110      0.131      0.122
BLMκ        0.805          0.208      0.106      0.499     1.000          0.898      0.365      1.000

JBboxplot   0.522          0.041      0.038      0.126     1.000          0.652      0.179      0.997
JBtrimmed   0.131          0.025      0.033      0.031     0.998          0.265      0.119      0.738
SWboxplot   0.721          0.049      0.029      0.196     1.000          0.453      0.087      0.996
SWtrimmed   0.551          0.074      0.058      0.169     1.000          0.792      0.571      0.988

Against short-tailed alternatives such as the beta distributions, BLMκ is by far the most powerful test. The SW, AD, DT, MMRT2 and TTRT1 tests all have reasonably high power, but only the last of these has good robustness properties.

Tab. 5 and 6 present power against asymmetric alternatives such as the lognormal, exponential and Weibull distributions, for n = 20 and n = 100, respectively.

Table 5: Power against asymmetric alternatives for n = 20


test        lnorm   exp     gamma     Burr    Weibull
            (2,1)           (2,2,1)           (2,1)
AD          0.902   0.780   0.462     0.574   0.123
DT          0.783   0.590   0.380     0.528   0.118
JB          0.820   0.636   0.408     0.553   0.120
JBU         0.771   0.570   0.363     0.515   0.108
LT          0.791   0.597   0.324     0.443   0.101

MCLR        0.492   0.398   0.173     0.177   0.076
RJB         0.782   0.571   0.357     0.508   0.102
SJdir       0.684   0.428   0.244     0.400   0.063
SW          0.930   0.842   0.527     0.626   0.147

MMRT1       0.878   0.729   0.458     0.585   0.126
MMRT2       0.870   0.721   0.445     0.568   0.126
TTRT1       0.370   0.360   0.244     0.237   0.101
TTRT2       0.770   0.550   0.326     0.480   0.085
BLMγ        0.960   0.887   0.676     0.758   0.268
RLMγ        0.985   0.954   0.838     0.281   0.495
BLMκ        0.660   0.416   0.261     0.413   0.076

JBboxplot   0.723   0.653   0.372     0.369   0.123
JBtrimmed   0.749   0.563   0.323     0.408   0.102
SWboxplot   0.607   0.564   0.243     0.221   0.068
SWtrimmed   0.777   0.623   0.325     0.379   0.108

Table 6: Power against asymmetric alternatives for n = 100

test        lnorm   exp     gamma     Burr    Weibull
            (2,1)           (2,2,1)           (2,1)
AD          1.000   1.000   0.998     0.999   0.611
DT          1.000   1.000   0.991     0.998   0.532
JB          1.000   1.000   0.995     0.999   0.567
JBU         1.000   1.000   0.990     0.998   0.508
LT          1.000   1.000   0.953     0.983   0.383

MCLR        0.988   0.957   0.632     0.647   0.179
RJB         1.000   1.000   0.978     0.996   0.441
SJdir       0.998   0.909   0.613     0.895   0.071
SW          1.000   1.000   1.000     1.000   0.790

MMRT1       1.000   1.000   0.989     0.998   0.535
MMRT2       1.000   1.000   0.989     0.998   0.545
TTRT1       1.000   0.996   0.877     0.949   0.366
TTRT2       1.000   0.998   0.950     0.988   0.319
BLMγ        1.000   1.000   1.000     1.000   0.870
RLMγ        1.000   1.000   1.000     0.707   0.973
BLMκ        0.998   0.925   0.713     0.928   0.126

JBboxplot   1.000   1.000   0.993     0.984   0.724
JBtrimmed   1.000   0.997   0.860     0.885   0.269
SWboxplot   1.000   1.000   0.986     0.962   0.596
SWtrimmed   1.000   1.000   0.989     0.986   0.758


For asymmetric alternatives such as the Weibull and exponential distributions, RLMγ is outstanding, although SW, AD, MMRT1, MMRT2, BLMγ, JBboxplot and SWtrimmed all perform well. Among the tests with good performance, RLMγ has the best robustness properties.

4.4. Catalytic isomerization: influence of truncation

The analysis of data from factorial-type designs, often present in chemometrics, typically includes fitting one or more polynomial models and carrying out testing on these models, in particular comparing models of different order. For a completely randomized design, we assume in general that an experiment has been run with L treatments defined by combinations of the levels of the factors, and that the responses can be modelled as in response surface modelling (see e.g. ref. [6]) as

Yi,j = µi + ϵi,j ,   i = 1, . . . , L,   j = 1, . . . , ni,   (4)

where Yi,j is the response from the jth replicate of treatment i, µi is the expected response from treatment i, E(ϵ) = 0 and V(ϵ) = σ²I. We shall refer to this as the full treatment model, which is considered in the prevalent literature with normal distributions. In several chemical applications, truncated normal distributions may arise, which can have a severe impact on statistical decisions if normality is enforced. For illustration, let us consider the kinetics of catalytic isomerization of n-pentane. [8] introduced tests for the importance of higher factors (later discussed in ref. [15]). In ref. [32] it was pointed out that we should be aware of the fact that the values yi, which are realizations of

y = γK2 / (1 + ∑i Ki x′i),

are not only positive but strictly greater than some positive constant (here Ki are adsorption equilibrium constants, xi are partial pressures, γ is a constant dependent on catalyst and temperature T, and i = 1, 2, 3 are indices for hydrogen, n-pentane and isopentane, respectively). To see this, look at their measured values in Table III of ref. [8] or consider their chemical meaning. Thus alongside the outlier problem discussed in ref. [27], there is a further issue concerning the distributional deviations from the F-distribution. The F distribution appears here as the distribution of the ratio of two mean squares, the mean square of lack of fit and the mean square of pure error, which is usually the most convenient way to compute a test statistic for the importance of higher factors in a response surface model (see e.g. ref. [6]).
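The ratio of mean squares just described can be sketched on synthetic data (the design, the response and the noise level below are illustrative assumptions, not Carr's measurements):

```python
# Sketch (illustrative data, not Carr's): the F statistic for higher-order
# terms as the ratio of lack-of-fit mean square to pure-error mean square,
# after fitting the full treatment model (4).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# L = 5 treatments, n_i = 3 replicates each; a hypothetical linear response.
levels = np.repeat(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]), 3)
y = 10.0 + 2.0 * levels + rng.normal(scale=0.5, size=levels.size)

# Pure error: variation of replicates around their treatment means.
treat_means = {x: y[levels == x].mean() for x in np.unique(levels)}
fitted_full = np.array([treat_means[x] for x in levels])
ss_pe = np.sum((y - fitted_full) ** 2)
df_pe = levels.size - len(treat_means)

# Lack of fit of a first-order (straight-line) model.
beta = np.polyfit(levels, y, deg=1)
fitted_lin = np.polyval(beta, levels)
ss_lof = np.sum((fitted_full - fitted_lin) ** 2)
df_lof = len(treat_means) - 2

F = (ss_lof / df_lof) / (ss_pe / df_pe)
p_value = stats.f.sf(F, df_lof, df_pe)  # valid only under normal errors
print(F, p_value)
```

The p-value in the last step is exactly where the normality assumption enters: if the errors are truncated normal rather than normal, the reference F distribution is wrong.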

[15] wrote that if we use s², the test of 2nd order parameters gives a test statistic of F = 2.87, which on 6 and 3 degrees of freedom gives a p-value of 0.208, suggesting that there is little justification for further interpreting the 2nd order model. However, already for truncation at constant C = 2, the p-value for an F-test of second order parameters is 0 for the above-mentioned F = 2.87 on 6 and 3 degrees of freedom (and the minimal observed value of yi in Table III of ref. [8] is 16.258). Here we consider F-test statistics based on the pure error estimate obtained from fitting model (4). Figure 2 shows the p-values for various truncation constants C > 0, concretely for C = 1 and C = 2. The truncation (censoring) constant C is given by the positive nature of the chemical variable y, which follows a truncated normal distribution. Truncation substantially increases the statistical significance of higher order factors. It is clear that more caution should be taken when distinguishing between normality and the truncated normal distribution. Figures 3 and 4 show that the TTRT1 and BLMκ tests serve as powerful tests against truncated normal distributions.
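The power curves of Figures 3 and 4 can be reproduced in outline as follows; the sketch uses the Shapiro-Wilk test as an illustrative stand-in (the figures themselves report JB, SW, TTRT1 and BLMκ):

```python
# Sketch: Monte Carlo power of a normality test against a left-truncated
# standard normal, as a function of the truncation point.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def truncated_power(a, n=20, n_sim=2000, alpha=0.05):
    """a is the left truncation point of N(0,1), in scipy's convention."""
    reject = 0
    for _ in range(n_sim):
        x = stats.truncnorm.rvs(a, np.inf, size=n, random_state=rng)
        if stats.shapiro(x).pvalue < alpha:
            reject += 1
    return reject / n_sim

# Power grows as the truncation point moves into the bulk of the mass.
print(truncated_power(a=0.5), truncated_power(a=2.0))
```

Sweeping `a` over a grid and plotting the resulting rejection rates gives curves of the shape shown in Figures 3 and 4.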

5. Graphical methods

Figure 2: p-values as a function of the test statistic for the F(6,8) distribution and its truncated versions with c = 1 and c = 2.

Figure 3: Power of selected tests (JB, SW, TTRT1, BLMκ) against the truncated normal distribution as a function of the truncation point, for n = 20.

Figure 4: Power of selected tests (JB, SW, TTRT1, BLMκ) against the truncated normal distribution as a function of the truncation point, for n = 100.

There are two main approaches to assessing normality: testing procedures and graphical methods. Ideally, both approaches should be used, to give a more complete understanding of whether the normality assumption is valid. Commonly used graphical methods for assessing normality are histograms, boxplots, probability-probability (PP) plots and quantile-quantile (QQ) plots, among others, the last of these being the most popular. The QQ plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. The QQ plot is generally a more powerful approach to comparing two probability distributions than the common technique of comparing histograms of the two samples (see ref. [41]). When constructing a normal QQ plot, the observed data are usually standardized, sorted in increasing order, and finally plotted against the quantiles of the standard normal distribution. If the data come from a normal distribution, then the ordered standardized observations should lie closely along the 45 degree (x = y) line.
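The construction just described — standardize, sort, plot against standard normal quantiles — can be sketched as follows (the plotting positions (i − 0.5)/n are one common convention and are an assumption here):

```python
# Sketch: the coordinates of a standard normal QQ plot.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=50)

z = np.sort((x - x.mean()) / x.std(ddof=1))        # standardized, ordered
probs = (np.arange(1, x.size + 1) - 0.5) / x.size  # plotting positions
theoretical = stats.norm.ppf(probs)                # N(0,1) quantiles

# For normal data the points (theoretical, z) lie near the line y = x, so
# their correlation is close to 1.
print(np.corrcoef(theoretical, z)[0, 1])
```

Plotting `z` against `theoretical` together with the 45-degree line gives the usual normal QQ plot.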

One of the main disadvantages of the classical graphical approaches is high sensitivity to outliers (see ref. [24]). Hence, Gel et al. (2005) [13] proposed a modified version of the QQ plot that consists of a robust standardization of the observed data. The authors used the median as a robust measure of location, and the median absolute deviation (MAD) or the average absolute deviation from the sample median (MAAD or J) as a robust measure of scale, because these measures are significantly less sensitive to outliers than the mean and standard deviation (refs. [13], [17]). The authors called this modified QQ plot a robust QQ plot, or simply an RQQ plot. Their results show that in many practical situations the RQQ plots provide clearer insight into the possible causes of non-normality than the usual QQ plots.

Based on our approach to testing normality, we can modify the classical version of the QQ plot by using the trimmed mean as a robust measure of location and the trim-trim standard deviation as a robust measure of scale. The trimmed mean is one of the location functionals relevant to this paper. The trim-trim standard deviation is a special case of the moments Mj(r, T(Fn, s)) defined above. For our modified QQ plot, Mj(r, T(Fn, s)) is concretely defined as

M(r, T(2)(s)) = [ (n − 2r)^(−1) ∑_{m=1+r}^{n−r} (X_{m:n} − T(2)(s))² ]^(1/2),   (5)

16

where s = r = 0.05n. We call this modified QQ plot an RT-QQ plot.

Now we illustrate some advantages of the RT-QQ plot. For the purpose of comparison with the RQQ plot we use the p-location outlier models (see ref. [3]), which simulate the presence of outliers in datasets.
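A minimal sketch of the RT-QQ standardization, assuming the trimmed mean as T(2)(s) and the trim-trim standard deviation of (5) with s = r = 0.05n (the trimming conventions below are our reading of the definitions above):

```python
# Sketch: robust standardization for an RT-QQ plot, using the trimmed mean
# as location and a trimmed second moment (5) as scale.
import numpy as np

def rt_standardize(x, prop=0.05):
    """Return the ordered observations, robustly standardized."""
    x = np.sort(x)
    n = x.size
    r = int(prop * n)
    core = x[r:n - r] if r > 0 else x  # drop r smallest and r largest
    t_loc = core.mean()                # trimmed mean, playing T(2)(s)
    # trim-trim standard deviation: trimmed second moment about t_loc
    t_scale = np.sqrt(np.mean((core - t_loc) ** 2))
    return (x - t_loc) / t_scale

rng = np.random.default_rng(6)
x = np.append(rng.standard_normal(40), 9.0)  # one gross outlier
z = rt_standardize(x)
print(z[-1])  # the outlier stays far from the bulk after standardization
```

Because the outlier influences neither the location nor the scale estimate, it is pushed even farther from the 45-degree line, which is precisely the behaviour illustrated in Figures 5 and 6.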

Figure 5 illustrates the QQ, RQQ and RT-QQ plots for simulated data with one outlier and small sample size n = 20. In this case we consider the p-location outlier model introduced by [3]: we take X1, . . . , Xn−p to be i.i.d. (independent and identically distributed) from N(0, 1) and Xn−p+1, . . . , Xn to be i.i.d. from N(λ, 1), where n = 20, p = 1 and λ = 5. All illustrated QQ plots show one outlier in the right tail. Moreover, the usual normal QQ plot and the RQQ plots show a possible outlier in the left tail. In contrast, only in the RT-QQ plot do the standardized observations follow the 45-degree line in the left tail. Similarly, the presence of the right-tail outlier is shown more clearly in the RT-QQ plot, where this outlier is farther from the 45-degree line.

Figure 6 illustrates the QQ, RQQ and RT-QQ plots for a symmetric outlier model and large sample size n = 100, where two outliers in each tail were simulated. Here we consider the outlier model where X1, . . . , Xp/2 are i.i.d. from N(λ1, 1), Xp/2+1, . . . , Xn−p/2 are i.i.d. from N(0, 1) and Xn−p/2+1, . . . , Xn are i.i.d. from N(λ2, 1), where n = 100, p = 4, λ1 = −5 and λ2 = 5. As shown in Figure 6, all illustrated QQ plots show two outliers in both tails. However, the outliers are farther from the 45-degree line in the RT-QQ plot, which not only makes them easier to detect, but also makes it clearer that these observations should be disregarded when assessing the normality of the central parts of the distribution.

6. Discussion and conclusions

6.1. Omnibus and directed tests

A test for normality is directed against a class of alternatives if it is sensitive to alternatives from that class, but not sensitive to alternatives from other classes. Typical examples are tests that are directed towards skew or kurtosis alternatives, including the robustified Lin–Mudholkar tests. Omnibus tests, including the Shapiro-Wilk and robustified Jarque-Bera tests, are sensitive to all alternatives.

If the experimenter knows that only some types of non-normality can occur or are of concern for a particular inference procedure, it is arguably preferable to use a directed test for normality. As an example, Student's t-test is sensitive to skewness but relatively robust against heavy tails, and thus it is reasonable to use a test for normality that is directed towards skew alternatives before applying the t-test. Using a directed test for normality has the benefit of giving higher power against "dangerous" alternatives and lower power against alternatives that are less "dangerous", meaning that we are less likely to reject normality because of deviations from normality that won't affect the performance of our inferential procedure.

On the other hand, it can sometimes be difficult to know which type of non-normality is of concern. Often, such knowledge is based on detailed analysis of a test or a statistical method. A rough rule of thumb states that inference about means is sensitive to skewness and inference about variances is sensitive to kurtosis, but this is not always valid. When the experimenter is unsure about which type of non-normality to worry about, it is better to use an omnibus test (or perhaps multiple directed tests).

Figure 5: QQ (top left), RQQ with median and MAD (top right), RT-QQ (bottom left) and RQQ with median and MAAD (bottom right) plots of the simulated data with one outlier and small sample size n = 20, standardized sample observations.

Figure 6: QQ (top left), RQQ with median and MAD (top right), RT-QQ (bottom left) and RQQ with median and MAAD (bottom right) plots of the simulated data with the symmetric outlier model and large sample size n = 100, standardized sample observations.


6.2. Conclusions based on simulations

In Section 4 we presented a large simulation study, comparing both power and robustness of several tests for normality. No unambiguous conclusions can be drawn, in the sense that no test can be recommended above all others in all settings. This is an inherent problem when comparing tests for normality: there simply is no uniformly most powerful test for normality. As discussed in Section 4.1, it is however possible to make recommendations for different settings. Based on the simulation results, we make the following recommendations.

For symmetric heavy-tailed alternatives, JBtrimmed and RLMγ are the best robust tests for small sample sizes (n = 20). For large sample sizes (n = 100), TTRT1 is preferable when a robust test is desired.

When testing against symmetric short-tailed alternatives, SWboxplot is the best robust test.

For testing against asymmetric alternatives, RLMγ can be recommended over all other tests, regardless of whether robustness is of interest or not. It has good robustness properties and excellent power against varying asymmetric distributions.

7. Appendix

Details regarding robustified Jarque–Bera tests

The tests mentioned in Proposition 1 as special cases of the general RT class defined in (1) are the following.

i) The classical Jarque-Bera test is a special case of the RT class, defined as

JB = (n/6) (µ3 / µ2^(3/2))² + (n/24) (µ4 / µ2² − 3)².

As pointed out by several authors (see for example ref. [38]), the classical JB test behaves well in comparison with some other tests for normality if the alternatives belong to the Pearson family. However, the JB test behaves very badly for distributions with short tails and bimodal shape, and is sometimes even biased (see ref. [35]).
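The JB formula above can be evaluated directly from the central sample moments; a sketch, checked against the library implementation of the same statistic:

```python
# Sketch: the classical Jarque-Bera statistic computed from the central
# sample moments, exactly as in the formula above.
import numpy as np
from scipy import stats

def jb_stat(x):
    mu = x.mean()
    m2 = np.mean((x - mu) ** 2)
    m3 = np.mean((x - mu) ** 3)
    m4 = np.mean((x - mu) ** 4)
    n = x.size
    return n / 6 * (m3 / m2 ** 1.5) ** 2 + n / 24 * (m4 / m2 ** 2 - 3) ** 2

rng = np.random.default_rng(7)
x = rng.standard_normal(200)
# Agrees with scipy's implementation of the same statistic.
print(jb_stat(x), stats.jarque_bera(x).statistic)
```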

ii) The test of Urzua (see ref. [38]) is a special case of the RT class test without trimming, defined as

JBU = [(n+1)(n+3) / (6(n−2))] (µ3 / µ2^(3/2))² + [(n+1)²(n+3)(n+5) / (24 n(n−2)(n−3))] (µ4 / µ2² − 3)².

iii) The robust Jarque-Bera test (see ref. [11]) is a special case of the RT class test without trimming, defined as

RJB = (n/C1) (µ3 / Jn³)² + (n/C2) (µ4 / Jn⁴ − 3)².

iv) Geary's test a (see ref. [10]), originally denoted w′n, is a special case of the RT class test without trimming, defined as

a = (1/n) ∑_{i=1}^{n} |Xi − X̄| / √m2.


v) Uthoff's test U (see ref. [39]) is a special case of the RT class test without trimming, defined as

U = (1/n) ∑_{i=1}^{n} |Xi − Mn| / √m2.

vi) The skewness test b1 is a special case of the RT class test without trimming, defined as

b1 = (n/6) (µ3 / µ2^(3/2))².

vii) The kurtosis test √b2 is a special case of the RT class test without trimming, defined as

√b2 = (n/24) (µ4 / µ2² − 3)².

viii) The SJ test (see ref. [12]) is a special case of the RT class test without trimming, defined as

SJ = √m2 / [ (1/n) ∑_{i=1}^{n} |Xi − Mn| ].

Properties of the robustified Lin–Mudholkar tests

As we wish to test the composite hypothesis of normality, that is, to test for normality without specifying the parameters µ and σ², reasonable test statistics should be invariant under linear transformations.

Theorem 1. The test statistics Z2,R and Z3,R are scale and location invariant.

Proof. Let X = (X1, . . . , Xn) be a vector of i.i.d. random variables. Then for (a, b) ∈ R², T(k)(aX + b) = aT(k)(X) + b for k ∈ {0, 1, 2, 3}. Thus Mj(aX + b) = aʲ Mj(X). The test statistics are therefore invariant under shifts in location. Because the numerators and denominators of γR, κR and λR are properly scaled, they are also invariant under scaling.
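The invariance in Theorem 1 can be checked numerically. Here the ordinary sample skewness stands in for Z2,R — an illustrative substitute, not the actual statistic, but it has the same scaling of numerator and denominator:

```python
# Sketch: numerical check of location-scale invariance for a properly
# scaled moment ratio (sample skewness as a stand-in for Z_{2,R}).
import numpy as np

def skewness(x):
    m2 = np.mean((x - x.mean()) ** 2)
    m3 = np.mean((x - x.mean()) ** 3)
    return m3 / m2 ** 1.5

rng = np.random.default_rng(10)
x = rng.standard_normal(50)
a, b = 2.5, -7.0
print(np.isclose(skewness(x), skewness(a * x + b)))  # prints True
```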

Next, we show that the null distributions of the proposed test statistics are asymptotically normal. We apply Fisher's arctanh transformation, as this results in faster convergence to the asymptotic distribution.

Theorem 2. Let X1, . . . , Xn be i.i.d. normal random variables. Then

√n arctanh(Z2,R) is asymptotically N(0, V2) and √n arctanh(Z3,R) is asymptotically N(0, V3),

where the constants V2 and V3 depend on the parameters r1, . . . , r6 and s1, . . . , s6.


Proof. It follows from ref. [34] that for r = ⌊βn⌋ with β ∈ [0, 0.5], Mj = (n − 2r)^(−1) ∑_{i=1}^{n} (Xi − µ)ʲ is asymptotically normal with mean µj for j ∈ {3, 4}. But

(n − 2r)^(−1) ∑_{i=1}^{n} (Xi − T(k))ʲ = (n − 2r)^(−1) ∑_{i=1}^{n} (Xi − µ + µ − T(k))ʲ
                                       = (n − 2r)^(−1) ∑_{i=1}^{n} (Xi − µ)ʲ + (n − 2r)^(−1) Rn,

where (n − 2r)^(−1) Rn = ∑_{ℓ1,ℓ2} (Xi − µ)^{ℓ1} (µ − T(k))^{ℓ2} → 0 in probability, since T(k) is consistent. By Slutsky's lemma, Mj is asymptotically normal for k ∈ {0, 1, 2, 3}. Using the same lemma again, Z2,R and Z3,R are asymptotically normal with mean 0. The result now follows by applying the delta method.

The asymptotic distributions are good approximations of the finite-sample null distributions of the test statistics even for moderate sample sizes. Usually n ≥ 50 is required to get a good approximation, but in some cases, such as the bootstrap statistics BLMγ and BLMκ, n = 15 is sufficient to get a good approximation of the null distribution.
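When the normal approximation is doubtful at small n, null critical values can instead be simulated directly. A sketch, using the sample skewness as an illustrative statistic in place of the arctanh-transformed Z statistics:

```python
# Sketch: Monte Carlo critical value of a normality test statistic under
# the null, as an alternative to the asymptotic normal approximation.
import numpy as np

def skewness(x):
    m2 = np.mean((x - x.mean()) ** 2)
    return np.mean((x - x.mean()) ** 3) / m2 ** 1.5

rng = np.random.default_rng(11)
n, n_sim = 15, 5000
null_stats = np.array([abs(skewness(rng.standard_normal(n)))
                       for _ in range(n_sim)])
crit = np.quantile(null_stats, 0.95)  # reject |skewness| above this value
print(crit)
```

The same recipe applies to any of the RT-class statistics: simulate under N(0, 1), take the relevant quantile, and use it as the finite-sample critical value.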

Corollary 1. Under the null hypothesis, √n arctanh(BLMγ) is asymptotically N(0, 5/2) and √n arctanh(BLMκ) is asymptotically N(0, 3).

For symmetric alternatives, the Mj estimators are consistent estimators of the corresponding population central moments. Therefore the asymptotic mean of Z3,R is ρ3. Hence, the Z3,R tests are consistent against kurtotic alternatives. For asymmetric alternatives, the asymptotic means of the test statistics are in general not possible to express as functions of ρ2 and ρ3, although they can be computed explicitly for some choices of the parameters r1, . . . , r6 and s1, . . . , s6. The asymptotic means are however generally not 0, meaning that the Z2,R tests are consistent against most skew distributions.

Similarly, the asymptotic variances can only be obtained for some choices of the parameters. Assuming that the necessary moments exist, the asymptotic distributions of Z2,R and Z3,R under alternative distributions are normal.

Acknowledgement

We thank the Editor and Reviewers, whose insightful comments helped us to improve the paper considerably.

References

[1] P.C. Alvarez-Esteban, E. del Barrio, J.A. Cuesta-Albertos, C. Matran, Assessing when a sample is mostly normal, Computational Statistics & Data Analysis, 54 (2010) 2914–2925.

[2] T.W. Anderson, D.A. Darling, A test for goodness of fit, Journal of the American Statistical Association, 49 (1954) 765–769.

[3] N. Balakrishnan, Permanents, Order Statistics, Outliers, and Robustness, Revista Matematica Complutense, 20 (2007) 7–107.

[4] P.J. Bickel, E.L. Lehmann, Descriptive Statistics for Nonparametric Models II. Location, The Annals of Statistics, 3 (1975) 1045–1069.

[5] R.G. Brereton, Applied Chemometrics for Scientists, Wiley, 2007.

[6] G.E.P. Box, N.R. Draper, Response Surfaces, Mixtures, and Ridge Analyses, 2nd ed., Wiley, 2007.

[7] G. Brys, M. Hubert, A. Struyf, Goodness-of-fit tests based on a robust measure of skewness, Computational Statistics, 23 (2008) 429–442.

[8] N.L. Carr, Kinetics of Catalytic Isomerization of n-Pentane, Ind. Eng. Chem., 52 (1960) 391–396.

[9] R. D'Agostino, E. Pearson, Tests for departures from normality. Empirical results for the distribution of √b1 and b2, Biometrika, 60 (1973) 613–622.

[10] R.C. Geary, The ratio of the mean deviation to the standard deviation as a test of normality, Biometrika, 27 (1935) 310–332.

[11] Y.R. Gel, J.L. Gastwirth, A robust modification of the Jarque-Bera test of normality, Economics Letters, 99 (2008) 30–32.

[12] Y.R. Gel, W. Miao, J.L. Gastwirth, Robust directed tests of normality against heavy-tailed alternatives, Computational Statistics & Data Analysis, 51 (2007) 2734–2746.

[13] Y.R. Gel, W. Miao, J.L. Gastwirth, The importance of checking the assumptions underlying statistical analysis: graphical methods for assessing normality, Jurimetrics, 45 (2005) 3–29.

[14] I. Gijbels, M. Hubert, Robust and Nonparametric Statistical Methods, in: S. Brown, R. Tauler, R. Walczak (Eds.), Comprehensive Chemometrics, Elsevier, Oxford, pp. 189–211.

[15] S.G. Gilmour, L.A. Trinca, Optimum design of experiments for statistical inference, Journal of the Royal Statistical Society: Series C (Applied Statistics), 61 (2012) 345–401.

[16] P. Hall, The Bootstrap and Edgeworth Expansion, Springer, 1992.

[17] W. Hui, Y.R. Gel, J.L. Gastwirth, lawstat: An R Package for Law, Public Policy and Biostatistics, Journal of Statistical Software, 28 (2008) 1–26.

[18] C. Jarque, A. Bera, Efficient tests for normality, heteroskedasticity and serial independence of regression residuals: Monte Carlo evidence, Economics Letters, 7 (1981) 313–318.

[19] P. Jordanova, J. Dusek, M. Stehlík, Modeling methane emission by the infinite moving average process, Chemometrics and Intelligent Laboratory Systems, 122 (2013) 40–49.

[20] A.M. Kagan, Y.V. Linnik, C.R. Rao, Characterization Problems in Mathematical Statistics, Wiley, 1973.

[21] T. Li, Y. Huang, W. Zhang, Ch. Song, CH4MODwetland: a biogeophysical model for simulating methane emissions from natural wetlands, Ecological Modelling, 221 (2010) 666–680.

[22] H. Lilliefors, On the Kolmogorov-Smirnov test for normality with mean and variance unknown, Journal of the American Statistical Association, 62 (1967) 399–402.

[23] C.-C. Lin, G.S. Mudholkar, A simple test for normality against asymmetric alternatives, Biometrika, 67 (1980) 455–461.

[24] J.I. Marden, Positions and QQ Plots, Statistical Science, 19 (2004) 606–614.

[25] J. Miller, J. Miller, Statistics and Chemometrics for Analytical Chemistry, 5th ed., Prentice Hall, 2005.

[26] G.S. Mudholkar, C.E. Marchetti, C.T. Lin, Independence characterizations and testing normality against restricted skewness-kurtosis alternatives, Journal of Statistical Planning and Inference, 104 (2002) 485–501.

[27] W.G. Müller, M. Stehlík, Discussion on the paper "Optimum design of experiments for statistical inference" by Steven G. Gilmour and Luzia A. Trinca, Journal of the Royal Statistical Society: Series C (Applied Statistics), 61 (2012) 369–401.

[28] G. Nürnberg, D. Rasch, The influence of different shapes of distributions with the same first four moments on robustness, in: D. Rasch, M.L. Tiku (Eds.), Robustness of Statistical Methods and Nonparametric Statistics, 1983, pp. 83–84.

[29] J.M. Rodríguez-Díaz, T. Santos-Martín, M. Stehlík, H. Waldl, Filling and D-optimal designs for the correlated generalized exponential models, Chemometrics and Intelligent Laboratory Systems, 114 (2012) 10–18.

[30] P.J. Rousseeuw, M. Debruyne, S. Engelen, M. Hubert, Robustness and Outlier Detection in Chemometrics, Critical Reviews in Analytical Chemistry, 36 (2006) 221–242.

[31] S.S. Shapiro, M.B. Wilk, An analysis of variance test for normality, Biometrika, 52 (1965) 591–611.

[32] M. Stehlík, L. Střelec, Discussion on the paper "Optimum design of experiments for statistical inference" by Steven G. Gilmour and Luzia A. Trinca, Journal of the Royal Statistical Society: Series C (Applied Statistics), 61 (2012) 369–401.

[33] M. Stehlík, Z. Fabián, L. Střelec, Small sample robust testing for Normality against Pareto tails, Communications in Statistics - Simulation and Computation, 41 (2012) 1167–1194.

[34] S.M. Stigler, The Asymptotic Distribution of the Trimmed Mean, The Annals of Statistics, 1 (1973) 472–477.

[35] T. Thadewald, H. Büning, Jarque-Bera test and its competitors for testing normality - a power comparison, Journal of Applied Statistics, 34 (2007) 87–105.

[36] M. Thulin, On two simple tests for normality with high power, arXiv:1008.5319 (2010).

[37] M. Thulin, Tests for multivariate normality based on canonical correlations, arXiv:1108.2986 (2011).

[38] C.M. Urzúa, On the correct use of omnibus tests for normality, Economics Letters, 53 (1996) 247–251.

[39] V.A. Uthoff, The most powerful scale and location invariant test of the normal versus the double exponential, The Annals of Statistics, 1 (1973) 170–174.

[40] G.L. Vaghjiani, A.R. Ravishankara, New measurement of the rate coefficient for the reaction of OH with methane, Nature, 350 (1991) 406–409.

[41] H.C. Thode, Testing for Normality, Marcel Dekker, New York, 2002.