Addressing Linear Regression Models with Correlated
Regressors: Some Package Development in R
A thesis presented in candidature for the degree of Doctor of
Philosophy at the Bahauddin Zakariya University, Multan
SUBMITTED by
Muhammad Imdadullah
Roll No.: PHDS-11-08 Session: 2011-2016
SUPERVISED by
Dr. Muhammad Aslam
Department of Statistics
Bahauddin Zakariya University Multan, Pakistan.
July, 2016
Chapter 1
Introduction
The past decades have seen a great surge of activity in the general area of
regression modelling. Several factors have contributed to this activity, not the
least of which is the penetration and extensive adoption of computers and computing
software in statistical work.
The objective of multiple linear regression analysis is to estimate the relationship of
individual parameters of a dependency, but not of an interdependency, by assuming that
the dependent variable y and the independent variables X are linearly related to each
other (see Graybill, 1980; Johnston, 1963; Malinvaud, 1968). In regression, we try to
draw inferences such as (i) identifying the relative influence of the regressors, (ii)
prediction and/or estimation and (iii) selection of an appropriate set of variables for
the model. Similarly, one of the purposes of any regression model is to ascertain to
what extent the dependent variable can be predicted by the independent variables (also
known as regressors, predictors or explanatory variables in different applications). For
this purpose, R² (the coefficient of determination) is used to indicate the strength of
prediction, i.e. the goodness of fit of the regression model.
The fitting of linear regression models by the ordinary least squares (OLS) method
is the most widely used modeling procedure. For such models, a common but strong
assumption of the classical linear regression model (CLRM) is that there should be no
linear relationship among the regressors, i.e. there should be no collinearity. In other
words, the regressors should be orthogonal. In most applications of regression
analysis, however, the regressors are not orthogonal, which may lead to misleading or
erroneous inferences from the regression results, especially when the regressors
are strongly and linearly correlated with each other. This problem is known as
multicollinearity; it has adverse effects on the OLS estimates and makes it more
difficult to draw suitable inferences from the regression analysis and to interpret
the regression equation (see Ragnar, 1934; Mason et al., 1975; Gunst and Mason, 1977;
Gunst, 1983; Hawking and Pendleton, 1983).
Therefore, the interpretation of a multiple regression model depends on the assumption
that the regressors are not strongly or perfectly correlated. Usually, a regression
coefficient is interpreted as the change in the dependent variable due to a unit change
in the corresponding regressor while keeping all other regressors constant. However,
this interpretation does not remain valid when there is a strong linear relationship
among the regressors, or when the degree of multicollinearity is not ignorable, making
it impossible to estimate the unique effects of individual variables in the regression
model (Belsley, 1991; Belsley et al., 1980; Hoerl and Kennard, 1970a,b). The estimated
coefficients of a regression model with correlated regressors are sensitive to slight
changes in the data, and even to the inclusion or exclusion of variable(s) or
observation(s) in the equation.
1.1 Background
In experimental designs, variables (regressors) can be created that are orthogonal
(non-collinear) to each other and cause no problem. However, collinear data usually
arise and cause problems in different applications of linear regression such as
econometrics, technology, geophysics, social science, finance, oceanography and
other fields that rely on non-experimental (observational) data. Multicollinearity is
considered a lack of sufficient information in the sample data, which prevents
accurate estimation of the individual regression parameters. Generally,
multicollinearity is signalled if a pairwise correlation among regressors is above
0.80, or if R² is high and the F-test of the model is significant while the t-ratios of
the regression coefficients are non-significant.
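The informal signals listed above can be inspected with base R alone. The following is a minimal sketch on simulated data; the variable names and the simulation settings are assumptions made for illustration and are not taken from this work.

```r
# Simulated data (assumption): x2 is nearly a copy of x1, x3 is unrelated.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# Signal 1: a pairwise correlation above 0.80
r12 <- cor(x1, x2)

# Signals 2 and 3: R^2 and the overall F-test, to be read
# alongside the individual t-ratios of the coefficients
fit   <- lm(y ~ x1 + x2 + x3)
sfit  <- summary(fit)
fstat <- sfit$fstatistic
p_F   <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)

r12
sfit$r.squared
p_F
sfit$coefficients[, c("t value", "Pr(>|t|)")]
```

With regressors this strongly correlated, the t-ratios of x1 and x2 may be small even though the overall F-test is clearly significant; the exact values depend on the simulated sample.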
The problem of multicollinearity is extremely difficult to detect, as it is not a
specification error or modeling error that may be uncovered by, for example, exploring
the regression residuals. Rather, multicollinearity is a condition of deficient data
(Hadi and Chatterjee, 1988). That regressors in a model may be highly collinear is a
fact of life, and it makes it difficult to infer the separate influence of collinear
predictor variates on the response variate, because collinear regressors do not
provide information that is very different from that already inherent in the others.
The data collection method used, constraints on the fitted model, model specification
problems, an overdefined model and a common trend in time series data may be some of
the sources of multicollinearity (Belsley et al., 1980; Koutsoyiannis, 1977).
1.2 Motivation
Several methods for the detection of multicollinearity are present in the existing
literature, such as the variance inflation factor (VIF), high correlation among
regressors, high R², the condition number (CN) and the condition index (CI), among
many others, but there is no unique method of detection that measures or indicates
the existence of multicollinearity in data. Multicollinearity detection techniques are
available in most statistical software, such as SPSS, SAS, STATA, R, NCSS and S-PLUS;
however, none of these contains all, or even the majority, of the existing
collinearity diagnostic measures, though some provide the most widely used detection
techniques such as VIF, R², eigenvalues and CN. Moreover, widely accepted methods for
the detection of multicollinearity such as VIF, CN and R² have no definite threshold
value for indicating the existence of multicollinearity. The values of the indicators
(detection methods) are not comparable with each other, and their interpretation is
sometimes subjective.
For estimation of the coefficients when the data are collinear, different biased
regression techniques available in the literature are used, such as ridge regression
(RR), Liu regression (LR), modified ridge regression (MRR) and principal
component regression (PCR), but there are few software and packages that
provide these biased regression methods. Most of the available software that estimate
coefficients with a biased method use ridge regression, such as SAS, Statgraphics,
NCSS and some R packages such as "ridge", "bigRR" and "glmnet". There is no
statistical software or R package that provides the LR method as discussed by
Liu (1993), except an R package named "lrmest". Similarly, there is no
statistical software for testing the coefficients of the RR and LR except the "ridge"
R package for ridge regression and "lrmest" for the LR. The existing software and R
packages do perform estimation and testing of ridge or Liu regression
coefficients, but they either perform estimation and/or testing of coefficients
without scaling of regressors or compute results for population size (see Appendix A
for further detail about these software and R packages). Furthermore, none of the
existing software or packages computes RR or LR related properties/statistics, and
the computation of different biasing parameters of the RR and LR is not available
in the existing software or R packages.
All these deficiencies in the detection and remedy of multicollinearity, and in the
estimation and testing of multicollinear linear models, motivate us to develop better
detection and estimation technique(s) and also to build comprehensive statistical
packages in the R language that offer as many detection methods as possible, coupled
with estimation and testing of the coefficients of collinear models. The packages also
provide some relevant graphical representations (such as the ridge trace) of results
from the RR and LR. The computation of biasing parameters proposed by different
researchers, information criteria (AIC and BIC), the prediction sum of squares (PRESS)
and cross validation are also available in the R packages (namely lmridge and liureg)
constructed in this work. The collinearity detection results from the package mctest
also interpret or indicate the regressor(s) causing the collinearity problem.
1.3 Diagnosing Multicollinearity
Consider the linear regression model with two or more regressors,
y = Xβ + u,
where $E(u) = 0$ and $E(uu') = \sigma^2 I_n$. It is also assumed that the regressors
are collinear. The OLS estimate of the model and its Var-Cov matrix depend to a
great extent on the characteristics of the matrix $X'X$. Therefore, various numerical
and graphical methods have been developed for the detection/diagnosis of
multicollinearity; some of them were proposed by Belsley et al. (1980);
Farrar and Glauber (1967); Kendall (1957); Kumar (1975); Marquardt (1970),
among many others. Different techniques for the remedy of multicollinearity have been
developed that range from simple to more specialized methods for regularization (see
Næs and Idahl, 1998). Complete elimination of multicollinearity is not possible, but
the degree of multicollinearity can be reduced by adopting the RR, LR, PCR,
etc. Some widely used diagnostics for multicollinearity in the literature are VIF/
TOL, CN, CI, eigenvalues, R², Farrar & Glauber's test and Theil's measure. For
details and a list of diagnostic measures, see Section 3.5 of Chapter 3.
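Several of the diagnostics named above can be computed directly from the correlation matrix of the regressors. The sketch below uses simulated data and base R only; it is not the mctest implementation. It assumes one common convention in which CI_j = sqrt(λmax/λ_j) and CN is the largest CI (some texts define CN without the square root).

```r
# Simulated regressors (assumption): x2 is strongly related to x1.
set.seed(2)
n  <- 60
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

R <- cor(X)                       # correlation matrix of the regressors

# VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal element of solve(R)
VIF <- diag(solve(R))
TOL <- 1 / VIF                    # tolerance

# Eigenvalue-based measures (convention stated in the lead-in)
lambda <- eigen(R, symmetric = TRUE)$values
CI <- sqrt(max(lambda) / lambda)  # condition indices
CN <- max(CI)                     # condition number

round(rbind(VIF = VIF, TOL = TOL), 3)
round(lambda, 4)
round(CI, 2)
```

The diagonal-of-the-inverse shortcut gives the same VIF as regressing each regressor on the others, which is convenient when there are many regressors.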
1.4 Estimation
Under the usual assumptions of the CLRM, the OLS estimator (OLSE) is
considered to be the best choice. In the presence of multicollinearity
among the regressors, however, the OLSE becomes unstable, as it depends on the
characteristics of the matrix $X'X$. Various methods have been proposed to combat
multicollinearity; RR (Hoerl and Kennard, 1970a,b) and Liu
regression (Liu, 1993) are the most popular and widely used techniques.
1.4.1 Ridge Regression
The basic requirement for the OLS method is that $(X'X)^{-1}$ exists. There may be
two reasons why the inverse of $X'X$ does not exist: (i) p > n and (ii) collinearity.
The RR technique is one of the most popular and best performing alternatives to
the OLS method (Frank and Friedman, 1993). The RR procedure is based on the
matrix $(X'X + kI)$ instead of $X'X$ and is used in ill-conditioned situations in
which the matrix $X'X$ is close to singular.
The RR estimator (RRE) is given as $\beta_R = (X'X + kI)^{-1}X'y$, where k ≥ 0 is the
biasing or ridge parameter, also known as the shrinkage parameter. It is also assumed
that X and y are standardized so that $X'X$ is in correlation form and $X'y$ is the
vector of correlations of the dependent variable with each of the regressors.
For a suitable choice of k, the RR procedure provides estimators having smaller mean
square error (MSE) than the OLSE. However, the MSE of the ridge estimator is a
function of the unknown parameter k, so a very important statistical challenge in the
RR is to determine the optimal value of k. Adding a small constant to the diagonal
elements of the matrix $X'X$ improves the conditioning of the matrix, which is a
well-known result in numerical analysis, as this decreases its CN drastically (see
Vinod and Ullah, 1981; D'Ambra and Sarnacchiaro, 2010). For further details about the
RR, its properties and the working of the lmridge package, see Chapter 4.
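The effect of adding k to the diagonal of X′X can be seen numerically. The sketch below uses simulated data, standardizes X and y so that X′X is in correlation form, and compares the OLS and ridge solutions; the data, the value k = 0.1 and the helper name `cond` are assumptions made for illustration, and this is not the lmridge implementation.

```r
# Simulated collinear data (assumption)
set.seed(3)
n    <- 40
x1   <- rnorm(n)
x2   <- x1 + rnorm(n, sd = 0.05)
yraw <- 1 + x1 + x2 + rnorm(n)

# Centre and scale to unit length so that X'X is the correlation matrix
X <- scale(cbind(x1, x2)) / sqrt(n - 1)
y <- as.numeric(scale(yraw)) / sqrt(n - 1)

XtX <- crossprod(X)
Xty <- crossprod(X, y)   # correlations of y with each regressor
k   <- 0.1

# Ratio of largest to smallest eigenvalue (hypothetical helper)
cond <- function(A) {
  ev <- eigen(A, symmetric = TRUE)$values
  max(ev) / min(ev)
}

beta_ols   <- solve(XtX, Xty)
beta_ridge <- solve(XtX + k * diag(ncol(X)), Xty)

c(OLS = cond(XtX), ridge = cond(XtX + k * diag(ncol(X))))
cbind(OLS = drop(beta_ols), ridge = drop(beta_ridge))
```

The ridge coefficient vector is always shorter than the OLS one for k > 0, which is the shrinkage referred to in the text.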
1.4.2 The Liu Regression
Liu (1993) proposed a biased estimator that combines the advantages of the ridge
estimate $\beta_R$ and the Stein estimator (see Stein, 1956; James and Stein, 1961)
$\beta_s = c\hat{\beta}$, where 0 < c < 1 is a parameter and $\hat{\beta}$ is the
OLSE. The Liu estimator (LE) can be written as
$$\beta_d = (X'X + I_p)^{-1}(X'y + d\hat{\beta}),$$
where d is the Liu biasing parameter and $I_p$ is the identity matrix of order p × p.
The estimator $\beta_d$ was named the Liu estimator (LE) by Akdeniz and Kaciranlar
(1995) and Gruber (1998). The main interest in the LE is the selection of a suitable
d at which the MSE is minimum and the efficiency of the estimator improves compared
with other values of d. For the selection of d, and to overcome the problem of
collinearity in an effective manner, Liu provided some important methods together
with a numerical example. For further details about the LR, its properties and the
working of the liureg package, see Chapter 5.
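The Liu estimator formula above translates almost line by line into R. The following is a minimal sketch on simulated, centred data; the function name `liu_est` and the data are hypothetical, and this is not the liureg implementation.

```r
# Hypothetical helper: Liu estimate for a given biasing parameter d
liu_est <- function(X, y, d) {
  XtX      <- crossprod(X)
  beta_ols <- solve(XtX, crossprod(X, y))          # OLSE
  solve(XtX + diag(ncol(X)), crossprod(X, y) + d * beta_ols)
}

# Simulated collinear data (assumption), centred so no intercept is needed
set.seed(5)
n  <- 40
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
X  <- scale(cbind(x1, x2), scale = FALSE)
y  <- 1 + x1 + x2 + rnorm(n)
y  <- y - mean(y)

beta_ols <- solve(crossprod(X), crossprod(X, y))
cbind(OLS       = drop(beta_ols),
      Liu_d0.5  = drop(liu_est(X, y, 0.5)),
      Liu_d1    = drop(liu_est(X, y, 1)))
```

Setting d = 1 reproduces the OLSE, since X′y + β̂ = (X′X + I)β̂; values 0 < d < 1 shrink the estimate.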
1.5 Testing
For investigation of the individual coefficients in a linear but biased regression
model, ridge-based exact and non-exact t-type tests and an F-test can be used. The
exact t-statistic derived by Obenchain (1975) for the RR, with matrix G whose columns
are the normalized eigenvectors of $X'X$, is
$$T^{*} = \frac{\beta_{Rj} - b_j}{\sqrt{\widehat{\mathrm{var}}(\beta_{Rj} - b_j)}},$$
where j = 1, 2, ..., p, $\widehat{\mathrm{var}}(\beta_{Rj} - b_j)$ is an unbiased
estimator of the variance of the numerator in the above equation, and
$$b_j = g_j' \Delta G' \left[I - (X'X)^{-1} e_j' \left(e_j (X'X)^{-1} e_j'\right)^{-1} e_j\right] \beta(0),$$
where $g_j'$ is the jth row of G, $\Delta$ is the (p × p) diagonal matrix with jth
diagonal element $\delta_j = \lambda_j/(\lambda_j + k)$, and $e_j$ is the jth row of
the identity matrix. Halawa and El-Bassiouni (2000) proposed to tackle the problem of
testing $H_0: \beta_j = 0$ by considering a non-exact t-type test of the form
$$T = \frac{\beta_{Rj}}{\sqrt{S^2(\beta_{Rj})}},$$
where $\beta_{Rj}$ is the jth element of the RRE and $S^2(\beta_{Rj})$ is an estimate
of the variance of $\beta_{Rj}$ given by the jth diagonal element of the matrix
described in Section 4.6.4.
Similarly, for testing the hypothesis $H_0: \beta = \beta_0$, where $\beta_0$ is a
vector of fixed values, the F-statistic for significance testing of the ORR estimator
$\beta_R$ with $E(\beta_R) = ZX\beta$ and an estimate of $\mathrm{Cov}(\beta_R)$ is
$$F = \frac{1}{p}(\beta_R - ZX\beta)' \left(\mathrm{Cov}(\beta_R)\right)^{-1} (\beta_R - ZX\beta).$$
Our developed packages for the detection and remedy of collinearity among regressors
can be used for teaching purposes, to make students aware of the concept and
existence of collinearity. The variety of collinearity indicators bundled in the
mctest package gives researchers an opportunity to check their work routinely and
also enables them to gain experience with the various collinearity indicators in
a self-sufficient way. Similarly, the lmridge and liureg packages not only estimate
and test coefficients for a vector of biasing parameters but also compute
different ridge and Liu related statistics with graphical outputs after scaling the
regressors. The developed packages offer different scaling methods for the
regressors, such as the scaling described by Belsley et al. (1980) and Draper
and Smith (1998), standardization, centering, and no scaling at all.
The following diagram shows how a user will interact with these newly developed
packages:
Figure 1.1: Flow diagram showing how to deal with collinear data using these packages
1.6 Main Contribution
The contribution of this work is twofold. Our first and main contribution is the
development of three R packages, (i) mctest, (ii) lmridge and (iii) liureg, for
various multicollinearity diagnostic tests and for the implementation of the biased
methods RR and LR, respectively. The mctest package not only computes 16 existing and
widely used collinearity diagnostics and 2 proposed diagnostics but also interprets or
indicates the regressors causing the problem of collinearity, see Section 3.5 of
Chapter 3. The lmridge and liureg packages not only estimate the respective
regression coefficients but also compute different related properties. These packages
also have functions for graphical representation of different measures. The lmridge
and liureg packages compute 22 biasing parameters for the RR and 4 for the LR,
proposed by different authors, along with related residuals, predicted & fitted
values, R² & adj-R², F-test, testing of coefficients, different model selection
criteria, MSE, bias, EFD, and ridge related plots. For further details, see Chapter 4,
Section 4.10 and Chapter 5, Section 5.5.
The other contribution of the present study is a listing of the available ridge and
Liu related properties, with a concise discussion of their merits/demerits, their
application and use, and their possible resulting outputs. Similarly, popular and
widely used collinearity diagnostic measures are listed, together with their
merits/demerits, objections raised by different authors, the application and use of
collinearity tests, and a discussion of their possible resulting outputs and
diagnostic capabilities. Our proposed collinearity diagnostics IND1 and IND2 are
empirically verified in Imdadullah et al. (2016).
Our developed packages (mctest, lmridge and liureg) can be downloaded from
the Comprehensive R Archive Network (CRAN) at
https://cran.r-project.org/web/packages/available_packages_by_name.html, or they can
be obtained by emailing the author at [email protected].
The results/output of all our developed packages are consistent with existing
software/packages and with outputs in textbooks. The differences that exist are only
due to, for example, the use of the population standard deviation and of the degrees
of freedom (df) in the estimate of the residual mean square.
In the next chapter, we discuss what an R package is, how it is developed, and an
example of a ridge related package.
Chapter 2
R Package Development: Some
Preliminaries
2.1 Introduction
In many situations, data may be plagued with multicollinearity. Therefore, statistical
software needs routines for the detection of multicollinearity. Most of the available
statistical software and R packages detect collinearity through VIF or CI/CN
measures, though many other methods exist in the literature. Similarly, few of the
statistical and econometric software contain a remedy for multicollinearity using
the RR.
In the present work, we developed three packages in the R language: one for
collinearity detection and two for its remedy. In this chapter, we present a
comprehensive review of such package development in the R language using the S3 class
system.
R is a free object oriented programming (scripting) language and environment, used
for statistical data manipulation and analysis. The success of the R Project (R Core
Team, 2015) is based on the R packaging system, which allows easy, transparent and
cross-platform (UNIX, FreeBSD, Linux, Windows, etc.) extension of the R
base system. An R package can be conceived as the software equivalent of a research
paper: a research paper follows a de facto standard and is divided into sections
such as introduction, literature review, research methodology, main results or
findings and finally application or simulation of results, in order to communicate
the research to other researchers (Leisch, 2008). Similarly, the R package system has
some standards to follow, and it provides a communication channel for an author's
work that helps to organize and administer it in a better way.
R packages can be considered a compendium of a variety of resources, such
as R functions (source code), data sets and documentation, that are all used to
distribute statistical methodology to colleagues and co-workers. The R packaging
system permits people to contribute to R, and also serves as a convenient way to
preserve private functions.
The functions and objects in a package can be installed on a machine and can easily
be loaded. R packages contain methods that make use of new or existing statistical
techniques and provide tools for graphics, data exploration, complex numerical
techniques and working with big data. Simulated, existing and research data sets can
be shared through packages, and R packages also support reproducibility.
During the R package development procedure, the files (source, data and help files,
etc.) are organized in a standardized way in a single compressed file, and these
(compressed) files are used to build an installed version of the package being
developed in another directory. The R package development workflow is to make
some changes, build and install the package, unload and reload the package, and
then test the changes as necessary. Using the Build and Reload commands, all these
steps are performed in sequence to fully rebuild a package. Technically, this means
that, in a package, R objects can be represented efficiently, lazy loading of large
object(s) can be enabled, the functions that are made publicly available can be
controlled (namespaces), and code written in other languages (C/C++ or
FORTRAN) can be included. Similarly, help files about functions and data sets, which
can be in different forms, are included along with other supporting files; these
files cannot be used directly in an R session, but they make it possible to check
that the developed R package works as asserted (see Leisch, 2008; R Core Team, 2015;
Hadley, 2015, etc.).
After successful installation, R packages can be loaded and unloaded dynamically at
run time in an R session by passing their name as an argument to the library() or
require() function; hence they occupy computer memory only when they are actually
used. Loading (attaching) a package refers to the name of a subdirectory (sub-folder)
in a library directory. Inside this package directory, files and subdirectories are
used by the R evaluator and utilities. Installation and updating of R packages can be
performed from inside or outside the R environment.
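A short sketch of attaching and detaching a package at run time follows. It uses the splines package only because it ships with every R installation; the non-existent package name is a deliberate placeholder.

```r
# Is the package already on the search path?
was_attached <- "package:splines" %in% search()

# library() attaches the package and stops with an error if it is missing
library(splines)
attached <- "package:splines" %in% search()

# require() returns FALSE (with a warning) instead of stopping, which is
# convenient inside functions; the package name below does not exist
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)

# Remove the package from the search path again if this code attached it
if (!was_attached) detach("package:splines")

attached
ok
```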
The R packaging system has different tools that come with R itself. They are used for
software installation and validation: to check the existence of R documentation
(manual or help files), to check that the documentation is technically in sync with
the code, to spot common errors, and to make sure that the example(s) provided in the
R documentation actually run. These tools also create the compressed archive (.tar.gz
and/or .zip) file and other files needed to document the package, so that it can be
shared or reused easily.
Usually, these tools are accessed from a command prompt (command shell), such as
> R CMD operation
where operation is one of the R shell tools.
When executing these R tools, one should ensure that they have
access to information about the local installation of R. In developing one's own R
package, the most important operation is installation, that is, taking the source
package and making it available as an installed R package.
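As a rough illustration of the standardized source layout, the base function utils::package.skeleton() can generate a package skeleton from existing objects. The sketch below writes to a temporary directory; the package name "toyridge" and the function `ridge_toy` are hypothetical and are not part of the packages developed in this work.

```r
# Hypothetical toy function to be placed in the skeleton package
env <- new.env()
assign("ridge_toy", function(X, y, k = 0) {
  solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, y))
}, envir = env)

# Create the skeleton in a temporary directory
pkg_root <- tempfile("pkgdev")
dir.create(pkg_root)
suppressMessages(
  package.skeleton(name = "toyridge", list = "ridge_toy",
                   environment = env, path = pkg_root)
)

# Standard layout: DESCRIPTION, NAMESPACE, R/ (code) and man/ (help files)
list.files(file.path(pkg_root, "toyridge"))

# After editing the generated files, the shell tools would typically be
# run from the directory containing the package, for example:
#   R CMD build toyridge
#   R CMD check <the tarball produced by the build step>
#   R CMD INSTALL <the tarball produced by the build step>
```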
For building R packages on Windows operating systems, the main prerequisites are:
1. GNU software development tools (the Rtools utility), which include a C/C++ compiler
2. LaTeX (the MiKTeX distribution)
3. Microsoft HTML Help Workshop, for creating R manuals/documentation and
vignettes
After successful installation of these required tools, the PATH variable is edited (or
checked) to make sure that the operating system finds the R commands first when
creating R packages. Depending on the installation, the directory path may be
PATH = C:\Rtools\bin;
C:\Rtools\MinGW\bin;
C:\Program Files\HTML Help Workshop;
C:\Program Files\R\R-3.2.3\bin;
C:\Program Files\MiKTeX 2.9\miktex\bin\x64;
2.2 R Code for Linear Ridge Regression
For a basic understanding of building an R package, we code (write) a simple
function in R that computes the linear ridge estimate and has an outcome similar to
the lm.ridge() function of the MASS package (Venables and Ripley, 2002a). We name
this exemplary package lmridge; it computes ridge coefficients and their significance
tests for a single biasing parameter (k), using the scaling scheme described in
equation (4.3) of Chapter 4. For R package development and R documentation, see
Leisch (2008), R Core Team (2015), Hadley (2015), etc.
Consider a standard linear regression model with the issue of multicollinearity,
$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I) \tag{2.1}$$
For a given X (design matrix) and y (response vector), the linear ridge estimate
(Hoerl and Kennard, 1970b) is
$$\beta_R = (X'X + kI_p)^{-1}X'y \tag{2.2}$$
with Var-Cov matrix
$$\mathrm{Cov}(\beta_R) = \sigma^2 (X'X + kI_p)^{-1} X'X (X'X + kI_p)^{-1} \tag{2.3}$$
To compute $\beta_R$, we use the singular value decomposition (SVD) of X, which is
numerically most stable (Seber and Lee, 2003). A few functions from the package
lmridge are provided here as an illustration of R package development. A minimal R
function for estimation of the linear RR coefficients is
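lmridgeEst <- function(formula, data, K = 0, ...) {
  # A missing biasing parameter is treated as K = 0
  if (is.null(K)) K <- 0
  mf <- model.frame(formula = formula, data = data)
  mt <- attr(mf, "terms")
  x <- model.matrix(mt, data = mf)
  y <- model.response(mf)
  p <- ncol(x)
  n <- nrow(x)
  if (Inter <- attr(mt, "intercept")) {
    # Drop the intercept column, then centre regressors and response
    Xm <- colMeans(x[, -Inter, drop = FALSE])
    Ym <- mean(y)
    Y <- y - Ym
    p <- p - 1
    X <- x[, -Inter, drop = FALSE] - rep(Xm, rep(n, p))
  } else {
    Xm <- colMeans(x)
    Ym <- mean(y)
    Y <- y - Ym
    X <- x - rep(Xm, rep(n, p))
  }
  # Scale each centred column to unit length (correlation form)
  Xscale <- (drop(rep(1 / (n - 1), n) %*% X^2)^0.5) * sqrt(n - 1)
  X <- X / rep(Xscale, rep(n, p))
  # Ridge coefficients through the SVD of the scaled regressors
  Xs <- svd(X)
  rhs <- t(Xs$u) %*% Y
  d <- Xs$d
  div <- d^2 + K
  a <- drop(d * rhs) / div
  coef <- Xs$v %*% a
  rownames(coef) <- colnames(X)
  # Z maps the response to the coefficients; hatr is the ridge hat matrix
  Z <- solve(crossprod(X, X) + diag(K, p)) %*% t(X)
  rfit <- X %*% coef
  resid <- Y - rfit
  hatr <- X %*% Z
  colnames(coef) <- paste("K=", K, sep = "")
  list(coef = coef, xscale = Xscale, xs = X, y = Y, d = d, div = div,
       xm = Xm, ym = Ym, K = K, Inter = Inter, rfit = rfit, hatr = hatr,
       Z = Z, resid = resid)
}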
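The SVD route used in lmridgeEst() can be checked against the direct formula (2.2). With X = UDV′, the ridge estimate is V diag(d_i/(d_i² + k)) U′y. The sketch below verifies this on simulated, centred data; the data and the value of k are assumptions made for illustration.

```r
# Simulated centred regressors with two nearly collinear columns (assumption)
set.seed(4)
n <- 30
p <- 3
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
X[, 2] <- X[, 1] + 0.05 * X[, 2]
y <- rnorm(n)
y <- y - mean(y)
k <- 0.1

# Ridge estimate through the SVD, X = U D V'
sv <- svd(X)
a  <- drop(sv$d * crossprod(sv$u, y)) / (sv$d^2 + k)
beta_svd <- drop(sv$v %*% a)

# Ridge estimate through the normal equations, (X'X + kI)^{-1} X'y
beta_direct <- drop(solve(crossprod(X) + k * diag(p), crossprod(X, y)))

cbind(svd = beta_svd, direct = beta_direct)
```

The SVD form avoids forming and inverting X′X explicitly, which is why it tends to behave better when X′X is close to singular.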
The lmridgeEst() function computes estimates of the ridge coefficients together with
further required statistics such as the ridge residuals and fitted values. The
selection of variables from a data frame for model fitting is done using the formula
interface in the standard R way, such as
y ∼ x1 + x2 + x3 + x4 (2.4)
The key object generally created from a formula is the model frame, obtained from
model.frame(), a generic function that returns a data frame containing only the
variables that appear in the formula, coupled with an interpretation of the formula
in the terms attribute. The design matrix for the regression model is created using
the function model.matrix(), and the model.response() function is used to get the
response variable. lmridgeEst() is used to fit a linear RR model to the Hald data set
(Hald, 1952) with biasing parameter k = 0.1. The R command and its partial outputs are
> lmridgeEst(Y ~ X1 + X2 + X3 + X4, data = Hald, K = 0.1)
$coef
        K=0.1
X1  22.407283
X2  15.624011
X3  -6.029576
X4 -19.928493
$d
[1] 1.49522708 1.25541470 0.43197934 0.04029573
$div
[1] 2.3357040 1.6760661 0.2866061 0.1016237
$xm
       X1        X2        X3        X4
 7.461538 48.153846 11.769231 30.000000
$ym
[1] 95.42308
$K
[1] 0.1
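The formula machinery described above can be inspected on its own. The sketch below uses a small made-up data frame (an assumption, not the Hald data) that contains an extra column z which does not appear in the formula.

```r
# Toy data frame (assumption); 'z' is deliberately not part of the formula
dat <- data.frame(y  = c(3.1, 4.0, 5.2, 6.1),
                  x1 = c(1, 2, 3, 4),
                  x2 = c(0.5, 0.1, 0.9, 0.3),
                  z  = c(9, 8, 7, 6))

mf <- model.frame(y ~ x1 + x2, data = dat)  # only variables in the formula
mt <- attr(mf, "terms")                     # interpretation of the formula
X  <- model.matrix(mt, data = mf)           # design matrix
yv <- model.response(mf)                    # response vector

names(mf)
colnames(X)
attr(mt, "intercept")
```

The intercept attribute of the terms object is 1 when the model has an intercept, and the intercept is then the first column of the design matrix; lmridgeEst() relies on this to drop that column.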
This output from lmridgeEst() needs some formatting, and generic functions
such as summary() and plot() are needed for significance testing of the ridge
estimates and for plotting the ridge trace, respectively. The lmridgeEst() function
returns a list of objects named coef, K, resid, rfit, etc. This list contains
vectors and matrices. For formatting of output from functions, R classes
and methods are used; these are described in the next section.
2.2.1 Classes and Methods
For formatting of output, R classes are used to define how objects of a certain
type (mode) will look (when presented in the R console), while R methods are used to
define special functions that operate on objects of a certain class. Note that an
object in R is an instance of a class that exists at run time; whenever a class is
used to store the results for a given data set, an object of that class is created.
A class in R is a set of objects that share specific attributes, while a method is
the name for a function that can be applied to different types (modes) of objects.
For example, an object of class "lm" is a list with some specific attributes generated
by the lm() function, whereas print(), summary(), plot() and predict(), etc.,
are examples of functions with different methods (John, 2002).
The R language has two types of object systems, S3 and S4. In S3, R object(s),
class(es) and method(s) are informal and very interactive. This class system was
first described by Chambers and Hastie (1992), a book known among S programmers as
the "White Book". Object(s), class(es) and method(s) in S4 are more formal and
rigorous. In other words, the S4 class system is less interactive than S3.
The S4 system was first described by Chambers (1998) in the "Green Book". The S4 class
system for writing packages has been available in R since version 1.7.0 (see Leisch,
2008).
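To make the contrast concrete, the sketch below defines a small S4 class with the methods package. The class name "ridgeFitS4" and its slots are hypothetical; the packages developed in this work use S3.

```r
library(methods)

# Formal class definition: slot names and slot types are declared up front
setClass("ridgeFitS4", representation(coef = "numeric", K = "numeric"))

# The S4 counterpart of a print method is a method for show()
setMethod("show", "ridgeFitS4", function(object) {
  cat("Ridge fit with K =", object@K, "\n")
  print(object@coef)
})

fit4 <- new("ridgeFitS4", coef = c(x1 = 0.4, x2 = -0.1), K = 0.1)
fit4   # at top level, auto-printing calls the show() method

# Slot types are checked when the object is created
bad <- tryCatch(new("ridgeFitS4", coef = "a", K = 0.1),
                error = function(e) "rejected")
bad
```

This type checking at creation time is part of what makes S4 more rigorous; an S3 object, by contrast, is whatever its class attribute says it is.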
In the S3 class system, there is no formal definition of a class. An R object of a
new S3 class can be created by simply setting the class attribute of the object to the
name of the desired class (see John, 2002; Leisch, 2008), that is,
> results <- x
> class(results) <- "lmridge"
In R, classes are attached to an object as an attribute that determines the (specific)
behaviour of generic functions such as print(), summary() and plot(), by
invoking a method appropriate to the class of that object. Therefore, a generic
function can perform different operations on objects having different classes. In
the S3 class system, a generic function looks at the class of its first
argument and dispatches a method based on a naming convention. In other
words, the generic function passes an object to its specific method. That is, when
the print() function is called with an argument of class "lmridge", it looks for a
function print.lmridge(). More technically, the print() function does not itself
display or print the results; it looks at the class of the object passed and then
calls the specific print method of that class. If no method exists for the class of
the object, then the default method, such as print.default(), is used.
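The naming-convention dispatch described above can be tried with a toy class. In the sketch below the class name "toyfit" is hypothetical and serves only to show which function print() ends up calling.

```r
# An S3 object is an ordinary list with a class attribute
obj <- structure(list(coef = c(a = 1.5, b = -0.5)), class = "toyfit")

# Method for the generic print(), found through the name print.<class>
print.toyfit <- function(x, ...) {
  cat("toyfit object with", length(x$coef), "coefficients\n")
  invisible(x)
}

# print() dispatches to print.toyfit() for the classed object ...
out_method <- capture.output(print(obj))

# ... and falls back to the default printing once the class is removed
out_default <- capture.output(print(unclass(obj)))

out_method
out_default
```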
Once the classes are defined, we may need to perform some calculations on objects.
To perform computations on objects, generic functions and method dispatch are
required. A generic function has a special body that generally contains a
call to UseMethod(), that is
lmridge <- function(x, ...) {
  UseMethod("lmridge")
}
If an author's package contains a function that is intended to be used as a method for
a generic function, for example print.lmridge for class lmridge, then this should be
indicated in the NAMESPACE file (see Section 2.3.4) by using an S3method directive, to
ensure that the method is available in the package. For example, the following
directive ensures that the method is registered and is available for UseMethod
dispatch, so that print.lmridge need not be exported.
S3method(print, lmridge)
To provide a formula interface, the main function lmridge() needs to be generic, and
a method named "default" needs to be written, which is used when no class-specific
method exists. In this minimal example, the default method lmridge.default() takes the
formula and the data directly.
lmridge <- function(x, ...) UseMethod("lmridge")
lmridge.default <- function(formula, data, K = 0, ...) {
  est <- lmridgeEst(formula, data, K, ...)
  est$call <- match.call()
  class(est) <- "lmridge"
  est
}
The default method lmridge.default() calls the lmridgeEst()
function for the estimation of the ridge coefficients. The class of the object
returned by the "default" method is set to "lmridge". The coef() method rescales the
ridge coefficients, that is,
coef.lmridge <- function(object, ...) {
  # Undo the scaling of the regressors
  scaledcoef <- t(as.matrix(object$coef / object$xscale))
  if (object$Inter) {
    # Recover the intercept from the means used for centring
    inter <- object$ym - scaledcoef %*% object$xm
    scaledcoef <- cbind(Intercept = inter, scaledcoef)
    colnames(scaledcoef)[1] <- "Intercept"
  }
  drop(scaledcoef)
}
The coef() function needs an argument of class "lmridge", such as
> coef(lmridge(Y ~ X1 + X2 + X3 + X4, data = Hald, K = 0.1))
 Intercept         X1         X2         X3         X4
86.7701594  1.0996246  0.2898463 -0.2717493 -0.3436969
The print() method for class "lmridge" is used to obtain formatted output of the ridge
model coefficients for the given biasing parameter(s) k.
print.lmridge <- function(x, ...) {
  cat("Call:\n", paste(deparse(x$call), sep = "\n", collapse = "\n"),
      "\n", sep = "")
  print(coef(x), ...)
  cat("\n")
  invisible(x)
}
The enhanced output for linear ridge coefficients for biasing parameter k = 0.1 is,
> lmridge(Y ~ X1 + X2 + X3 + X4, data = Hald, K = 0.1)
Call:
lmridge.default(formula = Y ~ X1 + X2 + X3 + X4, data = Hald, K = 0.1)
 Intercept         X1         X2         X3         X4
86.7701594  1.0996246  0.2898463 -0.2717493 -0.3436969
Note that the rescaled ridge coefficients are printed in the R console because the
print method for class "lmridge" calls the coef() function defined above on the
object. After fitting the model, different methods such as summary and plot are
required to investigate the results. The parameter estimates of the regression model
are summarized in a matrix that contains 5 columns, namely the estimate, the
scaled estimate, the standard error, the t-test value and the p-value for each
parameter.
summary.lmridge <- function(object, ...) {
  res <- vector("list")
  res$call <- object$call
  y <- object$y
  n <- nrow(object$xs)
  rcoefs <- object$coef
  hatr <- as.matrix(object$hatr)
  edf <- n - sum(diag(2 * hatr - hatr %*% t(hatr)))
  ZZt <- object$Z %*% t(object$Z)
  vcov <- sum(object$resid^2) / edf * ZZt
  SE <- sqrt(diag(vcov))
  tstats <- (rcoefs / SE)
  pvalue <- 2 * (1 - pnorm(abs(tstats)))
  coefs <- coef(object)
  b0 <- object$ym - colSums(rcoefs * object$xm)
  seb0 <- sqrt(var(y) + sum(object$x) + sum(diag(vcov)))
  summary <- vector("list")
  if (object$Inter) {
    summary$coefficients <- cbind(coefs, c(b0, rcoefs), c(seb0, SE),
                                  c(b0 / seb0, tstats), c(NA, pvalue))
    colnames(summary$coefficients) <- c("Estimate", "Estimate (Sc)",
                                        "StdErr (Sc)", "t-value (Sc)", "Pr(>|t|)")
  } else {
    summary$coefficients <- cbind(coefs[-1], rcoefs, SE, tstats, pvalue)
    colnames(summary$coefficients) <- c("Estimate", "Estimate (Sc)",
                                        "StdErr", "t-value", "Pr(>|t|)")
  }
  summary$K <- object$K
  res$summary <- summary
  class(res) <- "summarylmridge"
  res
}
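The error degrees of freedom used in summary.lmridge() are n minus the trace of 2H - HH', where H is the ridge hat matrix. The following base-R sketch shows how this quantity behaves as K changes. The helper name ridge_edf(), the choice of regressors from mtcars and the use of centred (rather than unit-length scaled) regressors are assumptions made for illustration only.

```r
# Illustrative only: ridge_edf() is a hypothetical helper, not part of lmridge.
ridge_edf <- function(X, K) {
  Xs <- scale(X, center = TRUE, scale = FALSE)       # centred regressors
  # ridge hat matrix H = Xs (Xs'Xs + K I)^(-1) Xs'
  H  <- Xs %*% solve(crossprod(Xs) + K * diag(ncol(Xs)), t(Xs))
  nrow(Xs) - sum(diag(2 * H - H %*% t(H)))           # n - tr(2H - HH')
}

X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
edf0  <- ridge_edf(X, K = 0)    # n - p (up to rounding), since H is a projection
edf50 <- ridge_edf(X, K = 50)   # larger, because shrinkage uses fewer effective parameters
c(K0 = edf0, K50 = edf50)
```

For K = 0 the hat matrix is idempotent, so 2H - HH' = H and the trace equals the number of regressors; for K > 0 the trace falls below that number.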
A print() method for the summary object is defined so that the computation and the
display of results in the R console are kept separate.
print.summarylmridge <- function(x, digits = max(3, getOption("digits") - 3),
                                 signif.stars = getOption("show.signif.stars"),
                                 ...) {
  CSummary <- x$summary
  cat("\nCall:\n", paste(deparse(x$call), sep = "\n", collapse = "\n"),
      "\n\n", sep = "")
  cat("Coefficients: for Ridge parameter K=", CSummary$K, "\n")
  coefs <- CSummary$coefficients
  printCoefmat(coefs, digits = digits, signif.stars = signif.stars,
               P.values = TRUE, has.Pvalue = TRUE, na.print = "NA", ...)
  invisible(x)
}
The utility function printCoefmat() is used to display the output matrix from
summary.lmridge() with suitable rounding of digits and formatting.
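To see what printCoefmat() does on its own, it can be applied to any coefficient matrix whose last column holds p-values. The sketch below uses an ordinary lm() fit on the built-in mtcars data purely as a convenient source of such a matrix; it is not ridge output.

```r
# Coefficient matrix from an ordinary least squares fit (not a ridge fit).
cm <- coef(summary(lm(mpg ~ wt + hp, data = mtcars)))

# Rounds the columns sensibly, formats p-values and adds significance stars.
printCoefmat(cm, digits = 3, signif.stars = TRUE,
             P.values = TRUE, has.Pvalue = TRUE)
```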
Only a few functions from our lmridge package are shown here, and not exactly as
they appear in the actual package. The other functions available in the lmridge
package and their descriptions are given in Table 2.1. All functions perform their
calculations for each biasing parameter supplied as an argument to the lmridge()
function of our newly built lmridge package.
Table 2.1: Functions and methods in lmridge Package
Functions       Description
lmridgeEst()    The main model fitting function for implementation of ridge regression models in R.
lmridge()       Generic function and default method that calls the lmridgeEst() function and returns an object of S3 class "lmridge" with a set of methods for standard generics. It has a print method for display of de-scaled ridge coefficients.
press()         Generic function that computes the prediction residual error sum of squares (PRESS) for ridge coefficients.
summary()       Standard ridge regression output (coefficient estimates, scaled coefficient estimates, standard errors, t-values and p-values); returns an object of class "summarylmridge" containing the relevant summary statistics and has a print() method.
coef()          Displays de-scaled ridge coefficients.
vcov()          Displays the variance-covariance matrix associated with each ridge parameter k value.
predict()       Produces predicted value(s) by evaluating the function lmridgeEst() in the frame newdata.
fitted()        Displays ridge fitted values for the observed data.
residuals()     Displays ridge residual values.
kest()          Displays various k (biasing parameter) values proposed by different authors in the literature and has a print() method.
rstats1()       Generic function that displays different statistics of ridge regression such as MSE, bias and R2 etc., and has a print() method.
rstats2()       Generic function that displays different statistics of ridge regression such as df, m-scale and ISRM etc., and has a print() method.
hatr()          Generic function that displays the hat matrix from ridge regression.
infocr()        Generic function that computes the information criteria AIC and BIC.
vif()           Generic function that computes VIF values.
plot()          Ridge and VIF trace plots against biasing parameter k.
bias.plot()     Bias-variance tradeoff plot. Plot of ridge MSE, bias and variance against k.
cv.plot()       Cross validation plots of CV and GCV against biasing parameter k.
info.plot()     Plot of AIC and BIC against k.
isrm.plot()     Plots ISRM and m-scale measure.
rplots.plot()   Miscellaneous ridge related plots such as df-trace, RSS and PRESS plots.
2.3 R Packages and its Components
After coding and having a nice user interface, a new package can be created in two
ways:
1. The simplest way to create a package is to first load all of the relevant functions
(create a workspace) and data sets (that should be in the R package) into a clean
R session. The source files should have extension *.R; a mixture of source
files and workspace objects cannot be used. Make sure that the current directory
(folder) is set to the place where you want to create the R package and run the
package.skeleton() function to generate a package directory and several
sub-directories automatically in the required structure. This function prints out a
list of things that have to be done.
package.skeleton(name="lmridge", code_files =
c("fun1.R", "fun2.R", "fun3.R"), namespace=TRUE)
The newly created package has the name lmridge given in the package.skeleton()
call and contains a skeleton of the package for all functions, methods and
classes defined in the R code file(s) passed to the code_files argument. If
no source files are passed to package.skeleton(), then all available
objects in the user's current workspace will be used.
2. Create the package manually; this is recommended only for experienced developers.
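The first route can be tried safely in a temporary directory. The sketch below passes objects through the list and environment arguments of package.skeleton() instead of code_files; the package name "toypkg" and the function sq() are hypothetical and serve only to show which files the skeleton contains.

```r
# Sketch in a throw-away directory; "toypkg" and sq() are hypothetical names.
env <- new.env()
assign("sq", function(x) x^2, envir = env)

pkg_root <- file.path(tempdir(), "skeleton-demo")
dir.create(pkg_root, showWarnings = FALSE)

utils::package.skeleton(name = "toypkg", list = "sq", environment = env,
                        path = pkg_root, force = TRUE)

# Files and sub-directories generated by the skeleton
skeleton_files <- list.files(file.path(pkg_root, "toypkg"), recursive = TRUE)
skeleton_files
```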
A useful package has the sub-directories man, R and data. The following is a short
description of the content of the package directory created by the package.skeleton()
function.
Table 2.2: Package directory content and its description
Content              Description
data                 A sub-directory that contains *.rda files for each data object loaded in the workspace
DESCRIPTION          A general package information file that contains a basic description of the package created, author and license conditions (such as GPL-2) in a structured text format
man                  A sub-directory that contains help files (R documentation/manual). These files are in a simple markup language similar to LaTeX and can be processed to different formats such as LaTeX, html and plain text
NAMESPACE            Manages functions, methods and dependency information
R                    A sub-directory that contains *.R files for each function
src                  A sub-directory that contains C/C++ or FORTRAN functions; needed only if R functions call such native code
Read-and-delete-me   This file contains some instructions for completing the package and can be deleted
Note that the capitalization of files and directories is important because the R
language is case-sensitive.
2.3.1 The Package DESCRIPTION File
The DESCRIPTION file gives general information about the package in which it appears
(see Team, 2015, pp. 4-17) and (Leisch, 2008). The field names are case sensitive
and should be written in ASCII format.
Table 2.3: Package description file
Fields        Description
Package       Used for giving an official name to the package. The name should follow certain rules: it must start with a letter and can contain a combination of letters, numbers and the dot character.
Version       Used to define the version of the package. The Version is a sequence of non-negative integers (at least two) separated by dots or dashes.
Title         The Title field is used in various package listings. The number of characters in Title should not be more than 65.
Author        It describes who wrote the package. It should contain at least one author.
Maintainer    This field should have one name and a valid e-mail address (corresponding author's email).
Description   This field should contain a comprehensive description of what the package does, and can be of any length but only one paragraph.
Suggests      This field informs that code in the package uses some functionality from another existing R package "lm".
Depends       It can be a comma-separated list of R package(s) that need to be loaded in order to run the compiled package. It may also include details of the required package version, such as Depends: R (>= 3.2.3), lm.
License       This field can be free text; a standardized abbreviation (such as GPL-2, GPL-3, LGPL-3, and BSD_2_clause etc.) should be used if the package has to be submitted to the CRAN, R-Forge or Bioconductor repositories.
LazyData      It is set to yes if the package contains data objects and uses lazy loading.
A minimal example of the DESCRIPTION file for package lmridge is:
Package: lmridge
Title: Linear Ridge Regression
Version: 1.0
Date: 2015-12-01
Author: Muhammad Imdadullah and Muhammad Aslam
Maintainer: Muhammad Imdadullah <[email protected] >
Description: A linear ridge regression for testing of ridge
coefficients and estimation of biasing parameter.
Suggests: lm
License: GPL (>= 2)
The mandatory fields are: Package, Title, Version, Author, Maintainer,
Description, and License, while the remaining fields such as Date, Suggests,
Depends and LazyData are optional. Optional logical fields such as LazyData take
values that are specified as "yes", "true", "no" or "false".
For further detail about all fields in the package DESCRIPTION file see the "Writing R
Extensions" manual by R Development Core Team (2015).
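Because DESCRIPTION is a plain "field: value" file, its content can be inspected from within R using read.dcf(). The sketch below writes a small file to a temporary location and checks that the mandatory fields listed above are present. The package name "toypkg", the author and the e-mail address are placeholders, not real metadata.

```r
# Hypothetical DESCRIPTION contents for a package called "toypkg".
desc_file <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: toypkg",
  "Title: A Toy Package",
  "Version: 0.1",
  "Author: A. Author",
  "Maintainer: A. Author <a.author@example.org>",
  "Description: Illustrates the layout of a DESCRIPTION file.",
  "License: GPL (>= 2)",
  "LazyData: yes"
), desc_file)

desc <- read.dcf(desc_file)                 # one-row character matrix
mandatory <- c("Package", "Title", "Version", "Author",
               "Maintainer", "Description", "License")
missing_fields <- setdiff(mandatory, colnames(desc))
missing_fields                              # empty when all mandatory fields are present
ver <- package_version(desc[1, "Version"])  # fails if the version string is malformed
```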
2.3.2 R Documentation/ Help files
All exported functions, objects and data sets in an R package should have complete
documentation that describes how to use the functions and sample data. The source
format of R help files is similar to LaTeX and has file extension .Rd or .rd; however,
not all LaTeX commands are available. The R documentation (Rd) files can be rendered
in html, plain text, GNU info and the old nroff-based S help format too. A sub-directory
named "man" having no documentation may result in an installation error (see Team,
2015, pp. 57-74), and (Leisch, 2008). An exemplary help file for the "lmridge" function is:
\name{lmridge}
\alias{lmridge}
\alias{lmridge.default}
\alias{lmridge.formula}
\alias{print.lmridge}
\alias{summary.lmridge}
\alias{print.summary.lmridge}
\title{Linear Ridge Regression}
\description{Fits linear RR for estimation of biasing parameter and
significance testing of ridge coefficients.}
\usage{
lmridge(x, y, ...)
\method{lmridge}{default}(x, y, ...)
\method{lmridge}{formula}(formula, data = list(), ...)
\method{print}{lmridge}(x, ...)
\method{summary}{lmridge}(object, ...)
}
\arguments{
  \item{x}{regressors for model as design matrix (x)}
  \item{y}{vector of response variable (y)}
  \item{formula}{symbolic representation of the regression model to
  be fit}
  \item{data}{an optional dataframe that contains the variables used in
  the regression model}
  \item{object}{an object of class \code{"lmridge"}, a fitted model}
  \item{\dots}{not used}
}
\value{
An object of class \code{lmridge} is a list that includes the following
elements
  \item{coefficients}{a named vector of ridge coefficients}
  \item{K}{biasing parameter}
  \item{scaling}{design matrix scaling}
}
\author{Muhammad Imdadullah, Muhammad Aslam}
\examples{
data(Hald)
mod1 <- lmridge(Y ~ ., data = Hald)
mod1
summary(mod1)
}
\keyword{ridge regression, regularization}
Table 2.4: Fields of R help files
Field                   Description
\name{name}             This field is used to name the help file of the package
\alias{topic}           It usually has multiple entries, one for each word that will lead to the help page when used after the ? mark
\title{Title}           It is used for a short description of the topic, with first letter capital, and should not end with a full stop
\description{...}       It contains a few lines that describe the topic
\usage{fun(arg1, ...)}  It is used for the syntax of the function call, showing the arguments of the function
\arguments{...}         Description of each argument used in the function
\details{...}           Precise details (description) of what the function does
The example section of a help file should contain executable R code. There are two
markup commands for examples:
Table 2.5: Markup command for executable code
Field        Description
\dontshow{}  Inside \dontshow{}, the R code is executed by example() or tests, but it is not presented to the user in the help page.
\dontrun{}   Inside \dontrun{}, the R code is not executed by example() or tests.
There are other sections of a help file and other ways to specify equations, URLs, and
links to other R documentation etc. For further details of all Rd commands (see
Team, 2015, pp. 57-74).
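A quick way to check that an Rd file is syntactically well formed is to parse it with tools::parse_Rd(). The sketch below writes a minimal help file for a hypothetical function toyfun() (not part of lmridge) to a temporary file and lists the top-level sections that the parser finds, including an \examples section that uses \dontrun.

```r
# Hypothetical help page for a function toyfun(); not taken from lmridge.
rd_file <- tempfile(fileext = ".Rd")
writeLines(c(
  "\\name{toyfun}",
  "\\alias{toyfun}",
  "\\title{A Toy Function}",
  "\\description{Squares its argument.}",
  "\\usage{toyfun(x, ...)}",
  "\\arguments{",
  "  \\item{x}{a numeric vector}",
  "  \\item{\\dots}{not used}",
  "}",
  "\\value{A numeric vector of the same length as \\code{x}.}",
  "\\examples{",
  "toyfun(2)",
  "\\dontrun{toyfun(letters)}",
  "}"
), rd_file)

rd   <- tools::parse_Rd(rd_file)               # parse failures are reported here
tags <- unlist(lapply(rd, attr, "Rd_tag"))     # tag of every top-level element
sections <- setdiff(tags, "TEXT")              # drop the whitespace between sections
sections
```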
2.3.3 Data in R Package
Data from the recommended packages can be used because these packages are part of any
R installation. To add your own data set to the package you wrote, save the required
data using the save() function and copy the resulting file (*.rda or *.RData)
to the sub-directory named "data" in your package. Data can also be in other supported
data formats such as txt, csv, R code, etc.
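The save-and-copy step can be done in one go by saving directly into the data sub-directory. The sketch below uses a temporary directory; the package name "toypkg" and the data set "toydata" with its made-up values are purely illustrative.

```r
# "toypkg" and the data set "toydata" are hypothetical; values are made up.
pkg_dir  <- file.path(tempdir(), "toypkg")
data_dir <- file.path(pkg_dir, "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

toydata <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 8.1, 9.8))
save(toydata, file = file.path(data_dir, "toydata.rda"))

# Reloading into a fresh environment confirms what the file contains.
e <- new.env()
load(file.path(data_dir, "toydata.rda"), envir = e)
ls(e)
```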
Table 2.6: Documenting data sets fields
Field               Description
\name{name}         It is used to name the data object and help file
\docType{data}      It is always data for data sets
\alias{topic}       Topic, as used for functions
\title{Title}       Short description of the data object
\usage{name}        When lazy data is not in effect, data(name) can be used
\format{...}        Description of the object. If the object is a list or a dataframe, then each item needs to be documented individually
\source{...}        Detail of the original source of the data object
\references{...}    References to secondary sources
\examples{...}      Examples of how to use the data object, such as loading the data, making plots etc.
\keyword{datasets}  Always datasets
2.3.4 Importing and Exporting Objects from Namespaces
R automatically creates a namespace for the package being built. When a package is
loaded (in an R session), only the items that are exported are placed in the attached
frame, although all are loaded. Only those objects should be exported which the author
would like users to use. The skeleton will export everything, which is not recommended
because you will be forced to create and write a help file for every object. For example,
use export(a, b) to export items a and b, while use exportPattern("^[^\\.]") to
define the exported items using a pattern. For further detail, see (Team, 2015).
If an item (object) from the namespace of another package is used, it should be
imported. For example, using import(package1, package2) will import all
items from the packages mentioned in parentheses, while using importFrom(package1,
a, b) will import only items a and b from package1.
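Putting the export and import directives together, a NAMESPACE file for a package such as lmridge might look like the fragment below. This is only a sketch: the particular selection of exported functions and imported items is an assumption, not the file shipped with the package. S3 methods are registered with S3method() rather than exported by name.

```r
# Sketch of a NAMESPACE file; the selection of exports and imports is an assumption.
export(lmridge, kest, press)            # user-facing functions
S3method(lmridge, default)              # register S3 methods instead of exporting them
S3method(print, lmridge)
S3method(summary, lmridge)
S3method(coef, lmridge)
importFrom(stats, coef, pnorm, printCoefmat)
```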
2.3.5 Non-R Scripts
All non-R code, written in C, C++, FORTRAN etc., should be included in the src
subdirectory.
2.4 To Build, Check and Install an R Package
R CMD is the program used to build an R package. On a Windows machine this
program compiles the package into a zip file. The path of this program needs to be set
using the path variable in the system environment variables. After changing to the
directory that contains the package, type the following command at the DOS prompt:
R CMD INSTALL --build lmridge
Warnings and errors may occur in the check stage. CRAN does not accept any
package that produces warnings or errors from check. Rtools and a LaTeX compiler (such
as the MiKTeX distribution) are required for running check. Therefore, to check the
quality of our package lmridge, the check command is used:
R CMD check lmridge
Note that without editing the files created by the package.skeleton() function, the
build and check phases of the package will fail. To build a package for other
operating systems such as Linux, use the following command
R CMD build lmridge
A compressed package file with extension *.tar.gz will be created, which can be installed
on non-Windows machines. For further details, read "Writing R Extensions"
from CRAN.
To create the zip file (i.e., lmridge.zip) of the package, use
R CMD INSTALL --build lmridge
The check tool tests whether an R source package works correctly. The series of checks
is run on an archive prepared by R CMD build. A description of these checks is
given in Table 2.7.
Table 2.7: R CMD program's check List for the quality of R package
Check List          Description
install             Is it possible to install the package?
portability         Are all file names valid across file systems and supported operating systems?
permission          Do the files and directories have sufficient permissions?
binary              Do binary (executable) files exist? If so, a warning message will appear.
description file    Checks for the completeness and partially for the correctness of the DESCRIPTION file.
subdirectories      Do the subdirectories have suitable names and are they non-empty?
R files             Is the R syntax correct?
load                Can the package be loaded?
Code problem        Checks the R code for problems and sees if calls to library.dynam and C etc. can be interpreted sensibly.
Rd                  Checks the format for correct syntax, metadata and missing links.
undocumented items  Does an .Rd file exist corresponding to each exported function?
documentation       Checks for consistency between documentation and code for functions and datasets.
usage               Are the function arguments given in the usage field of the .Rd file(s) documented in the corresponding arguments section?
examples            Do the examples provided in the package's documentation run?
For further details about how to create a package in R language, see Leisch (2008),
Ligges (2003), Hornik (2015) and Team (2015).
Chapter 3
Multicollinearity Diagnostics and
mctest Package
3.1 Introduction
The linear regression model (LRM) and its offshoots such as two or three stage least
squares (2-SLS, 3-SLS) have been widely used as quantitative tools for the social
and physical sciences over the last several decades. The OLS method is
popular due to its low computational cost, its intuitive plausibility in a wide variety
of circumstances and its support by a broad and sophisticated body of statistical
inference. The OLS method is used in three ways. First, it is applied descriptively,
merely as a means of curve fitting. Second, it is used for testing of hypotheses.
Third, it provides an environment in which statistical theory, subject-specific theory
and data may be brought together to enhance our understanding of complex social and
physical phenomena. Relevant statistical theory has originated from each of the above
three perspectives and practical guidelines have also been developed (Belsley et al.,
1980; Belsley, 1991).
The same degree of practical experience and theoretical support cannot
be said to exist for examining and evaluating the quality and potential influence
of the data (that are assumed "given"), because the thrust of standard regression
theory is based on sampling variation or fluctuation, which is reflected in the
regression coefficients, the Var-Cov matrix and associated statistical procedures
such as the t-test, F-test and prediction intervals. The regressors are treated as
fixed, but data and model may actually be in conflict in ways not readily analyzed or
examined by existing standard statistical procedures. Thus, after the
significance tests (such as the t-test and F-test) have been examined and all the
model variants have been compared, the researcher often feels that his/her results
from regression analysis are less meaningful (or significant) and less trustworthy
than might otherwise be the case, because of potential problem(s) with the data
that are generally neglected in practice. Similarly, different subsets of the data
may produce very dissimilar results, raising questions about the stability of the
statistical model. On the other hand, the researcher may know that certain
observations (a few data points) pertain to some unusual circumstances (such as
strikes, wars and floods) but be unsure of the extent to which the results
depend on them, for good or ill. In the data collection procedure, a more pernicious
situation may arise when an unknown error generates anomalous data point(s)
that cannot be suspected on prior grounds. The researcher may feel that
collinearity is causing some trouble(s), possibly generating non-significant
estimates of regression coefficients that are supposed to be important on the basis of
theoretical considerations (Belsley et al., 1980; Belsley, 1991).
For small regression models, researchers often detect some form of (multi)collinearity
or even some unusual data point(s) while handling the data with statistical software
and by using different descriptive statistics. With the use
of very high-speed computers in the current era and of large data sets and models,
the researcher has become isolated from intimate knowledge of the data being
used, because the suitability of the data receives only cursory examination. Similarly,
data related problems are often ignored and all the data points are included by appeal
to the law of large numbers. However, this is of course inappropriate if some of the data
are in error or come from a different regime. On the other hand, the researcher's
understanding of the degree to which regression results depend on the specific data sample
being used does not increase even if all the data are correct and relevant to the
regression model. The researcher may be unaware of the properties that additionally
collected data may have, either to reduce the sensitivity of the estimated regression
model to some part of the data, or to remedy ill-conditioned data that may be
precluding useful estimation of some parameters altogether (Belsley et al., 1980;
Belsley, 1991).
The objective of multiple regression analysis is to estimate the relationship of
individual parameters of a dependency, but not of an interdependency, by assuming
that the response variable y and the regressors (X's) are linearly related to each
other (see Graybill, 1980; Johnston, 1963; Johnston and DiNardo, 1997; Malinvaud,
1968). Our focus is on drawing inferences such as (i) identifying the relative
influence of the regressor(s), (ii) prediction and/or estimation and (iii) selection of
an appropriate set of regressor(s) for the regression model. So, our intention in using
a regression model is to find out to what extent the dependent variable can be
predicted by the relevant regressors. For this purpose, usually $R^2$ (the coefficient of
determination) is used to indicate the strength of prediction, also called the goodness
of fit of the regression model. The model fit is considered to be good if the overall
value of $R^2$ is high enough, that is, near to 1. The model fit becomes poor or very
poor when important significant regressor(s) or variable(s) is/are omitted from
the model.
In order to draw inferences from regression analysis, the regressors should have no
linear relationship among themselves, i.e. they should be orthogonal. In most
applications of regression analysis, however, the regressors are not orthogonal, which
leads to misleading or erroneous inferences, especially when the regressors are
perfectly or nearly perfectly linearly related (dependent
on each other). This is known as the problem of multicollinearity (see Gunst and
Mason, 1977; Gunst, 1983; Mason et al., 1975; Ragnar, 1934), a term first used
by Ragnar (1934). Multicollinearity is the lack of independence or the presence
of interdependence signified by high inter-correlations ($R = X'X$) within a set of
regressors (see Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst, 1983; Gunst
and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is not a problem as
it can easily be detected and resolved by dropping one of the regressor(s) causing
multicollinearity (Belsley et al., 1980). Multicollinearity is considered a specific
characteristic of the design matrix X, not a statistical aspect of the LRM described
in Eq. (2.1). Therefore, multicollinearity is a data related problem, not a statistical
problem (Belsley et al., 1980).
Collinear data usually arise and have potential for harm when applying regression
analysis in geophysics, oceanography, econometrics, and all other fields that rely on
non-experimental data. It is a fact of life that many of the possible regressors are
highly collinear (correlated, confounded); however, it becomes very difficult to
infer the separate influence of these collinear regressors on the response variable,
because these collinear regressors do not provide information that is very different
from that already inherent in the other regressor(s). On the other hand, perfect
collinearity destroys the uniqueness of the least squares estimators (LSEs) (Belsley
et al., 1980).
The researcher is faced with a problem when the degree of correlation between regressors
is high but not perfect, because one of the assumptions of the CLRM, namely
that the regressors are not collinear with each other, is violated. The other related
assumptions are that the number of observations in the data must be greater than the
number of regressors being used in the regression model and that there should be
sufficient variability in the values of the regressors.
Statistically, an exact linear relationship exists if $c_1X_1 + c_2X_2 + \cdots + c_kX_k = 0$
is satisfied, where $c_1, c_2, \cdots, c_k$ are constants that are not all zero simultaneously.
Nowadays, the term multicollinearity is used for perfect multicollinearity as well as for
imperfect multicollinearity (where the X variables are inter-related), i.e.
$c_1X_1 + c_2X_2 + \cdots + c_kX_k + v_i = 0$, where $v_i$ is a stochastic error term.
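The difference between the exact and the stochastic form of the relationship can be seen numerically. In the sketch below all regressors are simulated and the constants (2, -1) and the error standard deviation 0.01 are arbitrary choices for illustration: the exact relationship makes the data matrix rank deficient, whereas the near relationship leaves it of full rank but badly conditioned.

```r
# Simulated regressors; all object names and constants here are illustrative.
set.seed(1)
n  <- 50
X1 <- rnorm(n)
X2 <- rnorm(n)
X3_exact <- 2 * X1 - X2                     # 2*X1 - X2 - X3 = 0 holds exactly
X3_near  <- X3_exact + rnorm(n, sd = 0.01)  # same relation plus a small error v_i

rank_exact <- qr(cbind(X1, X2, X3_exact))$rank   # rank deficient: one column is redundant
rank_near  <- qr(cbind(X1, X2, X3_near))$rank    # full rank, yet nearly dependent
cond_near  <- kappa(cbind(X1, X2, X3_near), exact = TRUE)  # large condition number

c(rank_exact = rank_exact, rank_near = rank_near, cond_near = cond_near)
```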
Strictly speaking, the distinction between collinearity and multicollinearity is that
the term multicollinearity is used when more than one exact linear relationship
exists among the regressors, while the term collinearity is used for the existence of a
single linear relationship; however, multicollinearity refers to both cases nowadays.
We will use both terms interchangeably as required.
3.2 Sources of Multicollinearity
There are several sources of multicollinearity in data; the differences among
them must be clearly understood, because the interpretation of the resulting model
depends on the cause(s) of the problem. Following are the sources of
multicollinearity stated in the existing literature (Gujarati and Porter,
2008; Gunst and Mason, 1977; Koutsoyiannis, 1977; Mason et al., 1975;
Montgomery and Peck, 1982, among many others).
• The data collection method used, e.g. sampling over a limited range of the
values taken by the regressors in the population, i.e. samples are taken from a
subspace of the region of the regressors.
• Constraints on the fitted model or on the population being sampled.
• Model specification problems, that is, when polynomial terms are added to the
model, causing ill-conditioning of the $X'X$ matrix.
• An overdefined model, that is, a model that has more regressors than the
number of observations in the data set.
• A common trend in time series data, i.e. regressors may be
growing/decaying over time at approximately the same (constant) rate.
• Improper use of dummy variables (dummy variable trap).
• Inclusion of the same variable twice in the model. For example, the variables height in
inches and height in feet.
• Inclusion or exclusion of a variable or even of certain observations may greatly
change the estimated regression coefficients, showing the existence of
multicollinearity.
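The case of the same variable entering twice (height in inches and in feet) is easy to reproduce. In the sketch below the data are simulated and the variable names are illustrative; lm() detects the exact dependency and reports the coefficient of the redundant regressor as NA.

```r
# Simulated heights and weights; variable names and constants are illustrative.
set.seed(2)
height_in <- rnorm(30, mean = 66, sd = 3)   # height in inches
height_ft <- height_in / 12                 # the same variable in feet
weight    <- 2.5 * height_in + rnorm(30, sd = 5)

fit <- lm(weight ~ height_in + height_ft)
coef(fit)             # the coefficient of height_ft is NA (aliased)
alias(fit)$Complete   # expresses height_ft as a multiple of height_in
```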
3.3 Consequences of Multicollinearity
Multicollinearity does not lessen the predictive power or reliability of the regression
model as a whole; it only affects the individual regressors (Koutsoyiannis, 1977), i.e.
models having correlated regressors can indicate how well the entire collection
of regressors predicts the response variable, but they may not give valid results
about any individual regressor or about which regressors are redundant with respect
to others.
In case of near or high multicollinearity (existence of linear dependencies among
regressors), the following are the potentially serious effects on the regression estimates,
which are thoroughly discussed in the literature (see Chen, 2012; Belsley, 1991; Gujarati
and Porter, 2008; Gunst, 1983; Rawlings et al., 1998; Swamy et al., 1985, among
many others).
• Statistical/mathematical software fails to perform matrix inversion
because the $X'X$ matrix becomes singular (ill-conditioned), therefore estimation
of coefficients and standard errors is not possible.
• Although the OLS estimators are BLUE, the estimates $\hat{\beta}$ tend to be
too large in absolute value and far from the true $\beta$'s.
• The OLSEs have large variances, covariances and standard errors, which make
precise/accurate estimation of regression coefficients difficult.
• Confidence intervals tend to become wider due to larger standard errors and lead
to acceptance of the null hypothesis $\beta = 0$.
• It also becomes difficult to isolate and measure the separate effect of regressor(s)
on the response variable, that is, estimation of regression coefficients becomes
difficult because each coefficient measures the effect of the corresponding regressor
while holding all other regressors constant.
• The t-ratio of one or more regression coefficients tends to be
statistically non-significant even though the regressors have a significant effect on the
response variable, while $R^2$ (explained variation) can be relatively very high.
• Increase in the type-II error, that is, failure to reject the null hypothesis that
the regression coefficients are not different from zero.
• Structural integrity of the econometric model is affected.
• The OLSEs and their standard errors can be sensitive to small changes in the
data point(s), that is, results are not robust.
• The correlated X's correspond to large values of $(X'X)^{-1}$, which inflates the
estimated variances for $\hat{y}$, i.e.
$$V(\hat{y}) = V(X\hat{\beta}) = X\,V(\hat{\beta})\,X' = \sigma^2 X (X'X)^{-1} X'$$
It means that the existence of multicollinearity inflates the estimated variances of
the predicted values for sets of x values, especially when these values are not
in the sample.
• The signs of the parameter estimates may differ from the signs of the true parameters.
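The inflation of variances listed above can be illustrated with the diagonal of $(X'X)^{-1}$, which is proportional to the variances of the OLS estimates. The sketch below uses simulated regressors; the sample size, the error standard deviation 0.05 and the prediction point are arbitrary assumptions chosen to make the effect visible.

```r
# Simulated design matrices; names and the noise level 0.05 are assumptions.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2_indep <- rnorm(n)                     # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)     # nearly a copy of x1

# diag((X'X)^{-1}) is proportional to the variances of the OLS estimates
var_factor <- function(X) diag(solve(crossprod(X)))
X_indep <- cbind(1, x1, x2_indep)
X_coll  <- cbind(1, x1, x2_coll)
vf <- rbind(independent = var_factor(X_indep), collinear = var_factor(X_coll))
vf

# x0'(X'X)^{-1}x0 for a prediction at a point away from the x1 = x2 pattern
x0 <- c(1, 1, -1)
pred_factor <- c(independent = drop(t(x0) %*% solve(crossprod(X_indep), x0)),
                 collinear   = drop(t(x0) %*% solve(crossprod(X_coll), x0)))
pred_factor
```

The second comparison corresponds to the remark on predicted values: the collinear design is most harmful for x values that do not follow the pattern seen in the sample.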
All these theoretical considerations are thought to be important for detection of
multicollinearity among regressors (see Adnan et al., 2006; Belsley, 1991; Chatterjee
and Hadi, 2006; Chen, 2012; Greene, 1993; Younger, 1979, etc.).
3.4 Dealing with Multicollinearity
Following are the different ways already available in the literature (see Feldstein, 1973;
Gujarati and Porter, 2008; Johnston and DiNardo, 1997; Maddala, 1992; Wooldridge,
2009) to reduce or to minimize the existence of multicollinearity:
• Exclude one of the most correlated X variable(s) from the model, although
it may lead to model specification error. The objective is to avoid redundant
variable(s) in the regression model (Bowerman et al., 1993).
• Find another regressor (to include in the model) related to the concept under study,
which is not collinear with the other regressors.
• Put some constraints on the effects of variables. For example, if two or more
variables have equal effects or effects of equal magnitude but opposite direction,
one might have to compute a new variable. For example, if years of education
and years of job experience are highly collinear, then compute a new variable
such as years of Education + Job Experience and use this instead.
• Increase the sample size, as a larger sample reduces the problem of
multicollinearity by reducing standard errors; additional data points also
tend to produce more variation across the columns of the X matrix, allowing
the effects of the variables to be better differentiated.
3.5 Multicollinearity Detection Methods
Diagnosing collinearity is important to many practitioners/researchers using the LS
method. It consists of two related but separate elements: (1) detecting the existence
of collinear relationships between the data series (regressors) and (2) assessing the
extent to which these relationships have degraded the parameter estimates. These
diagnostic methods will assist the researcher in determining whether and where
some corrective action is necessary and worthwhile (Belsley et al., 1980).
Kmenta (1980) gave the following warnings:
• "Multicollinearity is a question of degree and not of kind. The meaningful
distinction is not between the existence and the absence of (multi)collinearity,
but it is between its several degrees".
• "Multicollinearity refers to the condition of the regressors (that are assumed to
be non-stochastic), it is a characteristics of the sample not of the population".
Existence of multicollinearity should always be tested when examining a data set as
an initial step in multiple regression analysis, because of the adverse effects of
multicollinearity and the pitfalls that may exist (see Section 3.3).
Several diagnostic measures for the quantification of collinearity are available in the
literature; however, none of these diagnostic measures can be regarded as a
synthetic and normalized method at the same time (Belsley et al., 1980; Silvey,
1969; Kovács et al., 2005). These multicollinearity detection methods can be
classified in two ways:
1. Graphical methods for detection of multicollinearity
2. Numerical methods for detection of multicollinearity
3.5.1 Graphical Methods of Diagnostics
• Tableplot for Condition Indices and Variance Proportions
Friendly and Kwan (2003) used tableplots, an improvement on the standard
tabular display, for diagnosing multicollinearity. The tableplot was developed
by Kwan (2008) to render the numeric information in a table together with
symbols whose sizes are proportional to the cell values. Other visual
attributes such as shape, background fill and color fill are used to encode
additional information needed for visual understanding and inspection of
collinearity. In the first column of the tableplot, the symbols are scaled
relative to a maximum CI of 30, while in the remaining columns, variance
proportions are scaled relative to a maximum of 100.
• Collinearity Biplot
Friendly and Kwan (2003) also proposed the collinearity biplot as a method
to visualize the contribution of regressors to multicollinearity. The standard
biplot (Gabriel, 1971; Gower and Hand, 1996) can be considered a
multivariate scatter-plot, obtained by projecting a multivariate sample into a
low-dimensional space that accounts for the greatest variance in the data. In a
biplot, (i) the mean value of each regressor is the origin of its variable
vector, which points in the direction of positive deviations from the average
of that variable, (ii) the angle between regressors depicts the degree of
relationship between them, (iii) the angles between each regressor and the
biplot axes approximate the relationship among regressors, (iv) because the
regressors are scaled to unit length, the relative length of each regressor
indicates the proportion of variance represented in the low-rank
approximation, (v) the orthogonal projections of the observation points on
the regressors show approximately the value of each observation on each
regressor, and (vi) the observations, indicated as principal component scores,
are uncorrelated by construction.
Still, the standard biplot is less useful for visualizing the relations among
regressors that contribute to near multicollinearity. A biplot of the smallest
dimensions shows these relations directly and can also be used to show other
features of the data, such as outliers and leverage points.
• VIF and Eigenvalue Plots
A graphical representation of the VIF value of each regressor can be used to
detect the existence of collinearity visually. Similarly, eigenvalues can be
plotted. Larger values of VIF or smaller eigenvalues can be read from the
vertical axis for each regressor.
3.5.2 Numerical Method of Diagnostics
The following numerical diagnostic measures for the detection of
multicollinearity are available in the existing literature and have been
provided or discussed by various authors
(Belsley et al., 1980; Curto and Pinto, 2011; Farrar and Glauber, 1967; Fox, 1986;
Greene, 1993; Gunst and Mason, 1977; Klein, 1962; Koutsoyiannis, 1977; Kovács
et al., 2005; Marquardt, 1970; Theil, 1971).
The most widely used and most often suggested diagnostics are the values of
pair-wise correlations, VIF, TOL, eigenvalues and eigenvectors, CN & CI, Leamer's
method, Klein's rule, the tests proposed by Farrar and Glauber (1967), the Red
indicator and Theil's measure. Some details about these diagnostics are described below.
1. High Correlation between Exogenous Variables
If the zero-order or pairwise correlation coefficient between two regressors is
high (say >0.8), then multicollinearity may be a serious problem (Gujarati and
Porter, 2008; Maddala, 1988). However, this is not a necessary and sufficient
condition for the detection of multicollinearity, because of the linear
dependencies that may exist among regressors (Judge et al., 1985). Multicollinearity
may also exist even though the pairwise correlations are comparatively low
(say <0.5). However, multicollinearity may be harmful if $r_{ij} \ge R^2$
(Huang, 1970).
2. High R2 and low t-ratios
A high $R^2$ (say >0.8) may be considered the classic symptom of harmful
multicollinearity (c.f. Gujarati and Porter, 2008). In most such cases, the overall
F-test rejects the null hypothesis that the partial slopes are simultaneously equal
to zero, but some or all of the individual t-ratios of the partial slopes are
non-significant. The weakness of this diagnostic is that "it is too strong in the
sense that collinearity is regarded as harmful destructive only when all of the
influences of regressors on y (response variable) cannot be disentangled" (c.f.
Gujarati and Porter, 2008). A model with no multicollinearity problem and a
high $R^2$ should also have high t-ratios for its coefficients.
3. The Farrar-Glauber test
Farrar and Glauber (1967) suggested a set of three statistical tests for
multicollinearity. The first is a Chi-square test for detecting the strength of
multicollinearity over the complete set of regressors,
\[ \chi^2 = -\left[ n - 1 - \tfrac{1}{6}(2k + 5) \right] \log_e \left[ \text{value of standardized determinant} \right]. \]
The second is an F-test for locating the variables which are collinear, by
computing the multiple correlation coefficients among the explanatory variables,
\[ F^* = \frac{ R^2_{x_j . x_1 x_2 \cdots x_k} / (k - 1) }{ \left( 1 - R^2_{x_j . x_1 x_2 \cdots x_k} \right) / (n - k) }, \quad j = 1, 2, \cdots, p, \]
and the third is a test for finding the pattern of multicollinearity, based on
the partial correlation coefficients among regressors,
\[ t^* = \frac{ r_{x_i x_j . x_1 x_2 \cdots x_k} \sqrt{n - k} }{ \sqrt{ 1 - r^2_{x_i x_j . x_1 x_2 \cdots x_k} } }. \]
Studying partial correlations may be useful; however, there is no guarantee that
they will provide an infallible guide to multicollinearity, because
both the $R^2$ and all the partial correlations may be sufficiently high.
Note that the Farrar-Glauber test of multicollinearity is based only on the
correlation or partial correlation coefficients of the regressors and makes no use
of the overall $R^2$.
Wickers (1975) showed that the Farrar and Glauber (1967) partial correlation
test is inefficient, as a given partial correlation may be compatible with different
multicollinearity patterns.
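As a rough illustration of the first (Chi-square) test, the statistic can be computed directly once the determinant of the correlation matrix of the regressors is known. The sketch below is in Python rather than the R routines used in this work; the function name `farrar_glauber_chi2` and the example inputs are hypothetical, and the determinant is assumed to have been computed beforehand. The statistic is usually referred to a Chi-square distribution with k(k − 1)/2 degrees of freedom.

```python
import math


def farrar_glauber_chi2(det_r, n, k):
    """Farrar-Glauber Chi-square statistic.

    det_r : determinant of the correlation matrix of the regressors
    n     : sample size
    k     : number of regressors
    """
    if not 0.0 < det_r <= 1.0:
        raise ValueError("determinant of a correlation matrix must lie in (0, 1]")
    return -(n - 1 - (2 * k + 5) / 6.0) * math.log(det_r)


# Orthogonal regressors have determinant 1, so the statistic is zero.
print(farrar_glauber_chi2(1.0, n=50, k=6))

# Two regressors with correlation 0.9: determinant = 1 - 0.9**2.
print(farrar_glauber_chi2(1 - 0.9 ** 2, n=50, k=2))
```

The closer the determinant is to zero, the larger the statistic becomes, which is why a large value is read as evidence of collinearity over the whole set of regressors.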
4. Determinant
The X′X matrix will be singular (cannot be inverted) if it contains linearly
dependent columns or rows. Therefore, it is useful to calculate the
determinant of the matrix X′X (constant term not included). A determinant of
the normalized correlation matrix |X′X| equal to zero indicates perfect
multicollinearity, while a small value of the determinant indicates an almost
singular matrix, i.e. near multicollinearity (Asteriou and Hall, 2007).
The determinant lies on the scale 0 ≤ |X′X| ≤ 1 (see Cooley and Lohnes, 1971).
This diagnostic is a very weak measure of the harmfulness of (multi)collinearity.
It is better to use other diagnostic measures that reflect the sensitivity
of the parameters with respect to small changes in X′X. The determinant does
not provide information about the interdependence between regressors; it only
provides information about the singularity (departure from orthogonality) of a
correlation matrix.
5. Variance Inflation Factor (VIF) and Tolerance
The VIF terminology was introduced by Marquardt (1970). The VIF measures how
much the variances of the estimated regression coefficients are increased
compared with the case of no correlation among the p regressors.
The diagonal elements of the matrix $C = (X'X)^{-1}_{p \times p}$ are considered
very important in detecting multicollinearity. The jth diagonal element of C can
be represented as $C_{jj} = (1 - R_j^2)^{-1}$, where $R_j^2$ is the coefficient of
determination when regressor $X_j$ is regressed on the remaining (p − 1)
regressors. $R_j^2$ will be small and $C_{jj}$ close to one when $X_j$ is orthogonal
or nearly orthogonal to the remaining (p − 1) regressors. If the
regressor $X_j$ is nearly dependent on some of the remaining (p − 1) regressors,
$R_j^2$ will be near one and $C_{jj}$ will be large.
The collinearity of regressor $X_j$ with the remaining (p − 1) regressors increases
as the VIF increases, and the VIF can be infinite. If the VIF of a variable exceeds
10 (which happens when $R_j^2 > 0.9$), that variable is said to be highly collinear
(Kleinbaum et al., 1988). It is better to examine the square root of the VIF than
the VIF itself, because the precision of estimation of $\beta_j$ is proportional to
the standard error of $\beta_j$, not to its variance (Stewart, 1987). Tolerance (TOL)
can also be used as a measure of the existence of multicollinearity, in view of its
intimate connection with $VIF_j$:
\[ TOL_j = \frac{1}{VIF_j} = (1 - R_j^2). \]
It can be seen from the formula that the closer $TOL_j$ is to zero, the greater the
degree of collinearity of that variable with the other regressors. In other words,
the closer $TOL_j$ is to 1, the greater the evidence that regressor $X_j$ is
not collinear with the remaining regressors.
Although VIF gives a good measure of multicollinearity, it is unable to
elucidate the structure of several coexisting near dependencies among regressors.
One advantage of using $R_j^2 = 1 - \frac{1}{VIF_j}$ is that it does not depend on
the scaling of the data: it is unaffected by whether raw, centered or standardized
data are used. However, a criticism of VIF is that
\[ Var(\beta_j) = \frac{\sigma^2}{\sum x_j^2} VIF_j \]
depends on $\sigma^2$, $\sum x_j^2$ and $VIF_j$, which shows that a high VIF can be
counterbalanced by a low $\sigma^2$ or a high $\sum x_j^2$. So a high VIF is neither
a necessary nor a sufficient measure of multicollinearity. Because of its simplicity
and direct interpretation, VIF or the square root of VIF is considered the principal
diagnostic for the detection of multicollinearity. That is why nearly all statistical
software reports VIF and/or TOL in its regression output.
Multicollinearity can be detected by examining VIF and condition indices
(Neter et al., 1989), therefore, examination of the eigenvectors, corresponding
to small singular values should be done.
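The relation between the auxiliary $R_j^2$, the VIF and the tolerance can be sketched in a few lines. This is only a minimal Python illustration (not the R routines of this work); the function name `vif_and_tol` and the example values of $R_j^2$ are hypothetical, and the auxiliary $R_j^2$ is assumed to be already available.

```python
def vif_and_tol(r2_j):
    """Return (VIF_j, TOL_j) for an auxiliary R-squared r2_j in [0, 1)."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must lie in [0, 1)")
    tol = 1.0 - r2_j          # TOL_j = 1 - R_j^2
    return 1.0 / tol, tol     # VIF_j = 1 / TOL_j


for r2_j in (0.0, 0.5, 0.8, 0.95):
    vif, tol = vif_and_tol(r2_j)
    # Rule of thumb quoted above: VIF > 10 marks a highly collinear regressor.
    verdict = "highly collinear" if vif > 10 else "not flagged"
    print(f"R2_j={r2_j:.2f}  VIF={vif:6.2f}  sqrt(VIF)={vif ** 0.5:5.2f}  "
          f"TOL={tol:.2f}  {verdict}")
```

The square root of the VIF is printed alongside because it is the factor by which the standard error, rather than the variance, is inflated.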
6. Sensitivity of Parameters
Simple regression yields the same parameter estimates as multiple
regression when there is zero multicollinearity. The problem of multicollinearity
is proportional to the sensitivity of the parameters to the addition
of new regressors. Similarly, when multicollinearity is not perfect, estimation
of the coefficients is possible, but the estimates and their standard errors
become very sensitive to even a slight change in the data. All of this can be
used for the detection of possible multicollinearity.
7. Auxiliary Regression
Using auxiliary regressions, the $R^2$ designated by $R_j^2$ is computed by
regressing each $X_j$ on the remaining X variables. The relationship between F
and $R^2$ for each auxiliary regression is given by
\[ F_j = \frac{ R^2_{x_j . x_2, x_3, \cdots, x_p} / (p - 2) }{ \left( 1 - R^2_{x_j . x_2, x_3, \cdots, x_p} \right) / (n - p + 1) } \sim F^*(p - 2,\; n - p + 1), \]
where n is the sample size, p is the number of regressors including the intercept
and $R^2_{x_j . x_2, x_3, \cdots, x_p}$ is the coefficient of determination in the
regression of variable $X_j$ on the remaining X variables.
If the computed $F_j$ exceeds the critical value $F^*$ ($F_j > F^*$) at the chosen
level of significance, then the regressor $X_j$ is taken to be collinear with the
other regressors. If $F_j$ is statistically significant, a decision is still needed
on whether the particular $X_j$ should be dropped from the model.
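A minimal Python sketch of the auxiliary-regression F statistic follows, using the degrees of freedom stated for its reference distribution, (p − 2, n − p + 1). The function name `auxiliary_f` and the example values are hypothetical; the auxiliary $R_j^2$ is assumed to be known, and the critical value $F^*$ would still have to be taken from tables or statistical software.

```python
def auxiliary_f(r2_j, n, p):
    """F statistic for the auxiliary regression of X_j on the other regressors.

    r2_j : R-squared of the auxiliary regression
    n    : sample size
    p    : number of regressors including the intercept
    """
    df1, df2 = p - 2, n - p + 1
    if df1 <= 0 or df2 <= 0:
        raise ValueError("need p > 2 and n - p + 1 > 0")
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must lie in [0, 1)")
    return (r2_j / df1) / ((1.0 - r2_j) / df2)


# Six regressors plus intercept (p = 7), n = 50, auxiliary R-squared of 0.9.
print(auxiliary_f(0.9, n=50, p=7))
```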
8. Klein's Rule of Thumb
Klein's (1962) rule of thumb is that multicollinearity may be a troublesome
problem if the $R_j^2$ from an auxiliary regression is larger than the overall $R^2$
(obtained from the regression of y on all the regressors) (Greene, 1993).
9. Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors of X′X, or of its related correlation matrix R, have
been used in dealing with multicollinearity for many years. Kloek and Mennes (1960)
depicted several ways of using principal components of X or a related matrix to
reduce some ill effects of multicollinearity. For diagnostic purposes, Kendall
(1957) and Silvey (1969) suggested using the eigenvalues of X′X to check for the
presence of multicollinearity, with the criterion that a small eigenvalue (near
zero) is an indication of high collinearity, but they did not mention how
small the eigenvalue should be.
From the eigenvalues, we can compute the condition number κ, defined as
\[ \kappa = \frac{\lambda_{max}}{\lambda_{min}}, \]
and the condition index (CI), also called complaint number, defined as
\[ CI = \sqrt{\kappa} = \sqrt{\frac{\lambda_{max}}{\lambda_{min}}}. \]
If the value of κ is between 100 and 1000 there is moderate to strong
multicollinearity, and if it exceeds 1000 there is severe multicollinearity.
Alternatively, if $CI = \sqrt{\kappa}$ is between 10 and 30, there is moderate to
strong multicollinearity, and if it exceeds 30 there is severe multicollinearity
(Belsley, 1991).
Similarly, the incremental percentage of an eigenvalue relative to the total is
also used to detect collinearity among regressors. An incremental percentage
near 0 indicates that the data are collinear.
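The condition number and condition index rules can be sketched as follows. This Python fragment is only an illustration: the function name `condition_diagnostics`, the label used for values below the lower threshold, and the example eigenvalues are assumptions, and the eigenvalues themselves are taken as given. The classification uses the CI thresholds (10 and 30) quoted above.

```python
import math


def condition_diagnostics(eigenvalues):
    """Return (kappa, CI, label) from the eigenvalues of a correlation matrix."""
    lam_max, lam_min = max(eigenvalues), min(eigenvalues)
    if lam_min <= 0.0:
        raise ValueError("eigenvalues must be strictly positive")
    kappa = lam_max / lam_min
    ci = math.sqrt(kappa)
    if ci > 30:
        label = "severe"
    elif ci >= 10:
        label = "moderate to strong"
    else:
        label = "below the moderate threshold"
    return kappa, ci, label


# For two standardized regressors with correlation r, the eigenvalues of the
# correlation matrix are 1 + r and 1 - r.
for r in (0.0, 0.9, 0.99, 0.999):
    print(r, condition_diagnostics([1 + r, 1 - r]))
```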
10. The sum of $\lambda_j^{-1}$
Investigation of the eigenvalues and eigenvectors of the X′X matrix helps in
assessing the degree of multicollinearity. In an orthogonal system
\[ \sum_{j=1}^{p} \lambda_j^{*} = \sum_{j=1}^{p} \lambda_j^{*-1} = p, \]
where the $\lambda_j^{*}$ are the p eigenvalues of the correlation matrix
$R^{*} = I_{p \times p}$. Therefore, for a sample-based correlation matrix R with
eigenvalues $\lambda_j$, j = 1, 2, · · · , p, we can compare
$\sum_{j=1}^{p} \lambda_j^{-1}$ with p.
Large values of $\sum_{j=1}^{p} \lambda_j^{-1}$ (say five times the number of
predictor variables) indicate severe collinearity (Dillon and Goldstein, 1984;
Chatterjee and Hadi, 2006).
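As a small worked example (an illustration added here, assuming only two standardized regressors with correlation r), the eigenvalues of the 2 × 2 correlation matrix are $1 + r$ and $1 - r$, so

```latex
\[
  \sum_{j=1}^{2} \lambda_j^{-1}
  = \frac{1}{1 + r} + \frac{1}{1 - r}
  = \frac{2}{1 - r^2}.
\]
% r = 0 gives the orthogonal value p = 2.
% The "five times p" rule, 2 / (1 - r^2) >= 10, is reached when r^2 >= 0.8,
% i.e. when |r| is roughly 0.894 or larger.
```

For r = 0 the sum equals p = 2, and it grows without bound as |r| approaches 1.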
11. Leamer's Method
Leamer (in Greene, 1993) suggested the following measure of the effect of
multicollinearity for the jth variable:
\[ c_j = \left\{ \frac{ \left( \sum_i (X_{ij} - \bar{X}_j)^2 \right)^{-1} }{ (X'X)^{-1}_{jj} } \right\}^{1/2}, \]
where $(X'X)^{-1}_{jj}$ is the jth diagonal element of the matrix $(X'X)^{-1}$. This
measure is the square root of the ratio of the variances of the estimated
coefficient ($\beta_j$) when estimated without and with the other variables. If
$X_j$ is uncorrelated with the other variables, $c_j$ would be 1; otherwise, $c_j$
will be equal to $(1 - R_j^2)^{1/2}$.
12. Theil's Measure
Theil (1971) proposed a measure of multicollinearity based on the incremental
contributions ($R^2 - R^2_{-j}$) to the squared multiple correlation, where
$R^2_{-j}$ is the $R^2$ of the regression of the response variable on all the
regressors excluding $X_j$. Specifically, the multicollinearity effect is measured by
\[ m = R^2 - \sum_{j=1}^{p} \left( R^2 - R^2_{-j} \right). \]
If Theil's measure is zero, then all X's are mutually uncorrelated, as the
incremental contributions all add up to $R^2$. The measure m can be negative or
highly positive, which makes it difficult to use for guidance.
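Theil's measure is simple arithmetic once the full-model $R^2$ and the leave-one-regressor-out values $R^2_{-j}$ are available. The Python sketch below assumes these have been obtained elsewhere; the function name `theil_measure` and the numbers are purely illustrative, not results from this study.

```python
def theil_measure(r2_full, r2_without):
    """Theil's m = R^2 - sum_j (R^2 - R^2_{-j}).

    r2_full    : R-squared of y on all regressors
    r2_without : list whose jth entry is the R-squared of y on all
                 regressors except X_j
    """
    return r2_full - sum(r2_full - r2_minus_j for r2_minus_j in r2_without)


# Uncorrelated regressors: the incremental contributions (0.1, 0.2, 0.3)
# add up to the full R-squared of 0.6, so m is zero up to rounding.
print(theil_measure(0.6, [0.5, 0.4, 0.3]))

# Collinear regressors: dropping any single one barely lowers R-squared,
# so the incremental contributions are small and m stays close to R-squared.
print(theil_measure(0.6, [0.59, 0.59, 0.58]))
```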
13. Red Indicator
Kovács et al. (2005) presented a new synthetic and normalized indicator for
diagnosing multicollinearity, which uses the eigenvalues or, equivalently,
quantifies the average correlation in the data. Since X′X = R is a symmetric
matrix, the sum of squares of the eigenvalues from the spectral decomposition of
R equals the sum of squares of the matrix elements (Peter Kovács,)
\[ \sum_{j=1}^{p} \lambda_j^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} r_{ij}^2 = p + \sum_{i=1}^{p} \sum_{\substack{j=1 \\ i \neq j}}^{p} r_{ij}^2. \]
The greater the dispersion of the eigenvalues, the greater the correlation among
the regressors. The extent of dispersion in the eigenvalues is used to quantify the
extent of redundancy and is defined as
\[ Red = \frac{ \sqrt{ \dfrac{ \sum_{j=1}^{p} (\lambda_j - 1)^2 }{ p } } }{ \sqrt{p - 1} }. \]
A Red indicator value of zero (or zero percent) means the absence of redundancy,
while a value near 1 means maximum redundancy. The Red indicator can be used to
compare two or more redundant data sets, but it does not allow a direct
conclusion about which data set is more useful. The Red indicator
can be computed without knowing the eigenvalues, because its value is
the quadratic mean of the elements outside the main diagonal of the correlation
matrix R, i.e.
\[ Red = \frac{v_\lambda}{\sqrt{p - 1}} = \frac{ \sqrt{ \dfrac{ \sum_{j=1}^{p} (\lambda_j - 1)^2 }{ p } } }{ \sqrt{p - 1} } = \sqrt{ \frac{ \sum_{i=1}^{p} \sum_{\substack{j=1 \\ i \neq j}}^{p} r_{ij}^2 }{ p(p - 1) } }. \]
The Red indicator is a synthetic indicator, as it quantifies the average correlation
in the correlation matrix of the entire data. Moreover, in comparison with the known
collinearity diagnostic measures, the Red indicator reports correlation more
precisely both in quality and in size.
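The two equivalent forms of the Red indicator can be checked numerically on an equicorrelated matrix, for which the eigenvalues are known in closed form: one eigenvalue equals 1 + (p − 1)ρ and the remaining p − 1 equal 1 − ρ. The Python sketch below is an illustration only; the function names `red_from_correlation` and `red_from_eigenvalues` are hypothetical.

```python
import math


def red_from_correlation(corr):
    """Red as the quadratic mean of the off-diagonal correlations."""
    p = len(corr)
    off_diag = sum(corr[i][j] ** 2
                   for i in range(p) for j in range(p) if i != j)
    return math.sqrt(off_diag / (p * (p - 1)))


def red_from_eigenvalues(eigenvalues):
    """Red from the dispersion of the eigenvalues around 1."""
    p = len(eigenvalues)
    spread = math.sqrt(sum((lam - 1.0) ** 2 for lam in eigenvalues) / p)
    return spread / math.sqrt(p - 1)


p, rho = 6, 0.9
equi = [[1.0 if i == j else rho for j in range(p)] for i in range(p)]
eigs = [1.0 + (p - 1) * rho] + [1.0 - rho] * (p - 1)

# Both routes give (up to rounding) the common correlation rho.
print(red_from_correlation(equi), red_from_eigenvalues(eigs))
```

In this equicorrelated case the indicator reduces to the common correlation itself, which matches its reading as an average correlation.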
14. The Corrected VIF (CVIF)
Curto and Pinto (2011) proposed a new multicollinearity diagnostic
to evaluate the impact of the correlation among regressors on the variance of the
OLSEs, and named it the corrected variance inflation factor (CVIF). The traditional
VIF overestimates when the regression variables contain no redundant information
about the dependent variable.
\[ CVIF_j = VIF_j \times \frac{1 - R^2}{1 - R_0^2}, \]
where $R_0^2 = R^2_{y x_1} + R^2_{y x_2} + \cdots + R^2_{y x_p}$, $R^2$ is the
coefficient of determination from the regression of y on all regressors, and
$VIF_j$ is the VIF value of the jth regressor.
They set the rule of thumb ($CVIF_j \ge 10$) to decide when the variance
magnification effect is serious for the coefficients $\beta_j$.
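The CVIF correction is a rescaling of the ordinary VIF. The Python sketch below only illustrates the formula as written above; the function name `cvif` and the input values are hypothetical, and the case $R_0^2 \ge 1$ is simply rejected here because the formula is not defined for it (how that case should be handled is not covered by this sketch).

```python
def cvif(vif_j, r2_full, r2_simple):
    """Corrected VIF: VIF_j * (1 - R^2) / (1 - R_0^2).

    vif_j     : ordinary VIF of regressor X_j
    r2_full   : R-squared of y on all regressors
    r2_simple : list of squared simple correlations between y and each X_i,
                whose sum is R_0^2
    """
    r2_0 = sum(r2_simple)
    if r2_0 >= 1.0:
        raise ValueError("R_0^2 >= 1: the correction factor is undefined")
    return vif_j * (1.0 - r2_full) / (1.0 - r2_0)


# When R^2 equals R_0^2 the correction factor is 1 and CVIF equals VIF.
print(cvif(4.0, r2_full=0.6, r2_simple=[0.2, 0.4]))

# Here the factor is (1 - 0.8) / (1 - 0.6) = 0.5, so a VIF of 12 is corrected
# to about 6, below the threshold of 10.
print(cvif(12.0, r2_full=0.8, r2_simple=[0.3, 0.3]))
```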
We classified the collinearity diagnostics as overall and individual measures of
collinearity. An overall diagnostic measure gives an idea of the
existence or non-existence of collinearity among all regressors and results in a
single number, while an individual diagnostic measure tries to detect the existence
or non-existence of collinearity for each of the regressors. For simulated and
existing collinear data sets, and for a comparison of overall and individual
measures of collinearity, see Table 3.1, Table 3.2 and Table 3.3. For a listing of
all collinearity diagnostics, with the detection criteria suggested by different
researchers and the corresponding references, see Appendix C.
3.6 New Proposed Diagnostics
Multicollinearity among regressors should always be tested for when examining a
data set, so that its adverse effects and the pitfalls it may create in the
regression model can be avoided (Kmenta, 1980). Various graphical and
numerical diagnostic measures for the quantification of multicollinearity are
available in the literature, as discussed above. However, none of the existing
methods serves as a synthetic and normalized indicator of multicollinearity (see
Belsley et al., 1980; Chen, 2012; Curto and Pinto, 2011; Green et al., 1978;
Gujarati and Porter, 2008; Kovács et al., 2005; Silvey, 1969; Ukoumunne et al., 2002).
We propose two new collinearity diagnostics that depend on the values of $R^2$ and
adjusted $R^2$ ($R^2_{adj}$). The existing collinearity diagnostics depend heavily
on $R^2$, on eigenvalues, or on some relation between the two, and $R^2$, eigenvalues
and the correlation among regressors are regarded as important collinearity
detection criteria.
Thresholds for the proposed measures were set empirically, using regression results
for correlated and uncorrelated regressors obtained from a Monte Carlo scheme with
different levels of correlation among regressors and various sample sizes.
The $R^2$ indicates how well the data fit a statistical model, as it is the
proportion of variation in the dependent variable explained by the independent
variables. The higher the $R^2$ value, the greater the chance that the regressors
are plagued with multicollinearity (see
Asteriou and Hall, 2007; Gujarati and Porter, 2008; Maddala, 1988). The $R^2$ is a
monotone non-decreasing function of the number of regressors included in the model,
which means that $R^2$ inflates the estimate of how well the regression fits the
data (Gujarati and Porter, 2008; Stock and Watson, 2010). The $R^2_{adj}$ is a
modified version of $R^2$ (due to Theil, 1961) that adjusts for the number of
regressors in a model relative to the number of data points. It is an attempt to
account for the phenomenon of $R^2$ increasing spuriously and automatically when
extra regressors are added to the model (Stock and Watson, 2010). It deflates the
$R^2$ by the factor $\frac{n-1}{n-p-1}$. For p > 1, $R^2_{adj} \le R^2$, which implies
that as the number of regressors increases, $R^2_{adj}$ increases
less than the (unadjusted) $R^2$, because $R^2$ is affected by regressors sharing
their variances, i.e., by linear dependence among regressors (Gujarati and Porter,
2008; Maddala, 1988). This relation between $R^2$ and $R^2_{adj}$ is the main
reason for using $R^2_{adj}$ in our proposed diagnostic measures.
Based on the empirical results of the Monte Carlo experiment, the inflation
(spurious increase) in $R^2$ due to the addition of regressors to the model and the
deflation of $R^2$ by the factor $\frac{n-1}{n-p-1}$, we suggest taking the
difference between $R^2$ and $R^2_{adj}$ from the auxiliary regressions of the
regressors for the detection of multicollinearity. This difference accounts for the
sharing of variance among regressors in each auxiliary regression run (see Asteriou
and Hall, 2007; Gujarati and Porter, 2008; Maddala, 1988, for details of auxiliary
regression). The difference between $R_j^2$ and $R^2_{j.adj}$ is used as a new
diagnostic measure and is referred to as Indicator 1 (IND1) in the further discussion.
\[ IND1 = R_j^2 - R^2_{j.adj} = \frac{(n - 1)(1 - R_j^2)}{n - p} + R_j^2 - 1 = (R_j^2 - 1) \times \left( \frac{1 - p}{n - p} \right), \qquad (3.1) \]
where $R_j^2$ and $R^2_{j.adj}$ are from the auxiliary regression of each explanatory variable.
For the simulated collinear and non-collinear data, we found empirically from the
auxiliary regressions that the smaller the difference between $R_j^2$ and
$R^2_{j.adj}$, i.e. $R_j^2 - R^2_{j.adj} \le 0.020$, the greater the chance of
multicollinearity. Alternatively, the larger the value of
$(R_j^2 - R^2_{j.adj})^{-1} \ge 50$, the more severe the multicollinearity
will be. This difference between $R_j^2$ and $R^2_{j.adj}$ from the auxiliary
regressions of the explanatory variables lies in the interval [0.0104, 0.0418] for
the different sample sizes and correlation levels between the generated regressors.
Either extreme of this interval could be used as a criterion, but we used the
central value (the average of the differences over all sample sizes and correlation
values), which is approximately 0.020.
From Eq. (3.1), as n → ∞, IND1 approaches 0. Therefore, multicollinearity is
detected when
\[ \begin{cases} IND1 < c, & \text{for } n < 100 \\ IND1 < \dfrac{c}{n} \times 100, & \text{for } n > 100 \end{cases} \]
The second diagnostic tool is the ratio of each $R^2$ from the auxiliary regressions
($R_j^2$) to the mean of all $R_j^2$ (j = 1, 2, · · · , p) from the auxiliary
regressions, i.e.,
\[ \frac{R_j^2}{m}, \quad \text{where } m = \frac{ \sum_{j=1}^{p} R_j^2 }{ p }. \]
If this ratio for the jth variable is greater than $R^2$ (from the regression of y
on the X's), then the jth regressor is highly collinear with the other regressors.
In the denominator of this diagnostic, the mean m of all $R_j^2$ gives the average
sharing of variance among regressors, obtained from the auxiliary regressions in
which the jth regressor is the dependent variable and the remaining regressors are
explanatory. The distribution of $R_j^2$ for different sample sizes and correlation
levels between variables was found to be approximately normal. Note that if the
correlation among regressors is small, then this proposed indicator (IND2 for
further reference) will give a false positive detection of collinearity, as the
magnitude of $R_j^2 / m$ will be larger than the average of the $R_j^2$'s
(j = 1, 2, · · · , p) in this case. Since the classic symptom of multicollinearity
is $R^2 \ge 0.7$, to avoid false positive detection of multicollinearity, IND2
specifies multicollinearity when
\[ \begin{cases} \dfrac{|R_j^2 - 1|}{m} > R^2, & \text{if } 0.70 \le R^2 < 0.80 \\[2ex] \dfrac{R_j^2}{m} > R^2, & \text{if } R^2 > 0.80 \\[2ex] \text{no collinearity}, & \text{if } R^2 < 0.70 \end{cases} \]
We compared the existing and proposed multicollinearity diagnostic tools for their
detection performance under different levels of correlation and sample sizes.
3.7 Numerical Evaluation
For the numerical evaluation of the different diagnostic measures of
multicollinearity, we followed Monte Carlo schemes similar to those used by many
other researchers (see, e.g., Aslam, 2014b; Clark and Troskie, 2006; Månsson et
al., 2010; McDonald and Galarneau, 1975; Newhouse and Oman, 1971, etc). The
simulation deals with the six-parameter case. The regressors are computed as
\[ x_{ij} = (1 - \rho)^{1/2} z_{ij} + \sqrt{\rho}\, z_{i7}; \quad i = 1, 2, \cdots, n, \; j = 1, 2, \cdots, 6, \]
where $z_{i1}, z_{i2}, \cdots, z_{i7}$ are independent standard normal pseudorandom
numbers, and the correlation between any two regressors is given by ρ. Without loss
of generality, these variables are standardized so that X′X forms a usual
correlation matrix. Five different levels of correlation are considered,
corresponding to ρ = 0.7, 0.8, 0.9, 0.95, 0.99. The values of the generated
predictors are kept fixed for the simulation.
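The generating scheme for the regressors can be sketched as follows. This Python fragment is only an illustration of the formula above, not the R routines used for the reported results: the function names, the seed and the large n in the usage lines are arbitrary choices made here so that the sample correlation sits close to ρ, and the standardization step is omitted.

```python
import math
import random


def generate_regressors(n, rho, p=6, seed=None):
    """Return an n x p list of rows with x_ij = sqrt(1-rho) z_ij + sqrt(rho) z_i,p+1."""
    rng = random.Random(seed)
    a, b = math.sqrt(1.0 - rho), math.sqrt(rho)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p + 1)]   # z_i1, ..., z_i,p+1
        rows.append([a * z[j] + b * z[p] for j in range(p)])
    return rows


def sample_correlation(u, v):
    """Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


X = generate_regressors(n=5000, rho=0.9, seed=2016)
x1 = [row[0] for row in X]
x2 = [row[1] for row in X]
# The sample correlation between any two columns should be close to rho.
print(round(sample_correlation(x1, x2), 3))
```

Because every column shares the same component $z_{i7}$ with weight $\sqrt{\rho}$, each pair of columns has theoretical correlation ρ.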
The sample size (n) is set to 50, 100, 150, 200. The number of Monte Carlo
replications is set to 5000. In addition to this simulation study, for
illustration, the different diagnostic measures were evaluated on some popular
collinear data sets used in previous studies (see Hald, 1952; Longley, 1967;
Malinvaud, 1968). All calculations were performed with routines written in the
R language.
Table 3.1 contains the simulated results for the overall collinearity
diagnostics, as the percentage of replications in which each measure indicates
collinearity among all the regressors. It can be seen that the determinant of X′X,
the Farrar-Glauber chi-square (FGC) test, the Red indicator and Theil's measure
detect collinearity more often than the CI and the sum of reciprocals of
eigenvalues for all ρ > 0.8 and for the different sample sizes (n = 50, 100 and
200), although among these four the determinant detects collinearity poorly for
ρ = 0.7. The percentage of detection by the CI is the lowest of all the overall
diagnostics for the different sample sizes. It improves once ρ exceeds 0.90, and
for ρ ≥ 0.95 detection is at or close to 100% for all sample sizes. For ρ = 0.7
and ρ = 0.8, the sum of reciprocals of eigenvalues detects collinearity among
regressors in a low percentage of cases, but still much more often than the CI.
The FGC and Theil indicators successfully diagnose the collinearity
between regressors.
Table 3.2 contains the simulated results of the collinearity diagnostics for each
regressor Xj, referred to as individual diagnostic measures. For ρ ≥ 0.90 and sample
sizes n = 50, 100 and 200, all diagnostics successfully detect the collinearity among
regressors Xj, except VIF/TOL, Leamer's measure and CVIF. For correlation levels
ρ = 0.70 and 0.80, the diagnostic measures VIF/TOL, CVIF and Leamer's method
could not successfully detect the existence of collinearity among regressors. For
a sample size of 50, the percentages of detection by VIF/TOL and Leamer's method
(when ρ = 0.70) are less than approximately 4% and 17%, respectively. For a sample
of size 100, the percentages of detection by VIF/TOL and Leamer's method (when
ρ = 0.70) are less than 1% and 4.2%, respectively. Similarly, for ρ = 0.80, the
percentages of detection are about 25% and 71%, respectively. The percentage of
collinearity detection by the CVIF indicator is smaller than that of the other
indicators: for a sample of size 50 and ρ = 0.70 it is less than 1%,
whereas the percentage of detection increases as the correlation among regressors
and the sample size both increase. It is worth noting that the percentage of
detection decreases as the sample size increases, which follows the theory that
collinearity among regressors reduces with an increase in sample size.

Table 3.1: Percentage detection of collinearity by overall diagnostic measures

  n   Indicator                            ρ=0.7    ρ=0.8    ρ=0.9   ρ=0.95   ρ=0.99
 50   Determinant                          61.34    99.10   100      100      100
      FGC                                 100      100      100      100      100
      Red Indicator                        99.90   100      100      100      100
      CI                                    0.00     0.16    44.62    99.32   100
      Theil                                99.98   100      100      100      100
      Sum of reciprocal of eigenvalues      0.46    39.54    99.76   100      100
100   Determinant                          54.34    99.86   100      100      100
      FGC                                 100      100      100      100      100
      Red Indicator                       100      100      100      100      100
      CI                                    0.00     0.00    10.16   100      100
      Theil                               100      100      100       99.82   100
      Sum of reciprocal of eigenvalues      0.00    19.48   100      100      100
200   Determinant                          48.60   100      100      100      100
      FGC                                 100      100      100      100      100
      Red Indicator                       100      100      100      100      100
      CI                                    0.00     0.00     0.28    99.90   100
      Theil                               100      100      100      100      100
      Sum of reciprocal of eigenvalues      0.00     5.42   100      100      100
Our proposed collinearity diagnostics (IND1 and IND2) detect the existence of
collinearity among regressors Xj in nearly 100% of cases for most sample sizes and
correlation levels. The exception is IND1 when the regressors are collinear at
ρ = 0.70: for sample sizes of 50, 100 and 200, its percentage of collinearity
detection is roughly 65%, 75% and 84% or less, respectively, while IND2 detects
the existence of collinearity in about 100% of cases for the different correlation
levels and sample sizes. For ρ ≥ 0.80, the percentage of detection is about 100%.
Thus, when collinearity needs to be detected correctly, the proposed measures
do so.
We also performed simulations with very large sample sizes (n = 500, 1000, and 2000)
and with lower correlation levels (ρ = 0.1, 0.3, and 0.5) among regressors. For
example, when n = 100 and ρ = 0.30, the overall diagnostic tools Theil's measure and
FGC still result in 100% false positive collinearity detection. Among the individual
diagnostic measures, the Farrar w_i, F-test and Klein's rule detected collinearity in
most of the cases, reflecting a very high false positive rate. On the other hand, the
proposed indicators IND1 and IND2 detect collinearity only about 10% of the time.
These results are not presented because of the large volume of diagnostic output.
In Table 3.3, we tested all the collinearity diagnostics on existing data sets that
have already been examined in the literature. The results indicate whether each
collinearity diagnostic tool detected, or failed to detect, the
collinearity among regressors for three different datasets. The datasets of
Longley (1967), Malinvaud (1968) and Hald
(1952), which are extremely plagued with multicollinearity, were used. All of the
overall diagnostic measures successfully detected the existence of collinearity
among regressors for these datasets, except Theil's measure for the Malinvaud data
set. Among the individual diagnostic measures, Klein's rule and CVIF failed to
detect the collinearity among regressors for the Longley and Hald datasets. However,
Farrar and Glauber's w_i and F-test also detected collinearity due to
regressor x5 for the Longley dataset, which was not reported by the other indicators.
Our proposed indicators (IND1 and IND2) correctly detected the existence of
collinearity among regressors for all three datasets. The detections by these new
indicators also agree with the results in the existing literature.
Note that, Farrar and Glauber's wi, F -test, FGC, Klein's rule and CVIF may not be
preferred because Farrar and Glauber's tests are criticized by many researchers (see
Table 3.2: Percentage detection of collinearity by individual diagnostics (for each regressor Xj)

  n   ρ      Indicator        X1      X2      X3      X4      X5      X6
 50   0.7    VIF             3.20    2.98    3.50    3.40    3.34    3.54
 50   0.7    TOL             3.20    2.98    3.50    3.40    3.34    3.54
 50   0.7    Farrar w_i    100     100     100     100     100     100
 50   0.7    Leamer         15.72   15.56   16.76   16.32   16.12   16.10
 50   0.7    F-test        100      99.98   99.98   99.98  100      99.98
 50   0.7    Klein         100     100     100     100     100     100
 50   0.7    CVIF            0.44    0.44    0.50    0.42    0.44    0.46
 50   0.7    IND1           64.56   64.82   65.62   65.28   64.80   64.84
 50   0.7    IND2           99.84   99.82   99.78   99.88   99.80   99.92
 50   0.8    VIF            38.58   38.86   38.26   39.96   38.94   39.00
 50   0.8    TOL            38.58   38.86   38.26   39.96   38.94   39.00
 50   0.8    Farrar w_i    100     100     100     100     100     100
 50   0.8    Leamer         71.32   73.40   71.24   72.54   71.94   71.30
 50   0.8    F-test        100     100     100     100     100     100
 50   0.8    Klein         100     100     100     100     100     100
 50   0.8    CVIF            2.30    2.14    2.06    2.26    2.34    2.32
 50   0.8    IND1           97.68   97.46   97.58   97.54   97.70   97.84
 50   0.8    IND2           99.86   99.96   99.90   99.82   99.92   99.84
 50   0.9    VIF            98.94   99.08   98.94   98.86   99.06   98.96
 50   0.9    TOL            98.94   99.08   98.94   98.86   99.06   98.96
 50   0.9    Farrar w_i    100     100     100     100     100     100
 50   0.9    Leamer         99.96   99.92   99.94   99.92   99.98   99.94
 50   0.9    F-test        100     100     100     100     100     100
 50   0.9    Klein         100     100     100     100     100     100
 50   0.9    CVIF           40.66   40.10   39.60   40.50   40.52   40.32
 50   0.9    IND1          100     100     100     100     100     100
 50   0.9    IND2           99.94   99.98   99.98   99.90   99.90   99.96
 50   0.95   VIF           100     100     100     100     100     100
 50   0.95   TOL           100     100     100     100     100     100
 50   0.95   Farrar w_i    100     100     100     100     100     100
 50   0.95   Leamer        100     100     100     100     100     100
 50   0.95   F-test        100     100     100     100     100     100
 50   0.95   Klein         100     100     100     100     100     100
 50   0.95   CVIF           99.70   99.70   99.70   99.70   99.70   99.70
 50   0.95   IND1          100     100     100     100     100     100
 50   0.95   IND2           99.98  100     100      99.98  100      99.98
 50   0.99   VIF           100     100     100     100     100     100
 50   0.99   TOL           100     100     100     100     100     100
 50   0.99   Farrar w_i    100     100     100     100     100     100
 50   0.99   Leamer        100     100     100     100     100     100
 50   0.99   F-test        100     100     100     100     100     100
 50   0.99   Klein         100     100     100     100     100     100
 50   0.99   CVIF           99.62   99.62   99.62   99.62   99.62   99.62
 50   0.99   IND1          100     100     100     100     100     100
 50   0.99   IND2           99.96   99.98   99.90   99.90   99.94   99.90
100   0.7    VIF             0.16    0.08    0.10    0.28    0.18    0.10
100   0.7    TOL             0.16    0.08    0.10    0.28    0.18    0.10
100   0.7    Farrar w_i    100     100     100     100     100     100
100   0.7    Leamer          3.90    3.42    4.10    3.96    4.18    3.92
100   0.7    F-test        100     100     100     100     100     100
100   0.7    Klein         100     100     100     100     100     100
100   0.7    CVIF            0       0       0       0       0       0
100   0.7    IND1           73.72   73.92   74.24   72.74   73.80   74.26
100   0.7    IND2          100     100     100     100      99.98  100
100   0.8    VIF            23.86   24.92   24.04   24.52   24.04   24.58
100   0.8    TOL            23.86   24.92   24.04   24.52   24.04   24.58
100   0.8    Farrar w_i    100     100     100     100     100     100
100   0.8    Leamer         70.84   71.40   70.56   70.38   70.54   70.46
100   0.8    F-test        100     100     100     100     100     100
100   0.8    Klein         100     100     100     100     100     100
100   0.8    CVIF            0.06    0.06    0.04    0.06    0.06    0.06
100   0.8    IND1           99.84   99.82   99.82   99.84   99.78   99.90
100   0.8    IND2          100     100     100     100     100     100
100   0.9    VIF            99.90   99.78   99.88   99.88   99.94   99.90
100   0.9    TOL            99.90   99.78   99.88   99.88   99.94   99.90
100   0.9    Farrar w_i    100     100     100     100     100     100
100   0.9    Leamer        100     100     100     100     100     100
100   0.9    F-test        100     100     100     100     100     100
100   0.9    Klein         100     100     100     100     100     100
100   0.9    CVIF           26.96   27.18   26.50   26.68   26.74   26.68
100   0.9    IND1          100     100     100     100     100     100
100   0.9    IND2          100     100     100     100     100     100
100   0.95   VIF           100     100     100     100     100     100
100   0.95   TOL           100     100     100     100     100     100
100   0.95   Farrar w_i    100     100     100     100     100     100
100   0.95   Leamer        100     100     100     100     100     100
100   0.95   F-test        100     100     100     100     100     100
100   0.95   Klein         100     100     100     100     100     100
100   0.95   CVIF           99.82   99.70   99.62   99.56   99.76   99.74
100   0.95   IND1          100     100     100     100     100     100
100   0.95   IND2          100     100     100     100     100     100
100   0.99   VIF           100     100     100     100     100     100
100   0.99   TOL           100     100     100     100     100     100
100   0.99   Farrar w_i    100     100     100     100     100     100
100   0.99   Leamer        100     100     100     100     100     100
100   0.99   F-test        100     100     100     100     100     100
100   0.99   Klein         100     100     100     100     100     100
100   0.99   CVIF           99.98   99.98   99.98   99.98   99.98   99.98
100   0.99   IND1          100     100     100     100     100     100
100   0.99   IND2          100     100     100     100     100     100
200   0.7    VIF             0       0       0       0       0       0
200   0.7    TOL             0       0       0       0       0       0
200   0.7    Farrar w_i    100     100     100     100     100     100
200   0.7    Leamer          0.54    0.22    0.26    0.20    0.44    0.38
200   0.7    F-test        100     100     100     100     100     100
200   0.7    Klein         100     100     100     100     100     100
200   0.7    CVIF            0       0       0       0       0       0
200   0.7    IND1           83.00   83.24   84.06   83.12   83.56   83.38
200   0.7    IND2          100     100     100     100      99.98  100
200   0.8    VIF            11.92   11.62   12.12   11.52   12.32   12.16
200   0.8    TOL            11.92   11.62   12.12   11.52   12.32   12.16
200   0.8    Farrar w_i    100     100     100     100     100     100
200   0.8    Leamer         73.32   72.44   72.58   72.70   71.84   72.60
200   0.8    F-test        100     100     100     100     100     100
200   0.8    Klein         100     100     100     100     100     100
200   0.8    CVIF            0.08    0.12    0.16    0.12    0.12    0.12
200   0.8    IND1          100     100     100     100     100     100
200   0.8    IND2          100     100     100     100     100     100
200   0.9    VIF           100     100      99.98  100     100     100
200   0.9    TOL           100     100     100     100     100     100
200   0.9    Farrar w_i    100     100     100     100     100     100
200   0.9    Leamer        100     100     100     100     100     100
200   0.9    F-test        100     100     100     100     100     100
200   0.9    Klein         100     100     100     100     100     100
200   0.9    CVIF           13.16   14.02   13.34   13.45   13.24   14.36
200   0.9    IND1          100     100     100     100     100     100
200   0.9    IND2          100     100     100     100     100     100
200   0.95   VIF           100     100     100     100     100     100
200   0.95   TOL           100     100     100     100     100     100
200   0.95   Farrar w_i    100     100     100     100     100     100
200   0.95   Leamer        100     100     100     100     100     100
200   0.95   F-test        100     100     100     100     100     100
200   0.95   Klein         100     100     100     100     100     100
200   0.95   CVIF           99.98   99.96   99.98  100      99.98  100
200   0.95   IND1          100     100     100     100     100     100
200   0.95   IND2          100     100     100     100     100     100
200   0.99   VIF           100     100     100     100     100     100
200   0.99   TOL           100     100     100     100     100     100
200   0.99   Farrar w_i    100     100     100     100     100     100
200   0.99   Leamer        100     100     100     100     100     100
200   0.99   F-test        100     100     100     100     100     100
200   0.99   Klein         100     100     100     100     100     100
200   0.99   CVIF          100     100     100     100     100     100
200   0.99   IND1          100     100     100     100     100     100
200   0.99   IND2          100     100     100     100     100     100
Table 3.3: Collinearity detection by overall and individual indicators for existing collinear datasets

Overall diagnostics (Results *)
Indicator                          Longley  Malinvaud  Hald
Determinant                              1          1     1
Farrar χ2                                1          1     1
Red Indicator                            1          1     1
CI                                       1          1     1
Theil                                    1          0     1
Sum of reciprocal of eigenvalues         1          1     1

Individual diagnostics (Results *)
Longley     X1  X2  X3  X4  X5  X6
VIF          1   1   1   1   0   1
TOL          1   1   1   1   0   1
Farrar wi    1   1   1   1   1   1
Leamer       1   1   1   1   0   1
F-test       1   1   1   1   1   1
Klein        1   0   1   0   0   1
CVIF         0   0   0   0   0   0
IND1         1   1   1   1   0   1
IND2         1   1   1   1   0   1

Malinvaud   X1  X2  X3
VIF          1   0   1
TOL          1   0   1
Farrar wi    1   0   1
Leamer       1   0   1
F-test       1   0   1
Klein        1   0   1
CVIF         1   0   1
IND1         1   0   1
IND2         1   0   1

Hald        X1  X2  X3  X4
VIF          1   1   1   1
TOL          1   1   1   1
Farrar wi    1   1   1   1
Leamer       1   1   1   1
F-test       1   1   1   1
Klein        0   1   0   1
CVIF         0   0   0   0
IND1         1   1   1   1
IND2         1   1   1   1

* 1 indicates that collinearity is detected by the indicator, while 0 indicates that no collinearity is detected.
Haistovsky, 1969; Kumar, 1975; O'Hagan and McCabe, 1975) because of statistical
properties and high false positive detection by these diagnostic measures.
3.8 mctest: An R Package for Collinearity Detection
In this section, we illustrate the use of our R package mctest (Imdadullah and
Aslam, 2016) for detecting collinearity among regressors. The package implements
the popular and widely used diagnostic measures available in the existing literature
together with our proposed diagnostic measures. It contains the functions imcdiag,
omcdiag, mc.plot and mctest. For the numerical methods of collinearity detection,
see Section 3.7. The imcdiag function computes and lists the individual diagnostic
measures, the omcdiag function displays the overall collinearity diagnostics, and the
mc.plot function draws the VIF and eigenvalue plots. Besides computing the
individual and overall diagnostic measures, imcdiag and omcdiag also indicate
whether or not each measure detects collinearity among the regressors; for the
individual diagnostic measures, the results additionally indicate which regressors
may be the source of multicollinearity. The package can be downloaded from
https://cran.r-project.org/web/packages/mctest/.
The mctest package must first be installed, that is,
> install.packages("mctest")
After the mctest package has been installed successfully, it needs to be loaded so
that the package functions become accessible in the current R session, that is,
> library(mctest)
For package help and working examples of available functions, use the following
command,
> help("mctest")
The main function mctest is used to display the overall, the individual, or both
types of collinearity diagnostic measures. The syntax of mctest is,
mctest(x, y, type=c("o","i","b"), na.rm=TRUE, Inter=TRUE, method=NULL,
       corr=FALSE, detr=0.01, red=0.5, theil=0.5, cn=30, vif=10, tol=0.10,
       conf=0.95, cvif=10, ind1=0.02, ind2=0.7, leamer=0.1, ...)
The description of mctest's arguments is given in Table 3.4.
Table 3.4: Description of mctest Package's arguments

Argument  Description
x         A numeric matrix of regressors; it should contain two or more regressors.
y         A numeric vector of the response variable.
type      Choice of overall, individual or both types of diagnostic measures.
          Option "i" prints individual, "o" overall, while "b" prints both
          (individual and overall) diagnostic measures.
na.rm     Whether to remove missing observations. By default, missing
          observations are removed from the data.
Inter     Inclusion or exclusion of the intercept term in the X matrix for
          eigenvalues and CN.
method    Choice of certain individual diagnostic measures. The options for the
          different individual measures are "VIF", "Wi", "Fi", "Leamer", "CVIF",
          "IND1", "IND2" and "Klein". If no option is selected, all of the
          individual diagnostic measures are printed.
corr      Whether to display the correlation matrix. By default, the correlation
          matrix is not printed.
detr      Threshold of the determinant; default detr = 0.01
red       Threshold of the Red indicator; default red = 0.50
theil     Threshold of Theil's indicator; default theil = 0.50
cn        Threshold of the Condition Number; default cn = 30
vif       Threshold of the VIF measure; default vif = 10
tol       Threshold of the TOL measure; default tol = 0.10
conf      Confidence level for Farrar's tests; default conf = 0.95
cvif      Threshold for the CVIF measure; default cvif = 10
ind1      Threshold for the IND1 indicator; default ind1 = 0.02
ind2      Threshold for the IND2 indicator; default ind2 = 0.70
leamer    Threshold for Leamer's method; default leamer = 0.10
...       Extra arguments; ignored if used.
3.8.1 Collinearity Detection using mctest Package
For the detection of collinearity among regressors, we use the Hald data as an
example. The data set is bundled with the mctest package and can be loaded with the
data command. After loading the data, the regressors and the response variable are
stored in the x and y variables respectively, as given below,
> data(Hald)
> x <- Hald[,-1] # regressors
> y <- Hald[, 1] # response variable
3.8.1.1 Overall Collinearity Diagnostics
For the computation of overall diagnostics, the following commands with different
argument(s) can be used to obtain results with an indication of the existence of
collinearity among regressors.
> mctest(x, y)
> mctest(x, y, Inter=FALSE)
> mctest(x, y, cn=20, detr=0.001)
> mctest(x, y, type="o")
The results from command mctest(x,y, cn=20, detr=0.001) are;
Call:
omcdiag(x = x, y = y, Inter = TRUE , detr = detr , red = red , conf =
conf , theil = theil , cn = cn)
Overall Multicollinearity Diagnostics
MC Results detection
Determinant |X'X|: 0.0011 0
Farrar Chi -Sqaure: 59.8700 1
Red Indicator: 0.5414 1
Sum of Lambda Inverse: 622.3006 1
Theil 's Method: 0.9981 1
Condition Number: 249.5783 1
1 --> COLLINEARITY is detected
0 --> COLLINEARITY in not detected by the test
====================================
Eigven values with INTERCEPT
Intercept X1 X2 X3 X4
Eigven Values: 4.1197 0.5539 0.2887 0.0376 0.0001
Condition Indeces: 1.0000 2.7272 3.7775 10.4621 249.5783
The mctest function calls the omcdiag function for the overall collinearity
diagnostic measures; therefore omcdiag can be used directly as an alternative. The
following example produces the diagnostics under user-provided thresholds, while
the eigenvalues are displayed without the intercept term.
> omcdiag(x, y, Inter=FALSE, red=.6, theil=.4, cn=25, conf=0.99)
> omcdiag(x, y, Inter=FALSE)
The results from the omcdiag(x, y, Inter=FALSE) command, where the Inter argument
is set to FALSE, are;
Call:
omcdiag(x, y, Inter = FALSE)
Overall Multicollinearity Diagnostics
MC Results detection
Determinant |X'X|: 0.0011 1
Farrar Chi -Sqaure: 59.8700 1
Red Indicator: 0.5414 1
Sum of Lambda Inverse: 622.3006 1
Theil 's Method: 0.9981 1
Condition Number: 9.4325 0
1 --> COLLINEARITY is detected
0 --> COLLINEARITY in not detected by the test
===================================
Eigen values without INTERCEPT
X1 X2 X3 X4
Eigven Values: 3.1231 0.5535 0.2883 0.0351
Condition Indeces: 1.0000 2.3754 3.2911 9.4325
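To make the overall measures more concrete, the following base R sketch computes several of them directly from the correlation matrix of the regressors. It is only an illustration under stated assumptions: the data are simulated (not the Hald data), all object names are hypothetical, and the usual textbook formulas are used, which may differ in detail (for example in scaling or in the treatment of the intercept) from the implementation in mctest.

```r
# Simulated collinear regressors (illustrative assumption, not the Hald data)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)        # nearly a copy of x1
x3 <- rnorm(n)
X  <- cbind(X1 = x1, X2 = x2, X3 = x3)
p  <- ncol(X)

R       <- cor(X)                                   # correlation matrix of the regressors
detR    <- det(R)                                   # determinant; values near 0 signal collinearity
chisq   <- -(n - 1 - (2 * p + 5) / 6) * log(detR)   # Farrar-Glauber chi-square statistic
lambda  <- eigen(R, symmetric = TRUE)$values        # eigenvalues (without intercept)
ci      <- sqrt(max(lambda) / lambda)               # condition indices
cn      <- sqrt(max(lambda) / min(lambda))          # condition number
sum_inv <- sum(1 / lambda)                          # sum of reciprocals of the eigenvalues
red     <- sqrt(2 * sum(R[upper.tri(R)]^2) / (p * (p - 1)))  # Red indicator

overall <- c(determinant = detR, farrar_chisq = chisq, red = red,
             sum_inv_lambda = sum_inv, condition_number = cn)
print(round(overall, 4))
```

Each value can then be compared with a chosen threshold (for instance the defaults listed in Table 3.4) to obtain a 0/1 detection indicator.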
3.8.1.2 Individual Collinearity Diagnostics
Individual collinearity diagnostic measures can be obtained by using the type
argument in mctest function. If argument method is not used, all individual
diagnostics will be displayed. For example,
> mctest(x, y, type="i")
Call:
imcdiag(x = x, y = y, method = method, corr = FALSE, vif = vif, tol = tol, conf = conf,
    cvif = cvif, ind1 = ind1, ind2 = ind2, leamer = leamer)
Individual Multicollinearity Diagnostics
        VIF    TOL       Wi        Fi Leamer    CVIF   IND1   IND2 Klein
X1  38.4962 0.0260 112.4886  187.4811 0.1612 -0.5846 0.0087 0.9875     0
X2 254.4232 0.0039 760.2695 1267.1158 0.0627 -3.8635 0.0013 1.0099     1
X3  46.8684 0.0213 137.6052  229.3419 0.1461 -0.7117 0.0071 0.9923     0
X4 282.5129 0.0035 844.5386 1407.5643 0.0595 -4.2900 0.0012 1.0103     1
1 --> COLLINEARITY is detected
0 --> COLLINEARITY in not detected by the test
X1 , X2 , X3 , X4 , coefficient(s) are non-significant may be due to
multicollinearity
* use method argument to check which regressors may be the reason of collinearity
===================================
To get specific individual diagnostics such as VIF, Leamer's method or IND1, use
the method argument. For example,
> mctest(x, y, type="i", method="VIF")
> mctest(x, y, type="i", method="IND1")
> mctest(x, y, type="i", method="CVIF", cvif=5)
> imcdiag(x, y)
> imcdiag(x, y, method="VIF", vif=10)
The last command, imcdiag(x, y, method="VIF", vif=10), displays the VIF values
and indicates whether or not each regressor is collinear by comparing its VIF with
the threshold defined in the argument vif=10. VIF values larger than 10 are
indicated by 1.
Call:
imcdiag(x = x, y = y, method = "VIF", vif = 10)
Individual Multicollinearity Diagnostics
VIF detection
X1 38.4962 1
X2 254.4232 1
X3 46.8684 1
X4 282.5129 1
Multicollinearity may be due to X1 X2 X3 X4 regressors
1 --> COLLINEARITY is detected
0 --> COLLINEARITY in not detected by the test
===================================
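The VIF column above can be reproduced in principle with a few lines of base R. The sketch below is an illustration on simulated data (the data and all object names are assumptions made for this example, and it is not the mctest code). It computes VIF_j = 1/(1 - R_j^2) from the auxiliary regressions, checks it against the diagonal of the inverse correlation matrix, and flags values above the threshold of 10.

```r
set.seed(2)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.2)   # nearly a linear combination of x1 and x2
X  <- data.frame(X1 = x1, X2 = x2, X3 = x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other regressors
vif_by_regression <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse of the correlation matrix
vif_by_inverse <- diag(solve(cor(X)))

tol       <- 1 / vif_by_inverse                 # tolerance is the reciprocal of VIF
detection <- as.integer(vif_by_inverse > 10)    # 1 if VIF exceeds the threshold
print(round(cbind(VIF = vif_by_inverse, TOL = tol, detection = detection), 4))
```

The threshold of 10 mirrors the default vif=10 of Table 3.4; any other threshold can be substituted in the comparison.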
Both individual and overall collinearity diagnostics can also be obtained by setting
the type argument to "b", that is,
> mctest(x, y, type="b")
> mctest(x, y, type="b", method="VIF", vif=5, detr=0.01)
The command mctest(x, y, type="b", method="VIF", vif=5, detr=0.01)
produces both overall and individual collinearity diagnostics. The determinant
threshold is set to 0.01 (detr=0.01) for the overall diagnostics and the VIF
threshold is set to 5 (vif=5). This command displays the individual collinearity
diagnostic for VIF with its detection indication, together with all overall
collinearity diagnostics, where the intercept term is included for the eigenvalues.
3.8.1.3 Graphical Detection
The function mc.plot can be used to plot the VIF and the eigenvalues for graphical
detection of collinearity among regressors. In each plot, a horizontal red dotted
line is drawn at the threshold of the VIF and of the eigenvalues respectively. The
default thresholds for the VIF and the eigenvalues are 10 and 0.01 respectively. The
computed VIF and eigenvalue are shown for each regressor. If the argument Inter is
set to TRUE, the eigenvalues plot is produced with the intercept term, otherwise
without it. For example,
> mc.plot(x, y, vif=10, ev=0.01)
> mc.plot(x, y, vif=10, ev=0.01, Inter=TRUE)
> mc.plot(x, y, Inter=TRUE)
In the next chapter, we discuss a widely used biased estimation method for
overcoming the problem of multicollinearity, that is, RR.
Figure 3.1: VIF and Eigenvalues plot without intercept term
Figure 3.2: VIF and Eigenvalues plot with intercept term
Chapter 4
Ridge Regression: Construction of R
Package
4.1 Introduction
As already discussed, for data collected either from a designed experiment or from an
observational study, multiple regression analysis is used to find the effect of certain
explanatory variable(s) while holding all other variables constant. The ordinary
regression technique does not provide precise estimates of the effect of any particular
explanatory variable, especially when the explanatory variables are interdependent
(collinear).
The OLSE, or the maximum likelihood estimator (MLE), of $\beta$ in Eq. (2.1) is
$$\hat{\beta} = (X'X)^{-1}X'y, \tag{4.1}$$
which depends on the characteristics of the matrix $X'X$. If $X'X$ is ill-conditioned
(near dependencies exist among the regressors or among various columns of $X'X$, or
$\det(X'X) \approx 0$, where $0 < |X'X| < 1$), then the LSEs are sensitive to a number
of errors, such as non-significant or imprecise regression coefficients (Kmenta, 1971)
with wrong signs and a non-uniform eigenvalue spectrum.
Usually, the regressors under consideration in a multiple linear regression analysis
are not all orthogonal, i.e. $X'X \neq I$, meaning that the correlation matrix of the
regressors departs from the identity matrix. Equivalently, there are linear
relationships among the regressors, or the correlation matrix approaches singularity,
with inflated variances of the estimated parameters. If this departure is large, the
standard errors are high and the regression coefficients tend to be unstable, so that
the solution for the regression coefficients is not well determined (and is not unique
under exact collinearity) and is difficult to interpret (see Chapter 3, Section 3.3).
Similarly, collinearity remedies may be time consuming, computationally expensive,
controversial, costly or even impossible to achieve (Maddala, 1992). Therefore,
collinearity diagnostic techniques that signal the existence of collinearity are crucial.
In the case of multicollinearity, the researcher may be tempted to eliminate the
regressor(s) causing the problem, either by consciously removing them from the model
or by using some screening method such as stepwise or best subset regression.
However, these methods can destroy the usefulness of the model by removing relevant
regressors from it. To control the variance and instability of the LS estimates, one
might instead regularize the coefficients, that is, use a regularization method to
find the model coefficients $\beta$. Two commonly used regularization methods are RR
and Lasso regression. They are alternatives to OLS for collinear data (Dempster et
al., 1977), especially when all of the regressors are needed in the model, provided
that the ridge estimators are, on average, closer to the parameter being estimated
than the LS estimator (Rawlings et al., 1998). Computationally, RR suppresses the
effects of collinearity and reduces the apparent magnitude of the correlation among
regressors (Rawlings et al., 1998), in order to obtain more stable estimates of the
coefficients than the OLS estimates (Montgomery and Peck, 1982; Myers, 1986; Tripp,
1983). In the present chapter, we focus on RR only.
Hoerl (1959, 1962, 1964) and Hoerl and Kennard (1970a,b) developed the ridge
analysis technique, which deals with the departure of the data from orthogonality,
i.e. when the $X'X$ matrix is singular or nearly singular. Hoerl (1962) introduced RR
based on the James-Stein estimator, stating that the existence of correlation
(multicollinearity) among regressors can cause errors (see Chapter 3, Section 3.3) in
estimating the coefficients when the LS method is applied. RR is similar to the LS
technique, except that it shrinks the estimated coefficients towards zero in order to
reduce the mean squared error (MSE; a measure of the average closeness of an
estimator to the parameter being estimated, see John, 1998) of the estimates, which
makes the RR technique better than the usual LSE with respect to MSE when the
regressors are collinear. As a penalty (degree of bias) is imposed on the sizes of the
coefficients in RR, a substantial reduction in their variances occurs, while the
expected values of these estimates are not equal to the true values and tend to
underestimate the true parameters. Although the ridge estimates are biased, they have
lower MSE (more precision) than the LS estimates (Mardikyan and Cetin, 2008), and
they are less sensitive to sampling fluctuations, to model misspecification when the
number of explanatory variables exceeds the number of observations in the data set
(i.e. $p > n$), and to omitted-variable specification bias (Theil, 1957). In summary,
the RR procedure is intended to overcome the ill-conditioned situation: it is used to
improve the estimation of the regression coefficients when the regressors are
correlated, and it also improves the accuracy of prediction (Seber and Lee, 2003).
Obtaining the ridge model coefficients $\hat{\beta}_R$ is relatively straightforward, as
the ridge coefficients can be obtained by solving a modified form of the LS normal
equations. Suppose that the standardized regression model is
$$y = \beta_{R1}X_1 + \beta_{R2}X_2 + \cdots + \beta_{Rp}X_p + \varepsilon. \tag{4.2}$$
For the RR coefficients, the estimating equations are
$$\begin{aligned}
(1+k)\hat{\beta}_{R1} + r_{12}\hat{\beta}_{R2} + \cdots + r_{1p}\hat{\beta}_{Rp} &= r_{1y} \\
r_{21}\hat{\beta}_{R1} + (1+k)\hat{\beta}_{R2} + \cdots + r_{2p}\hat{\beta}_{Rp} &= r_{2y} \\
\vdots \qquad\qquad \vdots \qquad\qquad\qquad \vdots \quad &\;\;\vdots \\
r_{p1}\hat{\beta}_{R1} + r_{p2}\hat{\beta}_{R2} + \cdots + (1+k)\hat{\beta}_{Rp} &= r_{py},
\end{aligned}$$
where $r_{ij}$ is the correlation between the $i$th and $j$th regressors and $r_{iy}$ is
the correlation between the $i$th regressor and the centered response variable $y$
(centered as $y - \bar{y}$). The $\hat{\beta}_{Rj}$, $j = 1, 2, \cdots, p$, are the
estimated ridge coefficients and $k$ is the non-negative shrinkage parameter that
distinguishes RR from OLS. The statistical model for RR and its assumptions, such as
linearity, constant variance and independence, are the same as those of OLS
regression.
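All of the regressors can be standardized, scaled or centered. The standardization
described by Belsley et al. (1980) and Draper and Smith (1998) is
$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}}, \qquad j = 1, 2, \cdots, p,$$
such that $\bar{\tilde{X}}_j = 0$ and $\tilde{X}_j'\tilde{X}_j = 1$, where $\tilde{X}_j$
is the $j$th column of the matrix $\tilde{X}$. Therefore, the new design matrix
$\tilde{X}$ contains the $p$ standardized columns (regressors) instead of $(p+1)$,
with data points as follows:
$$\tilde{X} =
\begin{pmatrix}
\frac{x_{11}-\bar{x}_1}{\sqrt{\sum(x_{i1}-\bar{x}_1)^2}} & \frac{x_{12}-\bar{x}_2}{\sqrt{\sum(x_{i2}-\bar{x}_2)^2}} & \cdots & \frac{x_{1p}-\bar{x}_p}{\sqrt{\sum(x_{ip}-\bar{x}_p)^2}} \\
\frac{x_{21}-\bar{x}_1}{\sqrt{\sum(x_{i1}-\bar{x}_1)^2}} & \frac{x_{22}-\bar{x}_2}{\sqrt{\sum(x_{i2}-\bar{x}_2)^2}} & \cdots & \frac{x_{2p}-\bar{x}_p}{\sqrt{\sum(x_{ip}-\bar{x}_p)^2}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{x_{n1}-\bar{x}_1}{\sqrt{\sum(x_{i1}-\bar{x}_1)^2}} & \frac{x_{n2}-\bar{x}_2}{\sqrt{\sum(x_{i2}-\bar{x}_2)^2}} & \cdots & \frac{x_{np}-\bar{x}_p}{\sqrt{\sum(x_{ip}-\bar{x}_p)^2}}
\end{pmatrix}, \tag{4.3}$$
such that $\tilde{X}'\tilde{X}$ is the correlation matrix of the regressors. To avoid a
proliferation of notation, throughout this and the following sections the centered
and scaled design matrix $\tilde{X}$ is represented by $X$ and the centered response
by $y$.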
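The standardization of Eq. (4.3) can be checked numerically. The following base R sketch uses simulated data and hypothetical object names (both are assumptions made for illustration) to centre each column, scale it to unit length, and confirm that the cross-product of the standardized matrix agrees with the correlation matrix.

```r
set.seed(3)
X <- cbind(X1 = rnorm(20, 10, 2), X2 = rnorm(20, 50, 10), X3 = runif(20))

# Centre each column, then scale it to unit length
Xc <- sweep(X, 2, colMeans(X))                 # x_ij - mean of column j
Xs <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # divide by sqrt(sum (x_ij - mean_j)^2)

print(round(colSums(Xs), 12))   # column sums are numerically zero
print(colSums(Xs^2))            # every column has unit length
print(round(crossprod(Xs), 4))  # X'X of the standardized matrix
print(round(cor(X), 4))         # correlation matrix of the raw regressors
```

Note that this unit-length scaling differs from scale(X) in R, which divides by the sample standard deviation; the two differ by the factor sqrt(n - 1).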
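Imposing the ridge constraint,
$$\text{minimize} \; \sum_{i=1}^{n}(y_i - \beta'X_i)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}\beta_j^2 \le t,$$
the penalized residual sum of squares (PRSS) can be computed from
$$\begin{aligned}
PRSS(\beta)_{\ell_2} &= \sum_{i=1}^{n}(y_i - X_i'\beta)^2 + k\sum_{j=1}^{p}\beta_j^2, \\
&= (y - X\beta)'(y - X\beta) + k\|\beta\|_2^2 .
\end{aligned} \tag{4.4}$$
The solution of this PRSS may have a smaller average prediction error (PE) than the
OLS ($\hat{\beta}_{ols}$). From Eq. (4.4), $PRSS(\beta)_{\ell_2}$ is convex and, for
$k > 0$, has a unique solution. For data points $(x_1, y_1), (x_2, y_2), \cdots,
(x_n, y_n)$, the ridge model coefficients $\hat{\beta}_R$ are obtained by setting the
derivative
$$\frac{\partial PRSS(\beta)_{\ell_2}}{\partial \beta} = -2X'(y - X\beta) + 2k\beta$$
equal to zero, which gives
$$\hat{\beta}_{Rk} = (X'X + kI_p)^{-1}X'y, \tag{4.5}$$
where $\hat{\beta}_{Rk}$ is the vector of standardized RR coefficients of order
$p \times 1$, as the intercept term is not included in the model, and $kI_p$ is a
positive semidefinite matrix added to the $X'X$ matrix. Note that for $k = 0$,
$\hat{\beta}_{Rk} = \hat{\beta}$.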
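As a minimal sketch of Eq. (4.5), the following base R code computes the ridge coefficients for standardized regressors and a centred response. The helper name ridge_coef and the simulated data are assumptions introduced only for this illustration.

```r
# Hypothetical helper: ridge coefficients for standardized X and centred y
ridge_coef <- function(X, y, k) {
  p <- ncol(X)
  drop(solve(crossprod(X) + k * diag(p), crossprod(X, y)))  # (X'X + kI)^{-1} X'y
}

set.seed(4)
n  <- 40
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)      # strongly collinear with x1
x3 <- rnorm(n)
Xr <- cbind(x1, x2, x3)
y  <- 1 + 2 * x1 + 2 * x2 - x3 + rnorm(n)

Xc <- sweep(Xr, 2, colMeans(Xr))
X  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # centred, unit-length columns
yc <- y - mean(y)                              # centred response

b_ols   <- ridge_coef(X, yc, k = 0)     # k = 0 reproduces the OLS solution
b_ridge <- ridge_coef(X, yc, k = 0.1)   # k > 0 shrinks the coefficients
print(round(rbind(OLS = b_ols, ridge = b_ridge), 4))
```

Solving the linear system with solve(A, b) avoids forming the explicit inverse, which is usually preferable numerically.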
The bias is introduced through $k$ (where $k > 0$), called the biasing or ridge
parameter and also known as the penalty or shrinkage parameter, after the
standardization of the design matrix $X$. The addition of this constant to the
diagonal elements of the $X'X$ matrix guarantees the invertibility of $X'X + kI$, so
that a unique solution $\hat{\beta}_R$ always exists (Draper and Smith, 1998; Hoerl
and Kennard, 1970a; McCallum, 1970), and the CN of $X'X + kI$, i.e.,
$$CN_k = \sqrt{\frac{\lambda_1 + k}{\lambda_p + k}},$$
is reduced compared to that of $X'X$, where $\lambda_1$ is the largest and $\lambda_p$
the smallest eigenvalue of the correlation matrix $X'X$. Note that the CN of
$X'X + kI$ decreases as $k$ increases. Therefore, the ridge estimator is an
improvement over the LSE for collinear data, and it differs from LS through the
addition of the perturbation matrix $kI_p$ to $X'X$. The addition of this
perturbation matrix could take different forms.
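The effect of $k$ on the condition number can be seen in a small numerical sketch. The simulated regressors, the grid of $k$ values and the function name cn_k are assumptions chosen only for illustration.

```r
set.seed(5)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # collinear pair
x3 <- rnorm(n)
lambda <- eigen(cor(cbind(x1, x2, x3)), symmetric = TRUE)$values

# The eigenvalues of X'X + kI are lambda_j + k, so
# CN_k = sqrt((lambda_1 + k) / (lambda_p + k))
cn_k <- function(k) sqrt((max(lambda) + k) / (min(lambda) + k))

k_grid <- c(0, 0.01, 0.05, 0.1, 0.5, 1)
cn     <- sapply(k_grid, cn_k)
print(round(cbind(k = k_grid, CN = cn), 3))   # CN falls as k grows
```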
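The biasing parameter $k \ge 0$ is a non-stochastic ridge constant that ranges from
zero to infinity; it is used to control the size of the regression coefficients and
the amount of regularization. As $k \to 0$, $\hat{\beta}_R \to \hat{\beta}$, and as
$k \to \infty$, $\hat{\beta}_R \to 0$, while the fitted values become smoother. In the
case of an orthonormal design matrix $X$,
$$\hat{\beta}_{Rj} = \frac{\hat{\beta}_{ols,j}}{1+k}.$$
The $\hat{\beta}_R$ is computed on standardized variables, so the coefficients need to
be transformed back to the original scale, that is,
$$\hat{\beta}_R = \left(\frac{\hat{\beta}_{Rj}}{S_{x_j}}\right). \tag{4.6}$$
The intercept term for RR, $\hat{\beta}_{R0}$, can be estimated from the standardized
RR by using the following relation:
$$\begin{aligned}
\hat{\beta}_{R0k} &= \bar{y} - (\hat{\beta}_{1R}, \cdots, \hat{\beta}_{pR})\,\bar{x}' \\
&= \bar{y} - \sum_{j=1}^{p}\bar{x}_j\hat{\beta}_{jR}.
\end{aligned} \tag{4.7}$$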
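The back-transformation of Eq. (4.6) and the intercept of Eq. (4.7) can be sketched as follows. The simulated data and the function name ridge_original_scale are assumptions, and $S_{x_j}$ is taken here to be the unit-length scaling constant of Eq. (4.3).

```r
set.seed(6)
n  <- 30
x1 <- rnorm(n, 5, 2)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.3)
Xr <- cbind(x1 = x1, x2 = x2)
y  <- 3 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

xbar <- colMeans(Xr)
Xc   <- sweep(Xr, 2, xbar)
sx   <- sqrt(colSums(Xc^2))        # scaling constants S_xj (unit-length scaling)
X    <- sweep(Xc, 2, sx, "/")
yc   <- y - mean(y)

ridge_original_scale <- function(k) {
  b_std <- drop(solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, yc)))
  b     <- b_std / sx                    # Eq. (4.6): back to the original scale
  b0    <- mean(y) - sum(xbar * b)       # Eq. (4.7): intercept
  c(intercept = b0, b)
}

print(round(rbind(k_0 = ridge_original_scale(0),
                  k_0.1 = ridge_original_scale(0.1)), 4))
```

With k = 0 the result coincides with the ordinary least squares fit of y on x1 and x2, which gives a simple check of the back-transformation.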
The RR method is also known as Tikhonov regularization (Tikhonov, 1943). It is
intended to overcome the ill-conditioned situation in which near dependencies between
various columns of the matrix $X$ cause $X'X$ to be singular or nearly singular,
giving rise to unstable estimates of the regression coefficients (parameters),
typically with large standard errors. It is therefore desirable to select the
smallest value of $k$ (amount of bias) for which the regression coefficients
stabilize. There always exists a particular value of $k$ for which the total MSE of
the ridge estimates is less than the MSE of the OLS estimators; however, the optimum
value of $k$ (the one that produces the minimum MSE compared to other values of $k$)
varies from one application to another and is unknown. Any estimator that has a small
amount of bias and less variance, and that is considerably more precise than an
unbiased estimator, may be preferred because of its larger probability of being
close to the true parameter being estimated. Therefore, the criterion of goodness of
estimation considered in RR is the minimum total MSE, while in OLS regression it is
the minimum RSS.
The ridge solutions are not equivariant under scaling of the regressors (Berk, 2008;
Brown, 1977; Smith and Campbell, 1980), because the ridge coefficients can change to
a great extent when a predictor is multiplied by a constant. The value of
$X_j\hat{\beta}_{Rjk}$ depends not only on the value of $k$ but also on the scaling of
the $j$th predictor, and it may also depend on the scaling of the other predictors
(James et al., 2013). Therefore, before solving for $\hat{\beta}_R$, the regressors
need to be standardized. Since most penalties act on the magnitude (size) of the
coefficients, it is important to normalize the predictors so that they are unit-less,
i.e. in comparable numerical units. Ridge solutions can nevertheless be difficult to
interpret, as none of the $\hat{\beta}_{Rj}$'s is set exactly to zero. This does not
mean that biased models cannot be better; we need a way to find a "good" biased model.
4.2 Hoerl and Kennard's Reasoning
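Pindyck and Rubinfeld (1979) proved that the MSE can be decomposed into two
quantities, (i) the total variance and (ii) the total squared bias of the regression
coefficients, i.e.,
$$MSE = \sum_{j=1}^{p} Var(\hat{\beta}_{Rj}) + Bias^2(\hat{\beta}_R). \tag{4.8}$$
The ridge estimator ($\hat{\beta}_{R}$) inflates the RSS, which can be written as
$$\begin{aligned}
\phi = RSS = \hat{\varepsilon}'\hat{\varepsilon} &= (y - X\hat{\beta}_R)'(y - X\hat{\beta}_R), \\
&= (y - X\hat{\beta})'(y - X\hat{\beta}) + (\hat{\beta}_R - \hat{\beta})'X'X(\hat{\beta}_R - \hat{\beta}), \\
&= \phi_{\min} + \phi_0(\hat{\beta}_R),
\end{aligned} \tag{4.9}$$
where $\phi_0(\hat{\beta}_R) = (\hat{\beta}_R - \hat{\beta})'X'X(\hat{\beta}_R - \hat{\beta})$
is the inflation in the RSS. RR minimizes the squared length of the coefficients
subject to a constant inflation $\phi_0(\hat{\beta}_R)$ in the RSS, which is
determined by the minimum MSE.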
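The $\hat{\beta}_R$ is the solution that minimizes $\hat{\beta}_R'\hat{\beta}_R$
subject to
$$\phi_0 = (\hat{\beta}_R - \hat{\beta})'X'X(\hat{\beta}_R - \hat{\beta}).$$
The solution of this problem is obtained by minimizing the Lagrangian function
$$f(\hat{\beta}_R) = \hat{\beta}_R'\hat{\beta}_R + \frac{1}{k}\left[(\hat{\beta}_R - \hat{\beta})'X'X(\hat{\beta}_R - \hat{\beta}) - \phi_0\right],$$
where $\frac{1}{k}$ is the Lagrangian multiplier. At the minimum of $f(\hat{\beta}_R)$,
we have
$$\begin{aligned}
\frac{\partial f}{\partial \hat{\beta}_R} = 2\hat{\beta}_R + \frac{1}{k}\,2X'X(\hat{\beta}_R - \hat{\beta}) &= 0 \\
\Rightarrow (X'X + kI)\hat{\beta}_R = X'X\hat{\beta} &= X'y \\
\Rightarrow \hat{\beta}_R &= (X'X + kI)^{-1}X'y .
\end{aligned}$$
The ridge estimator $\hat{\beta}_R$ is the vector of coefficients of shortest length
among those that have the constant inflated RSS ($\phi_0$), and $\phi_{\min}$ is the
RSS of the OLS model.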
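The decomposition in Eq. (4.9) and the first-order condition of the Lagrangian can be verified numerically. The sketch below uses simulated standardized data; the data, the value k = 0.05 and all object names are assumptions made for this illustration.

```r
set.seed(7)
n  <- 40
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
Xr <- cbind(x1, x2)
Xc <- sweep(Xr, 2, colMeans(Xr))
X  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
y  <- drop(X %*% c(2, 1)) + rnorm(n, sd = 0.2)
y  <- y - mean(y)

k     <- 0.05
XtX   <- crossprod(X)
b_ols <- drop(solve(XtX, crossprod(X, y)))
b_r   <- drop(solve(XtX + k * diag(2), crossprod(X, y)))

rss     <- function(b) sum((y - X %*% b)^2)
phi_min <- rss(b_ols)                                        # RSS of the OLS fit
phi_0   <- drop(t(b_r - b_ols) %*% XtX %*% (b_r - b_ols))    # inflation in the RSS
print(c(rss_ridge = rss(b_r), phi_min_plus_phi_0 = phi_min + phi_0))

# First-order condition: (X'X + kI) b_r - X'y should be numerically zero
foc <- (XtX + k * diag(2)) %*% b_r - crossprod(X, y)
print(max(abs(foc)))
```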
4.3 Reparameterization of the model
Consider the linear model given in Eq. (2.1), and assume that the data are in
standardized form as described by Belsley (1991) and Draper and Smith (1998).
4.3.1 Singular Value Decomposition (SVD)
Usually, matrix inversion is avoided when computing $\hat{\beta}_R$, as inverting
$X'X$ can be computationally expensive and, for collinear data, the inversion may be
impossible. Instead, a popular parameterization of RR based on the SVD of $X$ is
used. Although the SVD is the most computationally expensive approach, it is
numerically the most stable (Seber and Lee, 2003), and it gives insight into the
nature of RR.
The $n \times p$ design matrix $X$ can be reparameterized using the SVD as
$$X = UDV',$$
where $U_{n \times p} = (u_1, u_2, \cdots, u_p)$ is a matrix with orthonormal columns
that form a basis for the space spanned by the columns of $X$, $D_{p \times p} =
diag(d_1, d_2, \cdots, d_p)$ is a diagonal matrix of the singular values $d_1 \ge d_2
\ge \cdots \ge d_p \ge 0$ (note that if one or more of the $d_j = 0$, then $X$ is
singular), and $V_{p \times p} = (v_1, v_2, \cdots, v_p)$ is an orthogonal matrix whose
columns form a basis for the space spanned by the rows of $X$.
Therefore, we can write
$$\begin{aligned}
\hat{\beta}_R &= (X'X + kI_p)^{-1}X'y, \\
&= (X'X + kI_p)^{-1}X'X\hat{\beta}, \\
&= [I_p + k(X'X)^{-1}]^{-1}\hat{\beta}, \\
&= V\,diag\left(\frac{d_j}{d_j^2 + k}\right)U'y .
\end{aligned}$$
The eigen (or spectral) decomposition of $X'X$ is
$$\begin{aligned}
X'X &= (UDV')'(UDV'), \\
&= VD'U'UDV', \\
&= VD'DV', \\
&= VD^2V'.
\end{aligned}$$
In RR, the coordinates of $y$ with respect to the orthonormal basis $U$ are computed,
and these coordinates are shrunk by the factor $\frac{d_j^2}{d_j^2 + k}$. Therefore, a
greater amount of shrinkage occurs when $k$ is large and the $d_j$'s are small.
$$\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = UU'y,$$
$$\hat{y}_R = X\hat{\beta}_R = X(X'X + kI_p)^{-1}X'y = H_{Rk}\,y = UD(D^2 + kI_p)^{-1}DU'y, \tag{4.10}$$
where $H_{Rk}$ is the hat matrix (smoother matrix). Notice that for $k \ge 0$,
$\frac{d_j^2}{d_j^2 + k} \le 1$.
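The SVD form of the ridge estimator and the shrinkage factors can be illustrated with base R's svd(). The simulated data and the value k = 0.1 are assumptions made for this sketch, which compares the SVD-based solution with the direct solution of Eq. (4.5).

```r
set.seed(8)
n  <- 40
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
Xr <- cbind(x1, x2, x3)
Xc <- sweep(Xr, 2, colMeans(Xr))
X  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
y  <- drop(X %*% c(1, 1, -1)) + rnorm(n, sd = 0.1)
y  <- y - mean(y)
k  <- 0.1

s <- svd(X)                                    # X = U D V'
Uty      <- crossprod(s$u, y)                  # coordinates of y in the basis U
b_svd    <- drop(s$v %*% ((s$d / (s$d^2 + k)) * Uty))          # V diag(d/(d^2+k)) U'y
b_direct <- drop(solve(crossprod(X) + k * diag(3), crossprod(X, y)))

shrink  <- s$d^2 / (s$d^2 + k)                 # shrinkage factor of each component
fit_svd <- drop(s$u %*% (shrink * Uty))        # U D (D^2 + kI)^{-1} D U'y

print(round(rbind(svd = b_svd, direct = b_direct), 4))
print(round(shrink, 4))   # smallest singular values are shrunk the most
```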
4.3.2 Eigenvalues and the RR
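Let $\Lambda_{p \times p}$ be the diagonal matrix of eigenvalues of $X'X$. Following
Hilt et al. (1977) and Hemmerle and Carey (1983), there always exists an orthogonal
transformation matrix $P$ of normalized eigenvectors associated with $\Lambda$, such
that $\Lambda = P'X'XP$. Then Eq. (4.10) can be written in the canonical linear model
or uncorrelated component model (orthogonal form) as
$$\begin{aligned}
y &= XPP'\beta + \varepsilon, \\
&= X_*\alpha + \varepsilon,
\end{aligned}$$
where $X_* = XP$, $\alpha = P'\beta$, $P'P = I = PP'$ and $\Lambda = X_*'X_*$.
The OLSE of $\alpha$ in canonical form is
$$\hat{\alpha} = (P'X'XP)^{-1}X_*'y = \Lambda^{-1}X_*'y = P'\hat{\beta}, \tag{4.11}$$
$$\Rightarrow \hat{\beta} = P\hat{\alpha}.$$
The $\hat{\alpha}_j$ are called uncorrelated components (Pagel and Lunneborg, 1985),
and
$$Var(\hat{\alpha}) = \sigma^2\Lambda^{-1} = \sigma^2(X_*'X_*)^{-1} = P'\,Var(\hat{\beta})\,P. \tag{4.12}$$
For an improved estimate of $\alpha$ (say $\hat{\alpha}_R$), Hoerl and Kennard (1970b)
showed that the MSE of $\hat{\alpha}_{Rk}$ is minimized by taking
$k_j = \frac{\sigma^2}{\alpha_j^2}$, where
\begin{align}
\hat{\alpha}_{Rk} &= (\Lambda + kI_p)^{-1}X_*'y, \tag{4.13} \\
&= (I + k\Lambda^{-1})^{-1}\hat{\alpha}, \tag{4.14} \\
\hat{\beta}_{Rk} &= P\hat{\alpha}_{Rk}, \tag{4.15} \\
Var(\hat{\alpha}_{Rk}) &= \sigma^2(\Lambda + kI_p)^{-1}\Lambda(\Lambda + kI_p)^{-1}. \tag{4.16}
\end{align}
Since $Var(\hat{\alpha}_{Rjk}) = \sigma^2\frac{\lambda_j}{(\lambda_j + k_j)^2}$, the ridge
estimator has a smaller variance than the OLSE for $k_j > 0$. However,
$E(\hat{\alpha}_{Rjk}) = \alpha_j\frac{\lambda_j}{\lambda_j + k_j}$, so
$\hat{\alpha}_{Rjk}$ is biased for $k_j$ not equal to zero.
From Eq. (4.13) to (4.16), the relationship between $k$ and $\hat{\beta}_R$ is that the
bias of $\hat{\beta}_R$ increases as $k$ increases, while the variance decreases.
Therefore, for a suitable known value of $k$, $\hat{\beta}_{Rk}$ has a smaller MSE
than $\hat{\beta}_{ls}$. The true value of $k$ depends on the unknown parameters
$\alpha$ and $\sigma^2$.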
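The canonical form can be sketched with base R's eigen(). In the example below the simulated data, the value k = 0.05 and all object names are assumptions; the code computes the canonical OLS and ridge estimates of Eq. (4.11) and (4.13), maps the ridge estimate back through Eq. (4.15), and compares the variance factors implied by Eq. (4.12) and (4.16).

```r
set.seed(9)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
Xr <- cbind(x1, x2, x3)
Xc <- sweep(Xr, 2, colMeans(Xr))
X  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
y  <- drop(X %*% c(2, -1, 1)) + rnorm(n, sd = 0.1)
y  <- y - mean(y)
k  <- 0.05

e      <- eigen(crossprod(X), symmetric = TRUE)
P      <- e$vectors                  # orthogonal matrix of normalized eigenvectors
lambda <- e$values                   # diagonal elements of Lambda
Xstar  <- X %*% P                    # canonical regressors X* = XP

Xsty        <- drop(crossprod(Xstar, y))
alpha_ols   <- Xsty / lambda                 # Eq. (4.11)
alpha_ridge <- Xsty / (lambda + k)           # Eq. (4.13)
beta_ridge  <- drop(P %*% alpha_ridge)       # Eq. (4.15)

var_factor_ols   <- 1 / lambda               # Var(alpha_ols_j) / sigma^2
var_factor_ridge <- lambda / (lambda + k)^2  # Var(alpha_ridge_j) / sigma^2
print(round(cbind(lambda, alpha_ols, alpha_ridge,
                  var_factor_ols, var_factor_ridge), 4))
```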
4.4 The RR Methods
There are three different RR methods that can be used to reduce the effect of the
collinearity problem in a data set. These methods are (i) Ordinary Ridge Regression
(ORR), (ii) Generalized Ridge Regression (GRR) and (iii) Directed Ridge Regression
(DRR). All of these RR methods have been reported to perform better than the OLS
method (El-Dereny and Rashwan, 2011). A short description of each method is given
below.
4.4.1 Ordinary Ridge Regression (ORR)
If the values of all the $k$'s are the same ($k_1 = k_2 = \cdots = k_p$), in other
words $k$ is fixed ($k$ is a scalar), the resulting estimators are called the ORREs
(John, 1998),
$$\hat{\beta}_{Rk} = (X'X + kI_p)^{-1}X'y; \quad k \ge 0.$$
Note that:
• There are several biasing parameter values ($k$'s) for which the ridge estimator
has a smaller MSE than the OLSE.
• The ridge MSE has two components, (i) the total variance and (ii) the total
squared bias:
$$MSE_R = \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2} + \sum_{j=1}^{p}\frac{k^2\alpha_j^2}{(\lambda_j + k)^2}, \tag{4.17}$$
where $\lambda_j$ are the eigenvalues of $X'X$. The first term on the RHS of Eq.
(4.17) shows the effect of $k$ on the total variance of the ridge estimates of the
regression coefficients; for $k = 0$, Total Variance $= MSE(\hat{\alpha}_{ols}) =
\sigma^2\sum_{j=1}^{p}\frac{1}{\lambda_j}$, i.e. the MSE of the OLS estimator.
• The total variance is a decreasing function of $k$.
• The total bias is an increasing function of $k$ and is bounded above by $\beta'\beta$.
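The behaviour of the two components of Eq. (4.17) can be tabulated for a grid of $k$ values. In the sketch below the eigenvalues, the canonical coefficients and the error variance are hypothetical inputs chosen only for illustration; in practice $\alpha$ and $\sigma^2$ are unknown.

```r
# Hypothetical inputs (assumptions for illustration only)
lambda <- c(2.2, 0.7, 0.1)    # eigenvalues of X'X (they sum to p = 3)
alpha  <- c(1.0, 0.5, 0.8)    # canonical coefficients
sigma2 <- 0.25                # error variance

mse_ridge <- function(k) {
  variance <- sigma2 * sum(lambda / (lambda + k)^2)   # total variance
  bias2    <- sum(k^2 * alpha^2 / (lambda + k)^2)     # total squared bias
  c(k = k, variance = variance, bias2 = bias2, mse = variance + bias2)
}

k_grid <- seq(0, 1, by = 0.05)
tab    <- t(sapply(k_grid, mse_ridge))
print(round(tab, 4))

k_best <- tab[which.min(tab[, "mse"]), "k"]   # grid value with the smallest MSE
print(k_best)
```

The first row (k = 0) gives the MSE of the OLS estimator, so the table shows directly for which grid values the ridge estimator has the smaller MSE under these assumed inputs.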
4.4.2 Generalized Ridge Regression (GRR)
The GRR estimators can be given as follows:
$$\begin{aligned}
\hat{\alpha}_{Rk} &= (X_*'X_* + k)^{-1}X_*'y, \\
&= (\Lambda + k)^{-1}X_*'y, \\
&= (I + k\Lambda^{-1})^{-1}\hat{\alpha},
\end{aligned}$$
where $k = diag(k_1, k_2, \cdots, k_p)$, $k_j > 0$, and the OLSE of $\alpha$ and
$Var(\hat{\alpha})$ are as in Eq. (4.11) and (4.12), respectively:
$$\hat{\alpha} = (X_*'X_*)^{-1}X_*'y = \Lambda^{-1}X_*'y, \qquad \text{and} \qquad V(\hat{\alpha}) = \sigma^2(X_*'X_*)^{-1} = \sigma^2\Lambda^{-1}.$$
It follows from Hoerl and Kennard (1970b) that the value of $k_j$ which minimizes
$$MSE(\hat{\beta}_{Rk}) = \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k_j)^2} + \sum_{j=1}^{p}\frac{k_j^2\alpha_j^2}{(\lambda_j + k_j)^2}$$
is $k_j = \frac{\sigma^2}{\alpha_j^2}$, where $\sigma^2$ represents the error variance of
the model shown in equation (4.1) and $\alpha_j$ is the $j$th element of $\alpha$.
Hocking et al. (1976) showed that, for known optimal $k_j$, the GRR estimator is
theoretically superior to all the other classes of biased estimators they considered;
empirically, however, the opposite may hold. Note that the optimal value of $k_j$
depends on the unknown parameters $\sigma^2$ and $\alpha_j$ and must be estimated
from the data.
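Hoerl and Kennard (1970b) suggested replacing $\sigma^2$ and $\alpha_j^2$ by their
corresponding unbiased estimators, i.e. $\hat{k}_j = \frac{\hat{\sigma}^2}{\hat{\alpha}_j^2}$,
where $\hat{\sigma}^2 = \frac{\sum\hat{\varepsilon}_i^2}{n-p}$ is the residual mean square
(RMS) estimate, which is an unbiased estimator of $\sigma^2$, and $\hat{\alpha}_j$ is
the $j$th element of $\hat{\alpha}$ (an unbiased estimator of $\alpha$).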
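A minimal sketch of the GRR estimator with the estimated biasing parameters $\hat{k}_j = \hat{\sigma}^2/\hat{\alpha}_j^2$ is given below. The simulated data and object names are assumptions, and the residual mean square uses the divisor $n - p$ as written in the text.

```r
set.seed(10)
n  <- 60
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
Xr <- cbind(x1, x2, x3)
p  <- ncol(Xr)
Xc <- sweep(Xr, 2, colMeans(Xr))
X  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")
y  <- drop(X %*% c(3, 1, -2)) + rnorm(n, sd = 0.3)
y  <- y - mean(y)

e      <- eigen(crossprod(X), symmetric = TRUE)
P      <- e$vectors
lambda <- e$values
Xstar  <- X %*% P                               # canonical regressors

Xsty      <- drop(crossprod(Xstar, y))
alpha_ols <- Xsty / lambda                      # OLS estimate in canonical form
resid     <- y - drop(Xstar %*% alpha_ols)
sigma2hat <- sum(resid^2) / (n - p)             # residual mean square (divisor as in the text)

k_j       <- sigma2hat / alpha_ols^2            # one biasing parameter per component
alpha_grr <- Xsty / (lambda + k_j)              # (Lambda + k)^{-1} X*'y with k = diag(k_j)
beta_grr  <- drop(P %*% alpha_grr)              # back to the standardized regressors
print(round(cbind(lambda, k_j, alpha_ols, alpha_grr), 4))
```

Components with a small estimated $\hat{\alpha}_j$ relative to $\hat{\sigma}^2$ receive a large $\hat{k}_j$ and are therefore shrunk the most.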
4.4.3 Directed Ridge Regression (DRR)
Guilkey and Murphy (1975) proposed a method of estimation (an improvement of
Hoerl's iterative procedure) based on the relationship between the eigenvalues of
$X'X$ and the variance of $\hat{\alpha}_j$. Since $V(\hat{\alpha}) = \sigma^2\Lambda^{-1}$,
relatively precise estimation is achieved for the $\hat{\alpha}_j$ corresponding to
large eigenvalues, whereas relatively imprecise estimation is achieved for the
$\hat{\alpha}_j$ corresponding to small eigenvalues. The DRR estimator (DRRE) results
in an estimate of $\alpha_j$ that is less biased than that of the GRR estimator, since
it adjusts only those elements of $\Lambda$ corresponding to the small eigenvalues of
$X'X$.
The steps for computing the DRREs are summarized as follows:
1. Find the MSE from the OLS model, $\hat{\sigma}^2$, and $\hat{\alpha} = \Lambda^{-1}X_*'y$.
2. Find the eigenvalues $\lambda_j$ and eigenvectors $P$ of $X'X$, and $X_* = XP$.
3. Find $k_j(0) = \frac{\hat{\sigma}^2}{\hat{\alpha}_j^2}$; note that $k_j$ is added only
to those diagonal elements of $\Lambda$ that correspond to small eigenvalues,
$\lambda_j < 10^{-c}\lambda_{\max}$, where $c$ is an arbitrary constant.
4. Compute the DRRE
$$\hat{\alpha}^{(0)}_{kDRR} = (\Lambda + kI)^{-1}X_*'y.$$
5. Re-estimate
$$k_j(1) = \frac{\hat{\sigma}^2}{\left(\hat{\alpha}^{(0)}_{j,kDRR}\right)^2}.$$
6. Optimal values of $k_j$ are obtained by repeating Steps 4 and 5 until the
difference between the squared lengths of successive estimates,
$\hat{\alpha}^{(j)}_{kDRR}$ and $\hat{\alpha}^{(j+1)}_{kDRR}$, is very small. For $m$
iterations,
$$\hat{\alpha}^{(m)}_{kDRR} = (\Lambda + kI)^{-1}X_*'y,$$
where $k$ is the diagonal matrix with elements $k_1(m-1), k_2(m-1), \cdots,
k_p(m-1)$. Note that the estimates for the original regression are obtained as
$$\hat{\alpha}_{kORR} = P\hat{\alpha}^{(m)}_{kDRR}.$$
4.5 Alternative Way of Understanding RR
For linear regression, we don't expect that an estimator have too large coecients
(βs). Therefore, the value of β can be penalized. Given a response variable yεRn
and design matrix XεRn×p, in the LS estimation
n∑i=1
(yi −Xiβ)2 or minβ
n∑i=1
(yi −Xiβ)2,
is minimized. To penalize the value of βs, we can estimate βRs by minimizing
βRk = argminβ εRp
n∑i=1
(yi − x′iβ)2 + k
p∑j=1
β2j ,
= ‖y −Xβ‖22β εRp
+ k‖β‖22 .
The solution of β is βRk = (X ′X + kIp)−1X ′y.
The shrinkage parameter $k$ controls the strength of the penalty. The larger
the $k$, the stronger the penalty on $\beta$, and the solution $\hat\beta_{R_k}$ will be smaller in size
than the LS coefficients.
$\hat\beta_R$ is a biased estimator but reduces the variance of the estimate, and it is a linear
transformation of the OLS estimator, while the sum of squared residuals (SSR) is an increasing
function of $k$. There always exists a $k > 0$ such that $\hat\beta_R$ has smaller MSE than $\hat\beta_{ols}$,
i.e. $MSE(\hat\beta_R) < MSE(\hat\beta_{OLS})$ (Gruber, 1998; Judge, 1998).
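The closed-form solution and two of the statements above (shrinkage of the coefficient vector and growth of the SSR with $k$) can be checked numerically. The following base R sketch uses simulated data in correlation form; the helper names `ridge_coef` and `ssr` and the value $k = 0.1$ are assumptions chosen for illustration only.

```r
# Simulated regressors, centered and scaled to unit length (correlation form).
set.seed(1)
n <- 50
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n, sd = 0.1), x2 = z + rnorm(n, sd = 0.1), x3 = rnorm(n))
X <- scale(X, center = TRUE, scale = FALSE)
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)                         # centered response, so no intercept

# Ridge solution (X'X + k I_p)^{-1} X'y; k = 0 gives the OLS solution.
ridge_coef <- function(X, y, k) {
  drop(solve(crossprod(X) + k * diag(ncol(X)), crossprod(X, y)))
}
# Sum of squared residuals for a given coefficient vector.
ssr <- function(b) sum((y - X %*% b)^2)

b_ols <- ridge_coef(X, y, k = 0)
b_r   <- ridge_coef(X, y, k = 0.1)

c(length_ols = sum(b_ols^2), length_ridge = sum(b_r^2))
c(ssr_ols = ssr(b_ols), ssr_ridge = ssr(b_r))
```

The ridge vector is shorter than the OLS vector, and its SSR is at least as large as the OLS value, as the text states.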
4.6 Properties of Ridge Estimators
Let $X_j$ denote the $j$th column of $X$ ($j = 1, 2, \cdots, p$), where $X_j = (x_{1j}, x_{2j}, \cdots, x_{nj})'$.
As already discussed, assume that the regressors are centered and normalized, such
that $\sum_{i=1}^{n} x_{ij} = 0$ and $\sum_{i=1}^{n} x_{ij}^2 = 1$. Centering and normalizing the regressors allows one
to interpret all effects in a comparable manner. Assume that the response is also
centered. In this case, the intercept is zero and can therefore be removed from the
model.
RR is the most popular among the biased methods, because its relationship to the
OLS method and its statistical properties are well defined. Most of the RR
properties have been discussed, proved and extended by many researchers
(Allen, 1974; Hemmerle, 1975; Hoerl and Kennard, 1970a,b; Marquardt, 1970;
McDonald and Galarneau, 1975; Newhouse and Oman, 1971).
4.6.1 Mean of βR
The expected value of $\hat\beta_R$ is:
$$\begin{aligned}
E(\hat\beta_R) &= E\left[(X'X + kI_p)^{-1}X'(X\beta + \varepsilon)\right] , \\
 &= (X'X + kI_p)^{-1}X'X\beta , \\
 &= [I + k(X'X)^{-1}]^{-1}\beta .
\end{aligned}$$
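4.6.2 Shorter Regression Coefficients
RR makes the assumption that the RR coefficients after normalization are not
likely to be very large, i.e. ridge regression gives shorter regression coefficients (for
$k > 0$) than OLS regression, that is, $\hat\beta_R'\hat\beta_R \le \hat\beta'\hat\beta$.
$$\hat\beta'\hat\beta = y'X(X'X)^{-2}X'y = y'X^{*}\Lambda^{-2}X^{*\prime}y = \sum_{j=1}^{p}\left[\frac{(X^{*\prime}y)_j^2}{\lambda_j^2}\right], \tag{4.18}$$
$$\hat\beta_R'\hat\beta_R = y'X(X'X + kI)^{-2}X'y = y'X^{*}(\Lambda + kI)^{-2}X^{*\prime}y = \sum_{j=1}^{p}\left(\frac{(X^{*\prime}y)_j^2}{(\lambda_j + k)^2}\right); \quad k \ge 0 . \tag{4.19}$$
From Eq. (4.18) and Eq. (4.19), it is obvious that $\hat\beta_R'\hat\beta_R \le \hat\beta'\hat\beta$.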
4.6.3 Linear Transformation
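$\hat\beta_R$ is a linear transformation of the OLSE, that is,
$$\hat\beta_R = Z\hat\beta ,$$
where
$$\begin{aligned}
Z &= (X'X + kI)^{-1}X'X , \\
 &= P\,\operatorname{diag}\left(\frac{\lambda_j}{\lambda_j + k}\right)P' , \\
 &= P\begin{pmatrix}
 \frac{\lambda_1}{\lambda_1 + k} & & & \\
 & \frac{\lambda_2}{\lambda_2 + k} & & \\
 & & \ddots & \\
 & & & \frac{\lambda_p}{\lambda_p + k}
 \end{pmatrix}P' .
\end{aligned}$$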
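The two expressions for $Z$ and the relation $\hat\beta_R = Z\hat\beta$ can be verified numerically. The base R sketch below uses simulated data; the object names and the value $k = 0.5$ are assumptions made for illustration.

```r
# Simulated, centered data (illustrative only).
set.seed(2)
n <- 30
z <- rnorm(n)
X <- scale(cbind(z + rnorm(n, sd = 0.2), z + rnorm(n, sd = 0.2), rnorm(n)),
           center = TRUE, scale = FALSE)
y <- drop(X %*% c(1, -1, 2) + rnorm(n))
y <- y - mean(y)

k <- 0.5
p <- ncol(X)
XtX <- crossprod(X)

b_ols <- drop(solve(XtX, crossprod(X, y)))                # OLS estimate
b_r   <- drop(solve(XtX + k * diag(p), crossprod(X, y)))  # ridge estimate

# Z computed directly, and from the eigen-decomposition X'X = P Lambda P'.
Z_direct <- solve(XtX + k * diag(p), XtX)
eig <- eigen(XtX, symmetric = TRUE)
Z_eigen <- eig$vectors %*% diag(eig$values / (eig$values + k)) %*% t(eig$vectors)

max(abs(Z_direct - Z_eigen))           # numerically zero
max(abs(drop(Z_direct %*% b_ols) - b_r))  # numerically zero
```

Each diagonal factor $\lambda_j/(\lambda_j + k)$ lies in $(0, 1]$ for $k \ge 0$, which is why the transformation shrinks the OLS estimate.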
4.6.4 Variance Covariance Matrix of βR
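Since the estimator $\hat\beta_R$ is a linear transformation of $\hat\beta_{ols}$ (see Section 4.6.3), and if $Var(\varepsilon) = \sigma^2 I_n$,
then the variance-covariance matrix of $\hat\beta_R$ is
$$\begin{aligned}
Cov(\hat\beta_R) &= Cov(Z\hat\beta) , \\
 &= Z\,Cov(\hat\beta)\,Z' , \\
 &= \sigma^2(X'X + kI_p)^{-1}X'X(X'X + kI_p)^{-1} , \qquad (4.20) \\
 &= \sigma^2[VIF] ,
\end{aligned}$$
where $\hat\sigma^2 = \frac{1}{v}(y - X\hat\beta_R)'(y - X\hat\beta_R)$ and $v = n - p$ (see Halawa and El-Bassiouni, 2000).
However, Cule and De Iorio (2012) suggested using the residual effective degrees of
freedom given by Hastie and Tibshirani (1990), that is, $v = n - tr(2H - HH')$, which
reduces to $n - p$ when $k = 0$. In matrix form, $VIF = (X'X + kI)^{-1}X'X(X'X + kI)^{-1}$,
and in terms of the eigenvalues of the correlation matrix $X'X$,
$$VIF = \operatorname{diag}\left(P\,\operatorname{diag}\left(\frac{\lambda_j}{(\lambda_j + k)^2}\right)P'\right), \quad \text{with} \quad
P\,\operatorname{diag}\left(\frac{\lambda_j}{(\lambda_j + k)^2}\right)P' = P(\Lambda + kI_p)^{-1}\Lambda(\Lambda + kI_p)^{-1}P' .$$
A suitably chosen value of $k$ may substantially reduce the estimated variance and
hence increase overall estimation accuracy.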
4.6.5 Smaller Variance
RR produces smaller variances of the coefficients relative to OLS regression;
however, it does not necessarily reduce the covariance or correlation between the
ridge coefficients, as $Cov(\hat\beta_{R_k}) = \sigma^2\,VIF$.
For $k > 0$, the variance of the RR coefficients is reduced, that is,
$$V(\hat\beta_R) = \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2} . \tag{4.21}$$
The reduction in variance can also be observed from the VIF values, that is,
$$VIF_{jj} = \sum_{i=1}^{p}\frac{\lambda_i}{(\lambda_i + k)^2}P_{ji}^2 ; \quad \forall\; j = 1, 2, \cdots, p . \tag{4.22}$$
However, the covariance can be inflated or deflated, depending on the negative or positive sign of
the product $P_{ji}P_{li}$, as can be seen from the expression
$$VIF_{jl} = \sum_{i=1}^{p}\frac{\lambda_i}{(\lambda_i + k)^2}P_{ji}P_{li} ; \quad \text{for } j \ne l, \text{ and } \forall\; j, l = 1, 2, \cdots, p .$$
4.6.6 Bias of βR
The least squares estimators have no bias (they are BLUE) but have larger variance than
the RRE. The sampling variance of $\hat\beta_R$ decreases monotonically as the value of $k$ increases
(see Eq. (4.21) and (4.22)), while the bias of the ridge estimator increases with increasing
$k$ (Judge et al., 1985; Vinod and Ullah, 1981). The ridge estimator is negatively
biased, with the amount of bias given below:
$$\begin{aligned}
Bias(\hat\beta_R) &= E(\hat\beta_R) - \beta , \\
 &= -k(X'X + kI)^{-1}\beta , \\
 &= -kP\,\operatorname{diag}\left(\frac{1}{\lambda_j + k}\right)P'\beta .
\end{aligned}$$
It can be seen that the bias produced by RR is negative and that it is a function of
the unknown population regression coefficient $\beta$.
4.6.7 The MSE of βR
The total MSE of the ridge estimator is as follows:
$$\begin{aligned}
MSE &= E\|\hat\beta_R - \beta\|^2 , \\
 &= \sum_{j=1}^{p}\left[Var(\hat\beta_{jR}) + Bias^2(\hat\beta_{jR})\right] , \\
 &= \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2} + \sum_{j=1}^{p}\frac{k^2\alpha_j^2}{(\lambda_j + k)^2} .
\end{aligned}$$
All efforts are made to find a value of $k$ such that $MSE(\hat\beta_R) < MSE(\hat\beta_{ols})$. However,
$MSE(\hat\beta_R)$ depends on $k$ and on the unknown parameters $\beta$ and $\sigma^2$ (Vinod and Ullah, 1981),
so it cannot be calculated in practice, and $k$ has to be estimated from the real data.
The first part of the MSE is the total variance of the ridge estimates, which is a monotone
decreasing function of $k$. Since $k$ and the $\lambda_j$'s are positive quantities, the following relation
can be used to find the total variance of the estimates using the Var-Cov matrix or the
eigenvalues:
$$\sum Var(\hat\beta_R) = tr[Cov(\hat\beta_R)] = \sigma^2\sum_{j=1}^{p}\left(\frac{\lambda_j}{(\lambda_j + k)^2}\right) .$$
The total variance (which describes the random portion of the error) ranges from
$\sigma^2\sum_{j=1}^{p}\frac{1}{\lambda_j}$ to 0 as $k$ varies from 0 to $\infty$, meaning that the total variance reduces to zero when
$k \to \infty$.
The second term of the MSE is the total squared bias (which describes the systematic portion
of the error) of the ridge estimates. It is a monotone increasing function of $k$ with a range of
$\left(0, \sum_{j=1}^{p}\beta_j^2\right)$ for values of $k$ from 0 to $\infty$. The total squared bias also depends upon the
unknown population regression coefficient vector $\beta$, and it is the squared distance of
$Z\beta$ to $\beta$, i.e., the square of the bias introduced when $\hat\beta_R$ is used instead of $\hat\beta$.
$$\begin{aligned}
Bias^2(\hat\beta_R) &= k^2\beta'(X'X + kI_p)^{-2}\beta , \\
 &= k^2\alpha'\operatorname{diag}\left[\frac{1}{(\lambda_j + k)^2}\right]\alpha , \\
 &= k^2\sum_{j=1}^{p}\frac{\alpha_j^2}{(\lambda_j + k)^2} .
\end{aligned}$$
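The two components of the MSE can be tabulated over a grid of $k$ to see the trade-off described above. In the base R sketch below the eigenvalues, the canonical coefficients $\alpha_j$ and $\sigma^2$ are made-up numbers chosen only for illustration, and `ridge_mse` is a hypothetical helper name.

```r
# Total variance, total squared bias and MSE of the ridge estimator for a given k.
ridge_mse <- function(k, lambda, alpha, sigma2) {
  variance <- sigma2 * sum(lambda / (lambda + k)^2)
  bias2    <- k^2 * sum(alpha^2 / (lambda + k)^2)
  c(variance = variance, bias2 = bias2, mse = variance + bias2)
}

# Assumed (illustrative) values: one small eigenvalue signals collinearity.
lambda <- c(2.5, 0.45, 0.05)
alpha  <- c(1, 0.5, 0.2)
sigma2 <- 1

k_grid <- c(0, 0.01, 0.05, 0.1, 0.5, 1)
tab <- t(sapply(k_grid, ridge_mse, lambda = lambda, alpha = alpha, sigma2 = sigma2))
rownames(tab) <- paste0("K=", k_grid)
tab
```

At $k = 0$ the squared bias is zero and the variance equals $\sigma^2\sum 1/\lambda_j$. As $k$ grows the variance column falls and the bias column rises, and for these particular values every positive $k$ on the grid gives a smaller MSE than $k = 0$.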
4.6.8 Minimum Distance between βR and β
The distance between $\hat\beta_R$ and the true vector $\beta$ is minimum, in the sense that
the ridge estimator $\hat\beta_R$ is better than the OLSE. This distance is small because RR
has a smaller MSE than LS regression.
4.6.9 Inflated RSS
From Eq. (4.4), the larger the value of the penalty (biasing parameter $k$), the larger the
term $k\sum_{j=1}^{p}\beta_j^2$, and hence the larger the increment to the RSS, because $k$ is the weight on the
penalty $\sum_{j=1}^{p}\beta_j^2$ (Berk, 2008). All estimators based on the criterion of minimum MSE
will inflate the RSS (cf. Lee, 1979). From Eq. (4.9),
$$\begin{aligned}
\phi_0 &= k^2\hat\beta_R'(X'X)^{-1}\hat\beta_R , \\
 &= k^2\hat\alpha_R'\Lambda^{-1}\hat\alpha_R , \\
 &= k^2\sum_{j=1}^{p}\frac{\hat\alpha_{Rj}^2}{\lambda_j} ,
\end{aligned}$$
where $\hat\alpha_R = P'\hat\beta_R$ is the vector of ridge coefficients in the factor space defined by the
orthogonal transformation of $X$, as described in Section 4.3. Note that, at a certain
value of $k$, the RSS will not have been inflated to an unreasonable value (Hoerl and
Kennard, 1970b).
4.6.10 Smaller R2
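The multiple $R^2$ for RR ($R^2_R$) is smaller than that of OLS ($R^2_R \le R^2_{ols}$) and can
be expressed as
$$\begin{aligned}
R^2_R &= \frac{\text{Regression SS}}{\text{Total SS}} = \frac{\hat{y}'\hat{y}}{y'y} , \\
 &= \frac{\hat\beta_R'X'X\hat\beta_R}{y'y} , \\
 &= \frac{\hat\beta_R'X'y - k\hat\beta_R'\hat\beta_R}{y'y} .
\end{aligned}$$
In terms of the eigenvalues of $X'X$, the numerator can be written as
$$\hat\beta_R'X'X\hat\beta_R = \hat\alpha_R'\Lambda\hat\alpha_R = \sum_{j=1}^{p}\lambda_j\hat\alpha_{Rj}^2 = \sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2}(X^{*\prime}y)_j^2 .$$
$R^2_R$ is a monotone decreasing function of $k$, such that if $R^2_R = 0$ then $VIF = 1$
and if $R^2_R = 1$ then $VIF = \infty$. Therefore, the ridge estimate may not give the best
fit to the data.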
4.6.11 Sensitivity to sampling Fluctuations
The $\hat\beta_R$ estimates are less sensitive to sampling fluctuation (variation) than the
OLS estimates, because RR gives a smaller sampling variance than OLS (see
Section 4.6.5).
4.6.12 More Accurate Prediction
If the bias is not too large, the RR procedure produces more accurate predictions
than OLS. The unbiased forecasting error variance is
$$\sigma_f^2 = \sigma^2[1 + x'Vx] = \sigma^2\left[1 + \sum_{j=1}^{p}\frac{x_j^{*2}}{\lambda_j}\right], \tag{4.23}$$
where $\sigma^2 V$ is the Var-Cov matrix of the estimator.
For the ridge estimate, a squared bias is added to the forecasting error variance, i.e.
$$\sigma_{fR}^2 = \sigma_f^2 + Bias^2(\hat\beta_R) , \tag{4.24}$$
$$= \sigma^2\left[1 + x'P\,\operatorname{diag}\left(\frac{\lambda_j}{(\lambda_j + k)^2}\right)P'x\right] + Bias^2(\hat\beta_R) , \tag{4.25}$$
$$= \sigma^2\left[1 + \sum_{j=1}^{p}\frac{\lambda_j x_j^{*2}}{(\lambda_j + k)^2}\right] + Bias^2(\hat\beta_R) . \tag{4.26}$$
From Eq. (4.23) and Eqs. (4.24) to (4.26), the forecasting error variance for the OLS estimate
is much larger than that for the ridge estimate, provided the bias produced by RR is not large
relative to the reduction in variance. The forecasting error variance consists
of (i) random error and (ii) systematic error. Note that when the bias is relatively large,
more accurate prediction can be obtained by dividing the sample (which should
be large) into two sets: one set is used to estimate the biased parameter
and the other is used to estimate the bias in the prediction of the dependent
variable.
4.6.13 Wide range of Biasing Parameter
There exists a wide range of values of the biasing parameter $k$, $0 < k < k_{max}$, that give a smaller
MSE than the OLS estimates.
The effectiveness index (EF) of RR is the ratio of the reduction in total variance to the
total squared bias of RR, i.e.
$$\begin{aligned}
EF &= \frac{\text{Reduction in total variance}}{Bias^2(\hat\beta_R)} , \\
 &= \frac{\sigma^2 tr(X'X)^{-1} - \sigma^2 tr(VIF)}{Bias^2(\hat\beta_R)} , \\
 &= \frac{\sigma^2\left[\sum_{j=1}^{p}\frac{1}{\lambda_j} - \sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2}\right]}{k^2\sum_{j=1}^{p}\frac{\alpha_j^2}{(\lambda_j + k)^2}} .
\end{aligned}$$
The EF is a decreasing function of $k$ because the total variance of RR is a monotone
decreasing function and $Bias^2(\hat\beta_R)$ is an increasing function of $k$. As $k$ varies from
zero to infinity, the EF of RR ranges from infinity to zero. Further,
$$\begin{aligned}
MSE(\hat\beta) - MSE(\hat\beta_R) &= \sum Var(\hat\beta) - \sum Var(\hat\beta_R) - Bias^2(\hat\beta_R) , \\
 &= EF \times Bias^2(\hat\beta_R) - Bias^2(\hat\beta_R) , \\
 &= (EF - 1)Bias^2(\hat\beta_R) .
\end{aligned}$$
For any $k$, if $EF > 1$, then $MSE(\hat\beta) - MSE(\hat\beta_R) > 0$, i.e., RR has the smaller
MSE. Setting $k = k_{max}$ such that $EF_k = 1$, all $k$'s that are less than $k_{max}$
have a smaller MSE.
$EF_k$ can be used to indicate the performance of RR. If $EF_k > 1$, the ridge regression
is valid, but this is a sufficient and not a necessary condition, as $EF_k$ is a conservative
estimate of the true effectiveness index (Billor, 1992).
4.6.14 Optimal Value of k
There always exists a positive optimal ridge parameter $k$ (say $k_{opt}$) which gives
the minimum MSE (Kasarda and Shih, 1977; Theobald, 1974). Hoerl and Kennard
(1970b) called this the "existence theorem". The demand for uniqueness and
objectivity has energized the search for an optimal choice of the biasing parameter $k$.
4.6.15 Optimal k as non-stochastic parameter
The optimal $k_{opt}$ depends on the true regression coefficients ($\beta$) and the variance of
the residuals of the linear model, $\sigma^2$, and should be estimated from the observed data.
Methods that use the $y$ data for choosing the optimal value of the biasing parameter
$k$ are called stochastic methods. A choice of $k$ as a function of the regressors only is
non-stochastic, that is, $k$ is not a random variable. For example, the VIF does not depend
on the $y$ observations, so it is non-stochastic in nature.
The multicollinearity problem is due to correlation among the regressors themselves,
not to the correlation between the regressors and the regressand. Therefore, to reduce the
harmful effect of multicollinearity using RR, the optimal $k_{opt}$ should not depend on
any parameter that depends on the $y$ variable, such as $\beta$ and $\sigma^2$. In other words, the
optimal $k$ should be a non-stochastic parameter.
4.6.16 Effective Degrees of Freedom (EDF)
The ridge parameter is not really interpretable across models. It is not on a
natural scale and is not the most useful quantity for understanding the amount of shrinkage
taking place. Instead, the EDF allows one to interpret the impact of the penalty. To measure
flexibility in fit, the hat matrix ($H_{R_k} = X(X'X + kI)^{-1}X'$) is used to compute the
EDF, defined as
$$\begin{aligned}
df_{R_k} &= trace[H_{R_k}] , \\
 &= \sum_{j=1}^{p}\frac{\lambda_j}{\lambda_j + k} , \\
 &= \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + k} ,
\end{aligned}$$
where $\frac{\lambda_j}{\lambda_j + k}$ is called the shrinkage factor and the $d_j$ are the singular values of $X$. The EDF is a decreasing function of $k$ and has the
property that $df_{R_0} = p$ and, as $k \to \infty$, $df_{R_k} \to 0$.
Setting $df_{R_k} = trace[H_{R_k}]$ is a common approximation of the EDF for the RR model.
The error df can be given by $n - tr(2H - HH')$ and used in the denominator of the $\sigma^2$
estimate. Therefore, $tr(2H - HH')$ is the effective number of parameters in the
df of error.
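The equivalence of the trace and eigenvalue forms of the EDF, and the error df, can be checked with a few lines of base R. The sketch below uses simulated centered regressors without an intercept; the helper name `edf` and the grid of $k$ values are assumptions made for illustration.

```r
# Simulated, centered regressors (illustrative only).
set.seed(3)
n <- 25
X <- scale(matrix(rnorm(n * 4), n, 4), center = TRUE, scale = FALSE)
p <- ncol(X)

# EDF from the hat matrix and from the singular values, plus the error df.
edf <- function(X, k) {
  H <- X %*% solve(crossprod(X) + k * diag(ncol(X)), t(X))  # hat matrix H_Rk
  d <- svd(X)$d                                             # d_j^2 = lambda_j
  c(trace_H  = sum(diag(H)),
    eigen    = sum(d^2 / (d^2 + k)),
    error_df = nrow(X) - sum(diag(2 * H - H %*% t(H))))
}

res <- t(sapply(c(0, 0.5, 5, 50), function(k) edf(X, k)))
rownames(res) <- paste0("K=", c(0, 0.5, 5, 50))
res
```

At $k = 0$ both forms return $p$ and the error df equals $n - p$; for larger $k$ the EDF decreases towards zero.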
4.7 Methods of selecting values of k
As already discussed, the optimal value of $k$ is the one which gives the minimum MSE. There
is one optimal $k$ for any problem, while a wide range of $k$ ($0 < k < k_{max}$) gives a smaller
MSE than OLS. The addition of a small number to the diagonal elements
of the matrix $X'X$ decreases the CN and improves the conditioning of the matrix (Vinod
and Ullah, 1981).
For collinear data, a small change in $k$ changes the RR coefficients rapidly. At
some value of $k$, the ridge coefficients stabilize and the rate of change slows down
gradually to almost zero. Therefore, a disciplined way of selecting the shrinkage
parameter ($k$) that minimizes the MSE is required. The biasing parameter $k$ depends
on the true regression coefficients $\beta$ and the variance of the residuals $\sigma^2$,
which unfortunately are unknown, but can be estimated from the sample data.
Both theoretical and practical work on RR has proposed new methods for the
choice of $k$ and investigated the properties of the ridge estimates, since the biasing parameter
plays a key role and its optimal choice is the main issue in this context.
In the literature, there are many methods for estimating the biasing parameter $k$
(see, for example, Allen, 1974; Aslam, 2014a; Guilkey and Murphy, 1975; Hemmerle, 1975;
Hoerl et al., 1975; McDonald and Galarneau, 1975; Obenchain, 1975; Hocking et al.,
1976; Lawless and Wang, 1976; Vinod, 1976; Kasarda and Shih, 1977; Hemmerle and
Brantle, 1978; Wichern and Churchill, 1978; Nordberg, 1982; Saleh and Kibria, 1993;
Singh and Tracy, 1999; Wencheko, 2000; Kibria, 2003; Khalaf and Shukur, 2005;
Alkhamisi et al., 2006; Alkhamisi, 2007; Khalaf, 2011, 2013, among many more), but
there is no consensus about which method is preferable (Chatterjee and Hadi, 2006).
None of the estimation methods for the biasing parameter can guarantee a better
$k$, or even a smaller MSE than OLS, and each has its own
advantages and disadvantages. These estimation methods can be classified as (i)
subjective and (ii) objective methods. In this section, various methods of estimating
$k$ proposed by different researchers are discussed.
4.7.1 Subjective Methods
A good estimator should have a small prediction error (PE) on average; the
ridge estimates of $\beta$ lead to a substantial decrease in the PE and variance at the cost of the
small bias introduced in the model. Despite this, considerable controversy
exists over the selection of the biasing parameter $k$. Therefore, there are ways to check whether
a choice of $k$ is reasonable (Gruber, 1998). The ridge trace, df-trace, VIF-trace,
etc., provide graphical evidence of the effect of multicollinearity on the regression
coefficient estimates and also account for the variation of the ridge estimator as compared
to the OLSE. Selection of $k$ by these methods is judgmental or subjective.
• Ridge Trace
The ridge trace plot is used to find the best value of the ridge biasing parameter
$k$. The ridge trace is a plot of the regression coefficients $\hat\beta_R$ as a function of $k$ in the
interval $[0, 1]$, and is used to depict the effect of collinearity on each of the
coefficients. The effect of collinearity is suppressed when, in the ridge trace, the value of $k$
increases and all the coefficient estimates stabilize (see Chatterjee and
Hadi, 2006; Ahmad and Gilani, 2010).
The optimal value of $k$ is selected visually (subjective approach) as the point on the horizontal
axis at which the coefficients start to stabilize. At the selected value of $k$,
the RSS remains close to its minimum value and the $VIF_j$ gets closer to or
less than 10, where $VIF_j = 1$ is a characteristic of an orthogonal system and
$VIF_j < 10$ indicates a non-collinear or stable system (Chatterjee and
Hadi, 2006), while incorrect signs of coefficients at $k = 0$ change to
the proper signs at the optimal value of $k$. The ridge trace does not purport to
give a unique solution; rather, it renders a vaguely defined class of acceptable
solutions.
For higher values of $k$ the trace appears more stable, so apparent stability at larger
values of $k$ does not guarantee an optimal value of $k$ that gives estimates
better than OLS. In spite of its limitations, the ridge trace is still
a useful graphical representation for checking the optimal $k$, although newer methods are
available in the literature.
• df-Trace criterion
A ridge-trace-like method called the df-trace criterion, non-stochastic in nature
(Tripp, 1983), is based on the EDF. For its computation, see Section 4.6.16. The
df-trace procedure plots the EDF against $k$ and chooses the $k$ at which
the trace becomes stable (Tripp, 1983).
The EDF is a monotone decreasing function of the ridge parameter $k$ of the RR fit.
Note that $df_{R_k} = p$ when $k = 0$, i.e., no regularization is done, and $df_{R_k} \to
0$ as $k \to \infty$.
• VIF Trace
The VIF trace is similar to the ridge trace and the df-trace. The procedure is
to plot the VIF values against $k$ and choose the $k$ for which the VIF is less than 10 or
near to 1.
4.7.2 Objective Method (Based on Mathematical Formula)
Suppose we have a set of observations $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$ and the RR model
given in Eq. (4.5). The following methods also involve some judgment in the selection
of the biasing parameter $k$, but they require some calculations to obtain the biasing
parameter.
• Ck Statistic
$C_k$ is similar to Mallows' $C_p$ statistic (Kennard, 1971; Mallows, 1973). It is
based on a prediction criterion for the choice of $k$ and consists of
selecting the value of the ridge parameter $k$ for which $C_k$ is minimized.
$$\begin{aligned}
C_k &= \frac{SSR_k}{s^2} - n + 2 + 2\,trace(H_{R_k}) , \\
 &= \frac{SSR_k}{s^2} + 2(1 + trace(H_{R_k})) - n ,
\end{aligned}$$
where $SSR_k$ is the SSR and $H_{R_k}$ is the hat matrix from RR. In practice, $C_k$
may be plotted against different values of $k$, selecting the $k$ at which
$C_k$ is minimized.
• PRESSk
The predicted residual sum of squares (PRESS) measure was introduced by Allen
(1971, 1974). It is computed by dropping one observation at a time from
the model fitting and predicting the left-out observation for each choice of the
biasing parameter $k$. The procedure of dropping one observation is repeated $n$
times.
$$PRESS_k = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i,-i)k}\right)^2 = \sum_{i=1}^{n} e_{(i,-i)k}^2 ,$$
where $i = 1, 2, \cdots, n$ and $\hat{y}_{(i,-i)k}$ is the predicted value at the $i$th $x$ data point from
the model fitted to the data that leave out $y_i$. $\hat{y}_{(i,-i)k}$ cannot be too close to the
$y$ values, as $y_i$ is not used in calculating $\hat{y}_{(i,-i)k}$. The PRESS residuals are
true prediction errors, with $\hat{y}_{(i,-i)k}$ independent of $y_i$. All of the $n$ leave-one-out
PRESS residuals can be calculated by fitting the full regression model and using
$$PRESS = \sum_{i=1}^{n}\left(\frac{e_{ik}}{1 - \frac{1}{n} - h_{ii\,R_k}}\right)^2 ,$$
where $e_{ik}$ is the $i$th residual at a specific value of $k$ and $h_{ii\,R_k}$ is the $i$th diagonal
element of the hat matrix. Plotting PRESS against $k$ may help in selecting the
optimal value of the biasing parameter.
The PRESS measure can be used to compute the cross-validated R-square, that is,
$$R^2_{cv} = 1 - \frac{PRESS_k}{TSS_{n-1}} .$$
• Selection of k via Cross Validation (CV)
The cross validation (CV) method can be used to assess the predictive quality
of penalized prediction models, such as ridge regression. It is also used to
compare the predictive ability of different values of the shrinkage parameter
$k$. A large range of possible values of $k$, $[0, c]$, is selected. For each
fixed value of $k$ in $[0, c]$, consider the CV as follows for each observation $i$. The
PE for $(X_i, y_i)$ is
$$err_{ik} = \left(y_i - X_i\hat\beta^{\,i}_{R_k}\right)^2 .$$
The CV value is then
$$CV_k = n^{-1}\sum_{i=1}^{n} err_{ik} .$$
The best $k$ is the minimum point of $CV_k$.
• Selection of k via Generalized Cross Validation (GCV)
Golub et al. (1979) recommended a method based on minimization of the
PRESS criterion, called GCV, which has some advantages over Allen's
PRESS. With $k$, RR gives fitted values for the observations as
$$\hat{y} = X\hat\beta_R = X(X'X + kI)^{-1}X'y = H_{R_k}y .$$
The GCV is then
$$\begin{aligned}
GCV_k &= \frac{\sum_{i=1}^{n} e_{ik}^2}{[n - (1 + trace(H_{R_k}))]^2} , \\
 &= \frac{SSR_k}{[n - (1 + trace(H_{R_k}))]^2} , \\
 &= \frac{\sum_{i=1}^{n} e_{ik}^2}{\left[\sum_{i=1}^{n}\left(1 - \frac{1}{n} - h_{iik}\right)\right]^2} ,
\end{aligned}$$
where $1 + trace(H_{R_k})$ is the effective sample size. The best $k$ is the minimum point
of $GCV_k$. For further details, see Golub et al. (1979).
• Index of Stability of Relative Magnitude (ISRM)
To quantify the concept of a stable region, Vinod (1976) proposed a
numerical measure called ISRM,
$$ISRM_k = \sum_{j=1}^{p}\left(\frac{p\left(\frac{\lambda_j}{\lambda_j + k}\right)^2}{\lambda_j\sum_{l=1}^{p}\frac{\lambda_l}{(\lambda_l + k)^2}} - 1\right)^2 .$$
For an orthogonal system, ISRM will be zero, while for $k = 0$, ISRM will be largest.
A plot of $ISRM_k$ as a function of $k$ can have multiple local minima. The
global minimum of $ISRM_k$ tends to emphasize stability without regard to the
bias. Considering the importance of bias, Vinod and Ullah (1981) suggested choosing a
point where the bulk of the potential reduction in ISRM is achieved, say
at the first local minimum or at a pre-specified percentage (say 50%) of the
potential reduction.
• Multicollinearity Allowance (m-scale)
Hoerl and Kennard (1970b) suggested the range $0 \le k \le 1$ for plotting the ridge trace,
which can be misleading, because in general $k$ has the infinite range $0 <
k < \infty$. Therefore, Vinod (1976) proposed another non-stochastic choice of
$k$, based on a new horizontal scaling for the ridge trace having a finite range
($0 \le m \le p$), and called this scale the multicollinearity allowance $m$, or the modified
Hoerl and Kennard ridge trace. Vinod (1976) suggested that, instead of
plotting the ridge coefficients as a function of the biasing parameter $k$, they should be
plotted as a function of $m$, defined below:
$$\begin{aligned}
m &= p - \sum_{j=1}^{p}\frac{\lambda_j}{\lambda_j + k} , \\
 &= p - \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + k} .
\end{aligned}$$
When $k = 0$, $m = 0$, and when $k = \infty$, $m = p$, where $p$ is the number of regressors.
There is a unique value of $m$ for each $k$ and vice versa. Essentially, $m$ indicates the
deficiency in the rank of $X'X$.
• Using Information Criteria
Information criteria are a common way of choosing among models while balancing
the competing goals of fit and parsimony. In order to apply AIC or BIC to the
problem of choosing $k$, an estimate of the df is required (see Section 4.6.16).
The model selection criteria AIC and BIC are computed by quantifying the df
in the ridge regression model and can be used for the choice of the optimal $k$:
$$AIC = n\log(RSS) + 2\,df ,$$
$$BIC = n\log(RSS) + df\,\log(n) ,$$
where RSS is the ridge RSS.
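Several of the criteria above ($C_k$, GCV, the m-scale, AIC and BIC) depend only on the ridge residuals and on the trace of the hat matrix, so they can be tabulated together over a grid of $k$. The base R sketch below follows the formulas as written in this section on simulated data in correlation form. The helper name `criteria`, the grid, and the choice $s^2 = SSR_0/(n - p)$ for the OLS residual variance are assumptions made for illustration, not the definitions used by any particular package.

```r
# Simulated data in correlation form with a centered response (illustrative only).
set.seed(4)
n <- 40
z <- rnorm(n)
X <- scale(cbind(z + rnorm(n, sd = 0.1), z + rnorm(n, sd = 0.1), rnorm(n)),
           center = TRUE, scale = FALSE)
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
y <- drop(X %*% c(3, 3, 1) + rnorm(n))
y <- y - mean(y)
p <- ncol(X)

# Assumed OLS residual variance s^2 with n - p degrees of freedom.
s2 <- sum((y - X %*% solve(crossprod(X), crossprod(X, y)))^2) / (n - p)

# Selection criteria for a single value of k.
criteria <- function(k) {
  H   <- X %*% solve(crossprod(X) + k * diag(p), t(X))  # ridge hat matrix
  ssr <- sum((y - drop(H %*% y))^2)                      # ridge SSR (RSS)
  df  <- sum(diag(H))                                    # EDF = trace(H)
  c(K = k, df = df, m = p - df,
    Ck  = ssr / s2 - n + 2 + 2 * df,
    GCV = ssr / (n - (1 + df))^2,
    AIC = n * log(ssr) + 2 * df,
    BIC = n * log(ssr) + df * log(n))
}

k_grid <- seq(0, 1, by = 0.01)
tab <- t(sapply(k_grid, criteria))
k_gcv <- k_grid[which.min(tab[, "GCV"])]   # k at which GCV is smallest on the grid
head(round(tab, 4))
```

Under these assumptions $C_k$ equals $p + 2$ and $m$ equals zero at $k = 0$, which gives a quick check on the implementation before reading off the minimizing $k$ for each criterion.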
There are other methods for estimating the biasing parameter $k$. The following is a list
of various $k$ estimators from different authors.
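Table 4.1: Existing ridge parameters k available in the literature
Sr. #    Formula    Reference
1)    $K_{HKB} = \dfrac{p\hat\sigma^2}{\hat\beta'\hat\beta}$    Hoerl and Kennard (1970b)
2)    $K_{TH} = \dfrac{(p-2)\hat\sigma^2}{\hat\beta'\hat\beta}$    Thisted (1976)
3)    $K_{LW} = \dfrac{p\hat\sigma^2}{\sum_{j=1}^{p}\lambda_j\hat\alpha_j^2}$    Lawless and Wang (1976)
4)    $K_{DS} = \dfrac{\hat\sigma^2}{\hat\beta'\hat\beta}$    Dwividi and Shrivastava (1978)
5)    $K_{LW} = \dfrac{(p-2)\hat\sigma^2 \times n}{\hat\beta'X'X\hat\beta}$    Venables and Ripley (2002b)
6)    $K_{AM} = \dfrac{1}{p}\sum_{j=1}^{p}\dfrac{\hat\sigma^2}{\hat\alpha_j^2}$    Kibria (2003)
7)    $K_{GM} = \dfrac{\hat\sigma^2}{\left(\prod_{j=1}^{p}\hat\alpha_j^2\right)^{1/p}}$    Kibria (2003)
8)    $K_{MED} = \operatorname{Median}\left(\dfrac{\hat\sigma^2}{\hat\alpha_j^2}\right)$    Kibria (2003)
9)    $K_{KM2} = \max\left(\dfrac{1}{\sqrt{\hat\sigma^2/\hat\alpha_j^2}}\right)$    Muniz and Kibria (2009)
10)    $K_{KM3} = \max\left(\sqrt{\dfrac{\hat\sigma^2}{\hat\alpha_j^2}}\right)$    Muniz and Kibria (2009)
11)    $K_{KM4} = \left(\prod_{j=1}^{p}\dfrac{1}{\sqrt{\hat\sigma^2/\hat\alpha_j^2}}\right)^{1/p}$    Muniz and Kibria (2009)
12)    $K_{KM5} = \left(\prod_{j=1}^{p}\sqrt{\dfrac{\hat\sigma^2}{\hat\alpha_j^2}}\right)^{1/p}$    Muniz and Kibria (2009)
13)    $K_{KM6} = \operatorname{Median}\left(\dfrac{1}{\sqrt{\hat\sigma^2/\hat\alpha_j^2}}\right)$    Muniz and Kibria (2009)
14)    $K_{KM8} = \max\left(\dfrac{1}{\sqrt{\dfrac{\lambda_{max}\hat\sigma^2}{(n-p)\hat\sigma^2 + \lambda_{max}\hat\alpha_j^2}}}\right)$    Muniz et al. (2012)
15)    $K_{KM9} = \max\left(\sqrt{\dfrac{\lambda_{max}\hat\sigma^2}{(n-p)\hat\sigma^2 + \lambda_{max}\hat\alpha_j^2}}\right)$    Muniz et al. (2012)
16)    $K_{KM10} = \left(\prod_{j=1}^{p}\dfrac{1}{\sqrt{\dfrac{\lambda_{max}\hat\sigma^2}{(n-p)\hat\sigma^2 + \lambda_{max}\hat\alpha_j^2}}}\right)^{1/p}$    Muniz et al. (2012)
17)    $K_{KM11} = \left(\prod_{j=1}^{p}\sqrt{\dfrac{\lambda_{max}\hat\sigma^2}{(n-p)\hat\sigma^2 + \lambda_{max}\hat\alpha_j^2}}\right)^{1/p}$    Muniz et al. (2012)
18)    $K_{KM12} = \operatorname{Median}\left(\dfrac{1}{\sqrt{\dfrac{\lambda_{max}\hat\sigma^2}{(n-p)\hat\sigma^2 + \lambda_{max}\hat\alpha_j^2}}}\right)$    Muniz et al. (2012)
19)    $K_{KD} = \max\left(0, \dfrac{p\hat\sigma^2}{\hat\alpha'\hat\alpha} - \dfrac{1}{n(VIF_j)_{max}}\right)$    Dorugade and Kashid (2010)
20)    $K_{4(AD)} = \operatorname{HarmonicMean}[K_{i(AD)}] = \dfrac{2p}{\lambda_{max}}\sum_{j=1}^{p}\dfrac{\hat\sigma^2}{\hat\alpha_j^2}$    Dorugade (2014)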
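A few of the simpler estimators in Table 4.1 (rows 1 to 4 and 6 to 8) can be computed directly from the OLS fit in canonical form, using $\hat\beta'\hat\beta = \hat\alpha'\hat\alpha$. The base R sketch below does this on simulated data. The vector name `k_est` and the use of $n - p$ degrees of freedom for $\hat\sigma^2$ are assumptions made for illustration, so the numbers need not match those of any particular package.

```r
# Simulated data in correlation form with a centered response (illustrative only).
set.seed(5)
n <- 40
z <- rnorm(n)
X <- scale(cbind(z + rnorm(n, sd = 0.1), z + rnorm(n, sd = 0.1), rnorm(n)),
           center = TRUE, scale = FALSE)
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
y <- drop(X %*% c(3, 3, 1) + rnorm(n))
y <- y - mean(y)
p <- ncol(X)

# OLS in canonical form: alpha = Lambda^{-1} X*'y with X* = XP.
eig    <- eigen(crossprod(X), symmetric = TRUE)
lambda <- eig$values
alpha  <- drop(crossprod(X %*% eig$vectors, y)) / lambda
sigma2 <- sum((y - X %*% eig$vectors %*% alpha)^2) / (n - p)  # assumed df: n - p

k_est <- c(
  HKB = p * sigma2 / sum(alpha^2),           # row 1
  TH  = (p - 2) * sigma2 / sum(alpha^2),     # row 2
  LW  = p * sigma2 / sum(lambda * alpha^2),  # row 3
  DS  = sigma2 / sum(alpha^2),               # row 4
  AM  = mean(sigma2 / alpha^2),              # row 6
  GM  = sigma2 / prod(alpha^2)^(1 / p),      # row 7
  MED = median(sigma2 / alpha^2)             # row 8
)
round(k_est, 5)
```

By construction $K_{TH} = \frac{p-2}{p}K_{HKB}$ and $K_{DS} = K_{HKB}/p$, and $K_{AM} \ge K_{GM}$ by the inequality between the arithmetic and geometric means.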
4.8 Performance Criteria
The quality of a regression method is often judged by the predictive capability of its
estimator(s). Therefore, the goal is usually to select the regression
model that gives the best prediction. For this purpose, different model selection criteria
are used to measure the goodness of prediction. The SSR and $R^2$ do not reflect the
future prediction capability of the regression model, but are measures of goodness of
fit. To check the performance of the estimator(s), cross-validation methods can be
used to obtain superior parameter estimates and model predictive power (Månsson
et al., 2010). Gauss (1809) suggested the MSE criterion for choosing among estimators
(Vinod and Ullah, 1981).
4.9 Estimation and Testing of Ridge Coefficients
Coutsourides and Troskie (1979) and Obenchain (1975) have shown that a non-stochastic
ridge biasing parameter $k$ yields the same exact $t$ or $F$-tests for
any linear hypothesis as OLS does. Halawa and El-Bassiouni (2000) investigated a
non-exact $t$-type test for the individual RR coefficients, based on the $t$-test. It has been
established that $\hat\beta_R \sim N(ZX\beta, \phi = Z\Omega Z')$, where $Z = (X'X + kI_p)^{-1}X'$. Therefore,
for the $j$th ridge coefficient, $\hat\beta_{Rj} \sim N(Z_jX\beta, \phi_{jj} = Z_j\Omega Z_j')$ (see Aslam, 2014b; Halawa and
El-Bassiouni, 2000). For testing $H_0 : \beta_{Rj} = 0$ against $H_1 : \beta_{Rj} \ne 0$, Halawa and
El-Bassiouni define a non-exact $t$-statistic as
$$T_{kj} = \frac{\hat\beta_{R_kj}}{SE(\hat\beta_{R_kj})} ,$$
where $\hat\beta_{R_kj}$ is the $j$th RR coefficient and $SE(\hat\beta_{R_kj})$ is an estimate of its standard error,
which is the square root of the $j$th diagonal element of the covariance matrix from
Eq. (4.20).
The statistic $T_{kj}$ is assumed to follow a Student's $t$ distribution with $(n - p)$ d.f.
(Halawa and El-Bassiouni, 2000). Hastie and Tibshirani (1990) and Cule and De Iorio
(2012) suggested using $[n - tr(H)]$ d.f. For large sample sizes, the asymptotic
distribution of this statistic is normal (Halawa and El-Bassiouni, 2000). Thus, reject
$H_0$ when $|T| > Z_{1-\alpha/2}$.
For testing the significance of the ORRE $\hat\beta_R$ with $E(\hat\beta_R) = ZX\beta$ and an estimate of
$Cov(\hat\beta_R)$, the $F$-statistic is
$$F = \frac{1}{p}(\hat\beta_R - ZX\beta)'\left(Cov(\hat\beta_R)\right)^{-1}(\hat\beta_R - ZX\beta) .$$
4.10 R Package Development for RR
In this section, we illustrate the implementation of linear RR, the estimation of ridge
biasing parameters on the Hald data, the testing of ridge coefficients, and different ridge
related measures available in the literature for the selection of the biasing parameter $k$. The
biasing parameters are computed using methods already available in the literature and
proposed by different authors (see Table 4.1). The intercept is not penalized; however,
it is estimated by the relation in Eq. (4.7).
We developed the lmridge package in the R language; it contains functions related to the
fitting of linear RR models.
After the lmridge package has been installed successfully, it needs to be loaded so that
the package functions become accessible in the current R session, that is,
> library(lmridge)
For help on the different functions available in this package, use the following command:
> library(help = "lmridge")
Objects of class lmridge have a set of standard methods such as print(),
summary(), plot(), predict(), etc. Inferences can be made easily using the
summary() method for assessing the regression coefficients, their standard errors,
t-values and their respective p-values. The basic function lmridge() calls
lmridgeEst(), which performs the estimation for given values of the non-stochastic biasing
parameter k. The syntax is
lmridge(formula, data, scaling, K, ...)
The arguments of lmridge() are:
Table 4.2: Description of lmridge package arguments
Argument    Description
formula     A symbolic representation of the linear ridge regression model, of the form response variable ∼ predictors.
data        A data frame containing the variables to be used in the ridge regression model.
K           Biasing parameter; may be a scalar or a vector. If a K value is not provided, K = 0 is used as the default value, i.e. OLS results are produced.
scaling     The method for scaling the predictors. The "sc" option is the default and scales the predictors in correlation form, the "scaled" option standardizes the predictors to zero mean and unit variance, while the "centered" option centers the predictors.
The lmridge() function returns an object of class "lmridge". The functions summary(),
kest(), vif(), etc., are used to compute and print a summary of the RR results, the
list of biasing parameters, the VIF values, etc., respectively, after the bias is introduced.
An object of class "lmridge" is a list that contains the following components:
Table 4.3: Objects from "lmridge" class
Object     Description
coef       A named vector of fitted ridge coefficients.
xscale     The scales used to standardize the predictors.
xs         The scaled matrix of predictors.
y          The centered response variable.
Inter      Whether the intercept is included in the model or not.
K          The RR (biasing) parameter(s).
xm         A vector of means of the design matrix X.
ym         The mean of the response variable.
rfit       Vector(s) of ridge fitted values for each biasing parameter k.
d          Singular values of the SVD of the scaled predictors.
div        Eigenvalues of the scaled regressors for each biasing parameter k.
scaling    The method of scaling used to standardize the predictors.
call       The matched call.
terms      The terms object used.
Z          A matrix $(X'X + kI_p)^{-1}X'$ for each biasing parameter.
For further details about the different functions available in the package, see Section 2.1
and the documentation bundled with the lmridge package.
4.10.1 Use of lmridge Package
4.10.1.1 Numerical Example
The use of lmridge will be explained through examples using the Hald data.
> data(Hald)
> mod <- lmridge(y ~ X1 + X2 + X3 + X4, data = as.data.frame(Hald),
+                scaling = "sc", K = c(0.01, 0.05, 0.5, 0.9, 1))
The output of the linear RR from the lmridge package is assigned to the object mod. The
first argument of the function is "formula", which is used to specify the required linear
RR model for the data provided in the second argument. Simply typing the object
"mod" at the R prompt yields an object of class "lmridge" with the following descaled
coefficients:
> mod
Call:
lmridge.default(formula = y ~ X1 + X2 + X3 + X4, data = as.data.frame(Hald),
    K = c(0.01, 0.05, 0.5, 0.9, 1), scaling = "sc")
Intercept X1 X2 X3 X4
K=0.01 82.67556 1.3152096 0.3061154 -0.1290181 -0.3429388
K=0.05 85.83062 1.1917230 0.2885040 -0.2179601 -0.3542328
K=0.5 89.19604 0.7882243 0.2709639 -0.3639132 -0.2806435
K=0.9 90.22732 0.6535144 0.2420838 -0.3476935 -0.2415216
K=1 90.42083 0.6285484 0.2353990 -0.3411940 -0.2335823
To get the scaled ridge coefficients, use mod$coef; the results will be
> mod$coef
K=0.01 K=0.05 K=0.5 K=0.9 K=1
X1 26.800306 24.28399 16.061814 13.316802 12.808065
X2 16.500987 15.55166 14.606166 13.049400 12.689060
X3 -2.862655 -4.83610 -8.074509 -7.714626 -7.570415
X4 -19.884534 -20.53939 -16.272482 -14.004088 -13.543744
The object of class "lmridge" returns components such as rfit, K, coef, etc.
For the fitted model, the generic method summary() is used to investigate the ridge
coefficients. The parameter estimates of the ridge model are summarized in a matrix
of 5 columns, namely estimates, estimates (Sc), StdErr (Sc), t-values (Sc) and
P(>|t|) for the ridge coefficients. The following results are shown only for the biasing parameter
k = 0.0132, which produces the minimum MSE compared to the others given in the
argument.
> summary(mod)
Call:
lmridge.default(formula = y ~ X1 + X2 + X3 + X4, data = as.data.frame(Hald),
    K = 0.0132, scaling = "sc")

Coefficients: for Ridge parameter K= 0.0132
     Estimate Estimate (Sc) StdErr (Sc) t-value (Sc) Pr(>|t|)
Int.  83.4374     -236.2124    123.6106       -1.911       NA
X1     1.2989       26.4687      3.7460        7.066 1.60e-12 ***
X2     0.2997       16.1574      4.3183        3.742 0.000183 ***
X3    -0.1425       -3.1607      3.6834       -0.858 0.390833
X4    -0.3488      -20.2234      4.3581       -4.640 3.48e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ridge Summary
     R2  adj-R2 DF ridge         F      AIC      BIC
0.96870 0.95820  3.02928 134.22765 23.22356 58.27929
Ridge minimum MSE= 390.6084 at K= 0.0132
The summary() function also displays R2, adjusted-R2, df, F-statistics, AIC, BIC
and the minimum MSE at the given k passed to lmridge(). The kest() function, which
works with the fitted model, computes different biasing parameters developed by various
authors (see Table 4.1). The list of different k values (22 in number) may help in
deciding the amount of bias in the regression.
> kest(mod)
Ridge k from different Authors
k values
Thisted (1976): 0.00581
Dwividi & Srivastava (1978): 0.00291
LW (lm.ridge) 0.05183
LW (1976) 0.00797
HKB (1975) 0.01162
Kibria (2003) (AM) 0.28218
Minimum GCV at 0.01320
Minimum CV at 0.01320
Kibria 2003 (GM): 0.07733
Kibria 2003 (MED): 0.01718
Muniz et al. 2009 (KM2): 14.84574
Muniz et al. 2009 (KM3): 5.32606
Muniz et al. 2009 (KM4): 3.59606
Muniz et al. 2009 (KM5): 0.27808
Muniz et al. 2009 (KM6): 7.80532
Mansson et al. 2012 (KMN8): 14.98071
Mansson et al. 2012 (KMN9): 0.49624
Mansson et al. 2012 (KMN10): 6.63342
Mansson et al. 2012 (KMN11): 0.15075
Mansson et al. 2012 (KMN12): 8.06268
Dorugade et al. 2010: 0.00000
Dorugade et al. 2014: 0.00000
The rstats1() and rstats2() functions can be used to compute different statistics for the ridge
biasing parameter, such as the MSE, squared bias, F-test, ridge variance, degrees of
freedom, condition numbers, PRESS, ISRM, etc. The following are results from the
rstats1() and rstats2() functions for some values of k (k = 0, 0.1, 0.0132 and 0.2).
> rstats1(mod)
Ridge Regression Statistics 1:
          Variance   Bias^2       MSE        F    rfact     R2 adj-R2        CN
K=0      3309.5049   0.0000 3309.5049 125.4142 622.3006 0.9824 0.9765 1376.8806
K=0.1      19.8579 428.4112  448.2692 114.1900   3.3998 0.8914 0.8552   22.9838
K=0.0132   65.2404 325.3680  390.6084 134.2277  13.1295 0.9687 0.9582  151.7096
K=0.2      16.5720 476.8887  493.4606  87.1322   2.1649 0.8170 0.7560   12.0804
> rstats2(mod)
Ridge Regression Statistics 2:
              CK RSigma^2 DF ridge    REDF     EF   ISRM m scale    PRESS
K= 0      6.0000   5.3182   4.0000  9.0000 0.0000 3.9872  0.0000 110.3470
K= 0.1    4.2246   5.8409   2.5646 10.0954 7.6829 2.8471  1.4354 121.2892
K= 0.0132 4.8560   4.9690   3.0293  9.7974 9.9570 3.5806  0.9707  92.9892
K= 0.2    3.8630   7.6547   2.2960 10.2710 6.9156 2.5742  1.7040 162.2832
The residuals, fitted values from the RR and predicted values of the response variable y
can be computed using the functions residuals(), fitted() and predict(). To obtain the
variance-covariance matrix, VIF and hat matrix, the functions vcov(), vif() and
hatr() can be used. Note that the df are computed by following Hastie and Tibshirani
(1990). The results for VIF, Var-Cov and diagonal elements of the hat matrix from
these functions are given below for k = 0.0132. For a detailed description see the
lmridge package documentation.
> hatr(mod) #hat matrix for all K's
> hatr(mod)[[1]] #hat matrix for first K
> diag(hatr(mod)[[1]]) #diagonal element for first K
> predict(mod) #predicted values
> mod$rfit #ridge fitted values
> resid(mod) #residuals for given K
> infocr(mod) #AIC and BIC values
> vif(mod) #vif values
X1 X2 X3 X4
K=0 38.4962115 254.4231659 46.8683863 282.5128648
K=0.1 1.2839005 0.5157626 1.2040981 0.3960310
K=0.0132 2.8239987 3.7528395 2.7303625 3.8223356
K=0.2 0.7868168 0.3453035 0.7519627 0.2808451
> vcov(mod)
$`K=0.0132 `
X1 X2 X3 X4
X1 14.0324002 0.6832754 11.014289 3.062890
X2 0.6832754 18.6477943 1.957465 16.115540
X3 11.0142893 1.9574655 13.567124 3.404259
X4 3.0628896 16.1155403 3.404259 18.993119
For given values of X, such as the first five rows of the X matrix, the predicted values
will be,
> predict(mod , newdata=as.data.frame(Hald [1:5 ,-1]))
K=0 K=0.1 K=0.0132 K=0.2
1 78.49524 79.75123 78.54153 80.73845
2 72.78880 74.32685 73.15534 75.38193
3 105.97094 106.04949 106.39587 105.62443
4 89.32710 89.52352 89.48523 89.65431
5 95.64924 96.56705 95.75191 96.99775
4.10.1.2 Graphical Results
The effect of multicollinearity on the coefficient estimates can be identified using
different graphical displays such as ridge, VIF and df traces, plots of RSS against
df, and PRESS vs k etc. Therefore, for selection of the optimal k using subjective
(judgmental) methods, different plot functions are also built. For example, the ridge
or VIF trace can be plotted using the plot() function. The arguments to the plot function
are abline=TRUE and type=c("ridge", "vif"). By default, the ridge trace is plotted
with a horizontal line at y = 0 and a vertical line on the x-axis at the k with
minimum GCV.
> mod <-lmridge(Y~.,data=dt, K=seq(0, 0.5, 0.001))
> plot(mod)
> plot(mod , type="vif", abline=FALSE)
> plot(mod , type="ridge", abline=TRUE)
The bias-variance tradeoff plot, produced by the bias.plot() function, can be used to
select the optimal k.
> bias.plot(mod , abline=TRUE)
Figure 4.3: Bias Variance Tradeoff
To choose the optimal k from a plot of the model selection criteria AIC and BIC, the
info.plot() function can be used.
> info.plot(mod , abline=TRUE)
Figure 4.4: AIC and BIC Model Selection Criteria
Function cv.plot() plots the CV and GCV cross validation against biasing
parameter k for the optimal selection of k, that is,
> cv.plot(mod , abline=TRUE)
Figure 4.5: CV and GCV, Cross Validation Plots
The m-scale and ISRM measures of Vinod (1976) can also be plotted with the
isrm.plot() function and used to judge the optimal value of k.
> isrm.plot(mod)
Figure 4.6: m-scale and ISRM Plots
Function rplots.plot() plots a panel of three plots, namely (i) the df trace, (ii) RSS vs k
and (iii) PRESS vs k, which can be used to judge the optimal value of k.
> rplots.plot(mod)
Figure 4.7: Miscellaneous Ridge Plots
Chapter 5
The Liu Estimator: A Concise
Review and R Package Development
In case of severe collinearity (non-orthogonality) among the regressors, the ridge
biasing parameter k selected by the different existing methods may not fully
address the problem of ill-conditioning. For larger values of k, the distance between
the estimated and true values of the regression parameters increases.
The ridge coefficient $\hat\beta_k$ is a complicated function of k when some popular methods
such as Golub et al. (1979), Mallows (1973) and McDonald and Galarneau (1975) etc.
are used for the (optimal) selection of k. Usually k is quite small in applications,
which is why the selected k may not be enough to correct the problem of
ill-conditioned X's. In such cases, the RR may still be unstable. Moreover, the choice of
k is left to the researcher and there is no consensus on how to select the
optimal k; therefore other innovative methods were needed to deal with collinear
data.
In the literature, mixed regression estimation (MRE) and ridge type regression are
suggested to overcome the effect of collinearity among regressors. To deal with multicollinear
data, Liu (1993) formulated a new class of biased estimators that combines the benefits
of the ORR by Hoerl and Kennard (1970b), see Eq. (4.5), and the Stein type estimator (1956),
$\hat\beta_s = c\hat\beta$, where c is a parameter with 0 < c < 1, while avoiding their
disadvantages. The LE can be written as
\[
\begin{aligned}
\hat\beta_d &= (X'X + I_p)^{-1}(X'y + d\hat\beta) , \\
            &= (X'X + I_p)^{-1}(X'X + dI_p)\hat\beta , \\
            &= F_d\,\hat\beta , 
\end{aligned}
\tag{5.1}
\]
where d is the Liu parameter, also known as the biasing (tuning or shrinkage) parameter,
which lies between 0 and 1 (i.e. 0 ≤ d ≤ 1), $I_p$ is the identity matrix of order p ×
p, $\hat\beta$ is the OLSE and $F_d = (X'X + I_p)^{-1}(X'X + dI_p)$. The estimator $\hat\beta_d$
is named the Liu estimator (LE) by Akdeniz and Kaciranlar (1995) and Gruber (1998). Recently, in
econometrics, engineering and other statistical areas, the LE has produced a number
of new techniques and ideas, see for example Akdeniz and Kaciranlar (2001);
Hubert and Wijekoon (2006); Jahufer and Chen (2009, 2011, 2012); Kaciranlar and
Sakallıoğlu (2001); Kaciranlar et al. (1999); Torigoe and Ujiie (2006).
Augmenting $0 = \sqrt{k}\,\beta + \varepsilon$ and $d\hat\beta = \beta + \varepsilon$ to Eq. (2.1),
$\hat\beta_R$ and $\hat\beta_d$ can be obtained, respectively, by using the OLS method. Like ridge,
Liu regression also penalizes the size of the coefficients, and d controls the strength of
the penalty term used in the model.
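The two routes to the LE described above (the closed form of Eq. (5.1) and OLS on the augmented system) can be compared numerically. The following is only a minimal sketch on simulated data: the helper liu_coef() and the simulated regressors are illustrative assumptions and are not part of any package.

```r
# Simulated collinear data (hypothetical; not the Hald data)
set.seed(1)
n <- 50; p <- 3
z <- rnorm(n)
X <- cbind(z + rnorm(n, sd = 0.1), z + rnorm(n, sd = 0.1), rnorm(n))
X <- scale(X, center = TRUE, scale = FALSE)       # center the regressors
y <- drop(X %*% c(1, 2, -1) + rnorm(n))
y <- y - mean(y)                                   # center the response

# liu_coef() is an illustrative helper, not a package function
liu_coef <- function(X, y, d) {
  XtX  <- crossprod(X)
  Ip   <- diag(ncol(X))
  bols <- solve(XtX, crossprod(X, y))              # OLSE
  drop(solve(XtX + Ip, (XtX + d * Ip) %*% bols))   # closed form of Eq. (5.1)
}

d <- 0.5
b_closed <- liu_coef(X, y, d)

# Same estimate from OLS on the augmented system [y; d * bols] = [X; I] beta + error
bols  <- liu_coef(X, y, 1)                         # d = 1 reproduces the OLSE
X_aug <- rbind(X, diag(p))
y_aug <- c(y, d * bols)
b_aug <- unname(lm.fit(X_aug, y_aug)$coefficients)

all.equal(b_closed, b_aug)
```

Under these assumptions both computations should agree up to rounding error, which illustrates why d acts as the weight of the penalty rows added to the design matrix.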
However, Liu (2011) and Druilhet and Mom (2008) have stated that the
biasing parameter d may lie outside the range given by Liu (1993), that is, it may
be less than 0 or greater than 1. The LE is a linear transformation of the OLSE, that
is, the LE is a shrunken version of the OLSE, and for d = 1, $\hat\beta_d = \hat\beta_{ols}$. The linear
LE has been widely used and is also a powerful tool for the detection of influential observations.
The main interest in the LE is a suitable selection of d at which the MSE is minimum and the
efficiency of the estimator improves as compared to other values of d. Liu (1993)
provided some important methods for the selection of d and also provided a numerical
example using an iterative minimum MSE method to get the smallest possible value to
overcome the problem of collinearity in an effective manner.
5.1 Reparameterization
It is recommended that $X_{n\times p}$ and $y_{n\times 1}$ be standardized first, such that the
information matrix X'X is in correlation form and the vector X'y contains the
correlations between the regressors and the response. For estimation of the Liu
parameter, the regressors and the response variable are centered.
Consider the regression model $y = \beta_0\mathbf{1} + X\beta_1 + \varepsilon$, where X is centered and
$\mathbf{1} = (1, 1, \cdots, 1)'$, while $\beta_0$ can be estimated by using $\bar{y}$. Let
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ be the ordered eigenvalues of the matrix X'X and
$q_1, q_2, \cdots, q_p$ be the eigenvectors corresponding to these eigenvalues, such that
$Q = (q_1, q_2, \cdots, q_p)$ is the orthogonal matrix of eigenvectors of X'X and
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. The model can therefore be
rewritten in canonical form as $y = \beta_0\mathbf{1} + Z\alpha + \varepsilon$, where $Z = XQ$ and $\alpha = Q'\beta_1$.
Note that $\Lambda = Z'Z = Q'X'XQ$. The estimate of $\alpha$ is $\hat\alpha = \Lambda^{-1}Z'y$. Similarly, Eq.
(5.1) can be rewritten in canonical form as
\[
\hat\alpha_d = (\Lambda + I_p)^{-1}(Z'y + d\hat\alpha) .
\]
The corresponding estimates of $\beta_1$ and $\beta_d$ can be obtained from the relations
$\hat\beta_1 = Q\hat\alpha$ and $\hat\beta_d = Q\hat\alpha_d$, respectively. For simplification of notation,
Z and α will be represented as X and β, respectively.
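The canonical form can be checked numerically. The sketch below uses simulated data (an assumption for illustration only) and verifies that $Q\hat\alpha_d$ reproduces the direct evaluation of Eq. (5.1).

```r
set.seed(2)
n <- 40
z <- rnorm(n)
X <- scale(cbind(z + rnorm(n, sd = 0.2), z + rnorm(n, sd = 0.2), rnorm(n)),
           center = TRUE, scale = FALSE)
y <- drop(X %*% c(1, -1, 0.5) + rnorm(n)); y <- y - mean(y)

eig    <- eigen(crossprod(X), symmetric = TRUE)
lambda <- eig$values                   # lambda_1 >= ... >= lambda_p
Q      <- eig$vectors                  # orthogonal matrix of eigenvectors
Z      <- X %*% Q                      # canonical regressors, Z'Z = Lambda

alpha_ols <- drop(crossprod(Z, y)) / lambda          # Lambda^{-1} Z'y
d <- 0.3
# (Lambda + I)^{-1} (Z'y + d * alpha_ols), written elementwise
alpha_d <- (lambda + d) / (lambda + 1) * alpha_ols
beta_d  <- drop(Q %*% alpha_d)                       # back to the original coordinates

# direct evaluation of Eq. (5.1) for comparison
XtX <- crossprod(X)
beta_direct <- drop(solve(XtX + diag(3),
                          (XtX + d * diag(3)) %*% solve(XtX, crossprod(X, y))))
all.equal(beta_d, beta_direct)
```

Because $\Lambda$ is diagonal, each canonical coefficient is simply the OLS value multiplied by $(\lambda_j + d)/(\lambda_j + 1)$, which makes the shrinkage along each eigen-direction explicit.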
The fitted values of the LE can be found using Eq. (5.1),
\[
\begin{aligned}
\hat{y}_d &= X\hat\beta_d \\
          &= X(X'X + I_p)^{-1}(X'y + d\hat\beta) \\
          &= H_d\, y ,
\end{aligned}
\]
where
\[
H_d = X(X'X + I_p)^{-1}(X'X + dI_p)(X'X)^{-1}X' \tag{5.2}
\]
is the LE hat matrix (see Liu, 1993; Walker and Birch, 1988). It is worth noting
that $H_d$ is not idempotent, because it is not a projection matrix; it is therefore called a
quasi-projection matrix.
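The quasi-projection property can be illustrated with a short sketch. The data are simulated and liu_hat() is an illustrative helper, both assumptions made only for this example.

```r
set.seed(3)
n <- 30
X   <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
XtX <- crossprod(X)
Ip  <- diag(ncol(X))

# Eq. (5.2): Liu hat matrix for a given d (illustrative helper)
liu_hat <- function(d) X %*% solve(XtX + Ip) %*% (XtX + d * Ip) %*% solve(XtX) %*% t(X)

H1  <- liu_hat(1)      # d = 1 gives the usual OLS projection matrix
H05 <- liu_hat(0.5)

max(abs(H1 %*% H1 - H1))       # numerically zero: idempotent
max(abs(H05 %*% H05 - H05))    # clearly non-zero: not idempotent
sum(diag(H05))                 # trace(H_d), smaller than p
```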
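Because $\hat\beta_d$ is computed on centered variables, the coefficients need to be transformed
back to the original scale, that is,
\[
\hat\beta_d = \left( \frac{\hat\beta_{dj}}{S_{x_j}} \right) . \tag{5.3}
\]
The intercept term for the LE ($\hat\beta_{0d}$) can be estimated using the following relation,
\[
\hat\beta_{0d} = \bar{y} - (\hat\beta_{1d}, \cdots, \hat\beta_{pd})\,\bar{x}'
              = \bar{y} - \sum_{j=1}^{p} \bar{x}_j \hat\beta_{jd} . \tag{5.4}
\]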
5.2 Properties of Liu Estimators
Like the linear RR, the LE is one of the most popular biased estimation methods
because of its relation to the OLS, and its statistical properties have been studied
by Akdeniz and Kaciranlar (1995, 2001), Arslan and Billor (2000), Kaciranlar and
Sakallıoğlu (2001), Kaciranlar et al. (1999) and Sakallıoğlu et al. (2001) among many
others. Because of these comprehensive properties of the LE, researchers have been attracted
towards this area of research.
For d = 1, $\hat\beta_d = \hat\beta_{ols}$. The LE is therefore a shrinkage estimator which, though biased,
has a lower MSE than the OLSE for suitable values of d, that is, $MSE(\hat\beta_d) < MSE(\hat\beta)$.
Let $X_j$ denote the jth column of X (j = 1, 2, · · · , p), where $X_j = (x_{1j}, x_{2j}, \cdots, x_{nj})'$. As
already discussed, the regressors are centered; thus the intercept will be zero and
can thereby be removed from the model. However, it can be estimated from the relation
given in Eq. (5.4).
5.2.1 Mean of βd
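Since $\hat\beta_d$ is biased, the expected value of Eq. (5.1) is
\[
E(\hat\beta_d) = F_d\,\beta = (X'X + I_p)^{-1}(X'X + dI_p)\,\beta ,
\]
\[
E(\hat\alpha_d) = (\Lambda + I_p)^{-1}(\Lambda + dI_p)\,\alpha .
\]
When d = 1, $\hat\beta_d = \hat\beta_{ols}$; for other values of d, $\hat\beta_d$ is a biased shrinkage estimate.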
5.2.2 Var-Cov matrix of βd
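The Var-Cov (dispersion) matrix of $\hat\beta_d$ from Kaciranlar et al. (1999) is
\[
\mathrm{Cov}(\hat\beta_d) = \sigma^2 F_d (X'X)^{-1} F_d' ,
\]
\[
\mathrm{Cov}(\hat\alpha_d) = \sigma^2 (\Lambda + I_p)^{-1}(\Lambda + dI_p)\Lambda^{-1}(\Lambda + dI_p)(\Lambda + I_p)^{-1} ,
\]
where $\sigma^2$ is computed with df from Hastie and Tibshirani (1990).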
5.2.3 MSE of βd
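The MSE of $\hat\beta_d$ is obtained from
\[
MSE(\hat\beta_d) = \mathrm{trace}\,\mathrm{Cov}(\hat\beta_d) + \| E(\hat\alpha_d) - \alpha \|^2 ,
\]
\[
MSE(\hat\beta_d) = \sigma^2 F_d (X'X)^{-1} F_d' + (F_d - I_p)\beta\beta'(F_d - I_p)' ,
\]
\[
MSE(\hat\alpha_d) = \sigma^2 \sum_{j=1}^{p} \frac{(\lambda_j + d)^2}{\lambda_j(\lambda_j + 1)^2}
                  + (d - 1)^2 \sum_{j=1}^{p} \frac{\alpha_j^2}{(\lambda_j + 1)^2} .
\]
Liu (1993) showed that the MSE of the LE is less than that of the OLSE for some d
with 0 < d < 1, for all values of β and σ².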
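The canonical MSE expression can be evaluated directly. The sketch below is illustrative only: liu_mse() is a hypothetical helper, and the eigenvalues, canonical coefficients and error variance are made-up values treated as known.

```r
# liu_mse() is an illustrative helper; lambda, alpha and sigma2 are assumed known here
liu_mse <- function(d, lambda, alpha, sigma2) {
  variance <- sigma2 * sum((lambda + d)^2 / (lambda * (lambda + 1)^2))
  bias2    <- (d - 1)^2 * sum(alpha^2 / (lambda + 1)^2)
  c(variance = variance, bias2 = bias2, mse = variance + bias2)
}

lambda <- c(50, 5, 0.05)   # one near-zero eigenvalue mimics collinearity
alpha  <- c(1, 0.5, 0.3)
sigma2 <- 1

liu_mse(1,   lambda, alpha, sigma2)   # d = 1: OLSE, zero bias, variance sigma2 * sum(1 / lambda)
liu_mse(0.5, lambda, alpha, sigma2)   # some bias in exchange for a much smaller variance
```

With these made-up values the variance term is dominated by the smallest eigenvalue, which is exactly the component that shrinking d below 1 reduces.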
5.2.4 Linear Transformation
The LE ($\hat\beta_d$) is a linear transformation of the OLSE ($\hat\beta$), that is,
\[
\hat\beta_d = F_d\,\hat\beta ,
\]
where $F_d = (X'X + I_p)^{-1}(X'X + dI_p)$.
5.2.5 Wide Range of Biasing Parameter
There always exists a wide range of biasing parameter d having smaller MSE than
the OLS.
5.2.6 Optimal value of d
There always exists an optimal Liu parameter d (say $d_{opt}$) which gives the minimum
MSE (see Akdeniz and Kaciranlar, 1995; Liu, 1993; Sakallıoğlu et al., 2001, among
many others).
5.2.7 Effective Degrees of Freedom (EDF)
The EDF makes it possible to interpret the impact of the penalty. The hat matrix from Eq. (5.2)
can be used to compute the EDF, defined as
\[
df_{L_d} = \mathrm{trace}[H_d] .
\]
The error df is given by $n - \mathrm{trace}(2H_d - H_dH_d')$ and is used in the denominator of the
$\sigma^2$ estimate. Therefore, $\mathrm{trace}(2H_d - H_dH_d')$ is the effective number of parameters
in the error df.
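A minimal sketch of these quantities on simulated data (an assumption for illustration) is given below; it also uses the fact that the trace of $H_d$ can be written in terms of the eigenvalues of X'X as $\sum_j (\lambda_j + d)/(\lambda_j + 1)$.

```r
set.seed(4)
n <- 25; p <- 3
X   <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
XtX <- crossprod(X); Ip <- diag(p)
d   <- 0.4

# Liu hat matrix, Eq. (5.2)
H <- X %*% solve(XtX + Ip) %*% (XtX + d * Ip) %*% solve(XtX) %*% t(X)

edf      <- sum(diag(H))                         # trace(H_d)
lambda   <- eigen(XtX, symmetric = TRUE)$values
edf_eig  <- sum((lambda + d) / (lambda + 1))     # same trace via eigenvalues
error_df <- n - sum(diag(2 * H - H %*% t(H)))    # n - trace(2H - HH')

c(edf = edf, edf_eig = edf_eig, error_df = error_df)
```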
5.3 Methods of Selecting values of d
The existing methods to select the biasing parameter in the RR may not fully address
the problem of ill-conditioning when there exists severe multicollinearity, while the
appropriate selection of the biasing parameter d also remains a problem of interest. The
biasing parameter d should be selected such that the estimates improve
(become stable) or prediction is improved.
5.3.1 MSE Minimizer
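Liu (1993) proposed to find the optimal d that minimizes the MSE, such that
$MSE(\hat\beta_{1d}) < MSE(\hat\beta_1)$:
\[
d_{opt} = \frac{\sum_{j=1}^{p} \left[ \dfrac{\alpha_j^2 - \sigma^2}{(\lambda_j + 1)^2} \right]}
               {\sum_{j=1}^{p} \left[ \dfrac{\sigma^2 + \lambda_j\alpha_j^2}{\lambda_j(\lambda_j + 1)^2} \right]} .
\]
The $MSE(\hat\alpha_d)$ is minimized at $d_{opt}$. Replacing $\alpha_j^2$ and $\sigma^2$ by their unbiased
estimates $\hat\alpha_j^2 - \hat\sigma^2/\lambda_j$ and $\hat\sigma^2$, respectively, gives the estimate of
$d_{opt}$ that Liu (1993) calls the minimum MSE estimate,
\[
\hat{d} = 1 - \hat\sigma^2 \,
          \frac{\sum_{j=1}^{p} \dfrac{1}{\lambda_j(\lambda_j + 1)}}
               {\sum_{j=1}^{p} \dfrac{\hat\alpha_j^2}{(\lambda_j + 1)^2}} .
\]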
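The two expressions above can be sketched as small R functions and checked against a numerical minimizer of the canonical MSE. The helper names d_opt(), d_mm() and mse_d() and the input values are hypothetical and used only for illustration; the functions follow the formulas as written in this section.

```r
# Illustrative helpers following the formulas of this section (not package functions)
d_opt <- function(lambda, alpha, sigma2) {
  sum((alpha^2 - sigma2) / (lambda + 1)^2) /
    sum((sigma2 + lambda * alpha^2) / (lambda * (lambda + 1)^2))
}
d_mm <- function(lambda, alpha_hat, sigma2_hat) {
  1 - sigma2_hat * sum(1 / (lambda * (lambda + 1))) /
    sum(alpha_hat^2 / (lambda + 1)^2)
}
# canonical MSE of the LE as a function of d
mse_d <- function(d, lambda, alpha, sigma2) {
  sigma2 * sum((lambda + d)^2 / (lambda * (lambda + 1)^2)) +
    (d - 1)^2 * sum(alpha^2 / (lambda + 1)^2)
}

lambda <- c(40, 4, 0.5); alpha <- c(2, 1.5, 1.2); sigma2 <- 1
d_star <- d_opt(lambda, alpha, sigma2)

# numerical check: the closed form agrees with a one-dimensional minimizer
d_num <- optimize(mse_d, interval = c(-5, 5), lambda = lambda,
                  alpha = alpha, sigma2 = sigma2)$minimum
c(closed_form = d_star, numerical = d_num)
```

Since the canonical MSE is quadratic in d, the closed form and the numerical minimizer should coincide up to the tolerance of optimize().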
The biasing parameter of the improved Liu estimator (ILE), proposed by Liu
(2011) using the PRESS criterion, is
\[
d_{imp} = \frac{\sum_{i=1}^{n} \dfrac{e'_i}{1 - g_{ii}}
                \left( \dfrac{e_i}{1 - h_{1-ii}} - \dfrac{e_i}{1 - h_{ii}} \right)}
               {\sum_{i=1}^{n} \left( \dfrac{e'_i}{1 - g_{ii}} - \dfrac{e_i}{1 - h_{ii}} \right)^2} ,
\]
where $e_i = y_i - x_i'(X'X - x_ix_i')^{-1}(X'y - x_iy_i)$ and
$e'_i = y_i - x_i'(X'X + I_p - x_ix_i')^{-1}(X'y - x_iy_i)$, with $g_{ii}$ and $h_{ii}$ the ith
diagonal elements of the hat matrices defined as $G = X(X'X + I_p)^{-1}X'$ and
$H = X(X'X)^{-1}X'$, respectively. Though the ILE is biased, it yields a lower MSE than the LE.
In the literature, many methods for the selection of an appropriate biasing parameter d have
been studied by Akdeniz and Özkale (2005), Arslan and Billor (2000), Akdeniz et al.
(2006), Özkale and Kaciranlar (2007) and Liu (1993).
5.3.2 Performance Criteria
The value of the biasing parameter d must be selected such that $\hat\beta_d$ improves on $\hat\beta$ in the
sense of MSE. In other words, there always exists a certain value of d such that
$MSE(\hat\beta_d) < MSE(\hat\beta)$. Liski (1982) proposed the MSE as a powerful criterion to select
between a shrinkage estimator and the LSE. The MSE of the LE is,
\[
\begin{aligned}
MSE(\hat\beta_d) &= \mathrm{trace}\,\mathrm{Cov}(\hat\beta_d) + \| E(\hat\beta_d) - \beta \|^2 , \\
                 &= \sigma^2 \sum_{j=1}^{p} \frac{(\lambda_j + d)^2}{\lambda_j(\lambda_j + 1)^2}
                  + (d - 1)^2 \sum_{j=1}^{p} \frac{\alpha_j^2}{(\lambda_j + 1)^2} .
\end{aligned}
\]
Liu (1993) showed that the proposed estimator (with $d_{opt}$) is superior to the OLSE in the
sense of both scalar and matrix MSE.
For prediction purposes, it is appropriate to use prediction-oriented criteria for the
selection of d. Liu (2011) derived the improved Liu estimator (ILE) and recommended the
PRESS criterion for the selection of the optimal d.
• PRESS
To improve the prediction performance of the model, minimizing the PRESS
is another alternative, suggested by Allen (1974) and further developed by Golub
et al. (1979). A small value of the PRESS statistic represents a model having smaller
MSE, and hence the regression model will provide good predictions of new
observations.
For the PRESS statistic, Liu (1993) did not give the minimizer for the LE; however,
Özkale and Kaciranlar (2007) proposed the PRESS statistic for the LE and
also provided an example based on the Portland cement data.
The PRESS statistic for the LE is
\[
\begin{aligned}
PRESS_d &= \sum_{i=1}^{n} \left( e_{d(i)} \right)^2 , \\
        &= \sum_{i=1}^{n} \left[ (1 - d)\,\frac{e_{R_i}(1)}{1 - h_{1-ii}} + d\,\frac{e_i}{1 - h_{ii}} \right]^2 .
\end{aligned}
\]
• CL Criteria
Liu (1993) also provided other estimates of d by analogy with the estimate of k
in the RR and suggested the CL statistic, like Cp of Mallows (1973), for the selection
of the d that minimizes the CL value,
\[
C_L = \frac{SSR_d}{\hat\sigma^2} + 2\,\mathrm{trace}(H_d) - (n - 2) ,
\]
where $SSR_d$ is the SSR from the Liu regression at a specific d, $\hat\sigma^2$ is the estimate of $\sigma^2$
from the LS regression and $H_d$ is the hat matrix of the LE given in Eq. (5.2). Since CL is
a quadratic function of d, the minimum of CL can be obtained by using,
\[
d_{CL} = 1 - \hat\sigma^2 \,
         \frac{\sum_{i=1}^{p} \dfrac{1}{\lambda_i + 1}}
              {\sum_{i=1}^{p} \dfrac{\lambda_i\hat\alpha_i^2}{(\lambda_i + 1)^2}} .
\]
Plotting CL against a range of d values shows the d at which CL is minimized.
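• GCV
For the LE, the prediction-oriented method GCV is another good method to select an
appropriate d,
\[
GCV_d = \frac{SSR_d}{\left[ n - [1 + \mathrm{trace}(H_d)] \right]^2} .
\]
According to Liu (1993), the minimizer of CL and $GCV_d$ is the same, and d can be selected
by minimizing GCV and CL, respectively. The minimizer $d_{CL}$ of GCV and CL
is,
\[
d_{CL} = 1 - \hat\sigma^2 \,
         \frac{\sum_{i=1}^{p} \dfrac{1}{\lambda_i + 1}}
              {\sum_{i=1}^{p} \dfrac{\lambda_i\hat\alpha_i^2}{(\lambda_i + 1)^2}} .
\]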
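The CL and GCV criteria can be evaluated numerically as a function of d. The sketch below works on simulated, centered data without an intercept (an assumption made for simplicity), uses the hypothetical helper criteria(), and compares the closed-form $d_{CL}$ with a numerical minimizer of CL; the GCV minimizer is reported alongside without any claim about how closely it agrees.

```r
set.seed(5)
n <- 60; p <- 3
z <- rnorm(n)
X <- scale(cbind(z + rnorm(n, sd = 0.3), z + rnorm(n, sd = 0.3), rnorm(n)),
           center = TRUE, scale = FALSE)
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n)); y <- y - mean(y)

XtX    <- crossprod(X); Ip <- diag(p)
bols   <- drop(solve(XtX, crossprod(X, y)))
sigma2 <- sum((y - X %*% bols)^2) / (n - p)      # LS estimate of sigma^2

# CL and GCV at a given d (illustrative helper)
criteria <- function(d) {
  Fd  <- solve(XtX + Ip, XtX + d * Ip)
  ssr <- sum((y - X %*% (Fd %*% bols))^2)        # SSR_d
  trH <- sum(diag(X %*% Fd %*% solve(XtX, t(X))))  # trace(H_d)
  c(CL  = ssr / sigma2 + 2 * trH - (n - 2),
    GCV = ssr / (n - (1 + trH))^2)
}

d_cl_num  <- optimize(function(d) criteria(d)[["CL"]],  c(-10, 10))$minimum
d_gcv_num <- optimize(function(d) criteria(d)[["GCV"]], c(-10, 10))$minimum

# closed-form minimizer of CL in canonical form
eig    <- eigen(XtX, symmetric = TRUE)
lambda <- eig$values
alpha  <- drop(crossprod(eig$vectors, bols))
d_cl   <- 1 - sigma2 * sum(1 / (lambda + 1)) / sum(lambda * alpha^2 / (lambda + 1)^2)

c(closed_form = d_cl, CL_numerical = d_cl_num, GCV_numerical = d_gcv_num)
```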
5.3.3 Using Information Criteria
Information criteria such as AIC and BIC can be used to get some idea about an
appropriate selection of d. Estimates of AIC and BIC can be computed as given
below,
\[
AIC = n\log(RSS) + 2\,df ,
\]
\[
BIC = n\log(RSS) + df\,\log(n) ,
\]
where df is the degrees of freedom, computed by following Hastie and Tibshirani
(1990), see Sections (5.2.7) and (4.6.16).
5.3.4 Subjective Methods
For selection of the biasing parameter d, many methods are available in the literature.
Graphical evidence of the effect of multicollinearity on the regression coefficients, and
of the variation accounted for by the LE compared to the LSE, can be obtained by graphing
the Liu coefficients, MSE, variance and bias against the biasing parameter d.
• Liu Trace
Plotting the Liu coefficients as a function of the biasing parameter d can be used
to depict the effect of collinearity on each of the coefficients. Around the optimal d
the trace appears more stable than at smaller or larger values of d.
• Bias Variance trade-off
Plotting the bias, variance and MSE of the LE can be helpful in selecting an
appropriate value of d. At the cost of some bias, the optimal d can be selected as the
value at which the MSE is minimum.
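A Liu trace can be drawn with base R graphics alone. The following is a minimal sketch on simulated data; the data, the grid of d values and the object names are assumptions for illustration and do not reproduce the plot method of any package.

```r
set.seed(7)
n <- 40
z <- rnorm(n)
X <- scale(cbind(X1 = z + rnorm(n, sd = 0.2), X2 = z + rnorm(n, sd = 0.2), X3 = rnorm(n)),
           center = TRUE, scale = FALSE)
y <- drop(X %*% c(1, 1, -1) + rnorm(n)); y <- y - mean(y)

XtX  <- crossprod(X); Ip <- diag(ncol(X))
bols <- solve(XtX, crossprod(X, y))
d_grid <- seq(0, 1, by = 0.01)

# one row of Liu coefficients per value of d, Eq. (5.1)
trace_mat <- t(sapply(d_grid, function(d)
  drop(solve(XtX + Ip, (XtX + d * Ip) %*% bols))))

matplot(d_grid, trace_mat, type = "l", lty = 1, col = 1:3,
        xlab = "Liu biasing parameter d", ylab = "Coefficient",
        main = "Liu trace (sketch)")
abline(h = 0, lty = 2)
legend("topright", legend = colnames(X), col = 1:3, lty = 1, cex = 0.7)
```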
5.4 Estimation and Testing of Liu Coefficients
Testing of the Liu coefficients is performed by following Aslam (2014b) and Halawa and
El-Bassiouni (2000). For testing $H_0: \beta_{dj} = 0$ against $H_1: \beta_{dj} \neq 0$, the non-exact
t-statistic defined by Halawa and El-Bassiouni (2000) is
\[
T_{dj} = \frac{\hat\beta_{dj}}{SE(\hat\beta_{dj})} ,
\]
where $\hat\beta_{dj}$ is the jth Liu coefficient and $SE(\hat\beta_{dj})$ is an estimate of its standard error,
which is the square root of the jth diagonal element of the covariance matrix from
Section (5.2.2).
The statistic $T_{dj}$ is assumed to follow Student's t distribution with (n − p) df
(Halawa and El-Bassiouni, 2000). Hastie and Tibshirani (1990) and Cule and
De Iorio (2012) suggested using df from (n − tr($H_d$)). For large sample sizes, the
asymptotic distribution of this statistic is normal (Halawa and El-Bassiouni, 2000).
Thus $H_0$ is rejected when $|T_{dj}| > Z_{1-\alpha/2}$.
For testing the significance of the LE $\hat\beta_d$ with $E(\hat\beta_d) = F_d\beta$ and $\mathrm{Cov}(\hat\beta_d)$, the F-statistic
is
\[
F = \frac{1}{p}\,(\hat\beta_d - F_d\beta)'\left(\mathrm{Cov}(\hat\beta_d)\right)^{-1}(\hat\beta_d - F_d\beta) .
\]
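The t-type test described above can be sketched as follows. This is a hedged illustration on simulated, centered data: liu_test() is a hypothetical helper, the error variance uses the error df of Section 5.2.7, and the p-values rely on the large-sample normal approximation; none of this is taken from the package source.

```r
set.seed(6)
n <- 40; p <- 3
z <- rnorm(n)
X <- scale(cbind(z + rnorm(n, sd = 0.3), z + rnorm(n, sd = 0.3), rnorm(n)),
           center = TRUE, scale = FALSE)
colnames(X) <- paste0("X", 1:p)
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n)); y <- y - mean(y)

liu_test <- function(X, y, d) {        # illustrative helper, not a package function
  n <- nrow(X); p <- ncol(X)
  XtX  <- crossprod(X); Ip <- diag(p)
  bols <- solve(XtX, crossprod(X, y))
  Fd   <- solve(XtX + Ip, XtX + d * Ip)
  bd   <- drop(Fd %*% bols)                          # Liu coefficients, Eq. (5.1)
  H    <- X %*% Fd %*% solve(XtX, t(X))              # hat matrix, Eq. (5.2)
  df_err <- n - sum(diag(2 * H - H %*% t(H)))        # error df, Section 5.2.7
  sigma2 <- sum((y - X %*% bd)^2) / df_err
  covd <- sigma2 * Fd %*% solve(XtX) %*% t(Fd)       # Var-Cov, Section 5.2.2
  se   <- sqrt(diag(covd))
  tval <- bd / se
  cbind(Estimate = bd, StdErr = se, t = tval,
        p = 2 * pnorm(-abs(tval)))                   # normal approximation
}

liu_test(X, y, d = 0.5)
```

At d = 1 the helper reduces to the usual OLS standard errors for the no-intercept model, which gives a simple sanity check of the implementation.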
5.5 R Package Development and Implementation
In this section, we illustrate the development and implementation of linear
Liu regression, estimation of the Liu biasing parameter d, testing of the Liu coefficients
and different Liu related measures by applying them to a data set. The biasing
parameters are computed by following Liu (1993). Note that if an intercept is present in
the model, it will not be penalized; however, it will be estimated by the relation
defined in Eq. (5.4).
We developed the liureg package in the R language; it contains functions related to
fitting the linear Liu regression model. The liureg package estimates the coefficients of the
Liu regression and also performs testing of these coefficients. This package also
computes different properties of the LE, biasing parameters, residuals, fitted and predicted
values, and graphical representations of the variation in bias, variance and MSE of the LE etc.
In the following section, we discuss the use and development of the liureg package
following Chapter 2. The procedure of liureg package development is similar to that of the
package developed for RR in Chapter 2. Therefore, only the parts that differ from
ridge are presented, to avoid duplication of programming code. The programming code
for the liureg package shown here covers only the basic computation of the biasing parameter d,
and only partial coding of some functions is presented.
5.5.1 Liu Package Development
To compute $\hat\beta_d$, the regressors and the response variable are centered as suggested by Liu
(1993). However, in the liureg package one can also standardize or scale the
regressors instead of centering them.
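liuest <- function(formula, data, d = 1.0,
                   scaling = c("centered", "sc", "scaled"), ...) {
  # fall back to d = 1 (the OLS solution) when d is NULL
  if (is.null(d))
    d <- 1
  # ... (intermediate code omitted in the original listing) ...
  # OLS coefficients on the scaled regressors
  bols <- lm.fit(X, as.matrix(Y))$coefficients
  # Liu coefficients from Eq. (5.1), one column per value of d
  coef <- lapply(d, function(d)
    (solve(t(X) %*% X + diag(p)) %*% (t(X) %*% X + d * diag(p))) %*% bols)
  coef <- do.call(cbind, coef)
  rownames(coef) <- colnames(X)
  colnames(coef) <- paste("d=", d, sep = "")
  # Liu fitted values for each d
  lfit <- apply(coef, 2, function(x) X %*% x)
  list(coef = coef, xscale = Xscale, xs = X, Inter = Inter, xm = Xm, y = Y,
       scaling = scaling, call = match.call(), d = d, lfit = lfit, mf = mf)
}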
The liuest() function computes estimates of the Liu coefficients and other required
statistics, such as the Liu fitted values and scaling of the regressors, through a formula
interface. To perform further computation on objects, the main function liu()
is made generic and has a default method (liu.default) whose first argument is a
design matrix (or some other data structure which can be converted to a matrix).
This default method calls liuest() and returns an object of class
"liu".
As in the lmridge package, the Liu coefficients are rescaled using the relation described
in Eq. (5.3). Similarly, the print() method prints the Liu coefficients for each biasing
parameter d in standard format. For estimation and testing of the Liu coefficients, the
summary() method is used to summarize the model for each given value of d. The
output of the summary method contains 5 columns, namely "Estimate", "Estimate
(Sc)", "StdErr (Sc)", "t-val (Sc)" and "Pr(>|t|)", for the given biasing parameter(s).
For computation of the different d values proposed by Liu (1993), the dest() function
can be used. The R code for the computation is,
dest <- function(object, ...)
  UseMethod("dest")

dest.liu <- function(object, ...) {
  x <- object$xs
  y <- object$y
  p <- ncol(x)
  n <- nrow(x)
  d <- object$d
  # eigenvalues and eigenvectors of X'X (computed once)
  eig <- eigen(t(x) %*% x)
  EVal <- eig$values
  EVec <- eig$vectors
  # OLS fit on the scaled regressors
  ols <- lm.fit(x, y)
  coefols <- ols$coefficients
  fittedols <- ols$fitted.values
  residols <- ols$residuals
  sigma2 <- sum(residols ^ 2) / (n - p)
  # OLS coefficients in canonical form
  alphaols <- t(EVec) %*% coefols
  rownames(alphaols) <- colnames(x)
  # diagonal of the Liu hat matrix, one column per d
  diaghat <- lapply(hatl(object), function(x) diag(x))
  diaghat <- do.call(cbind, diaghat)
  SSER <- lstats(object)$SSER
  # GCV for each d
  GCV <- matrix(0, nrow = length(d), ncol = 1)
  for (i in seq(length(d)))
    GCV[i, ] <- SSER[i] / (n - 1 - sum(diaghat[, i])) ^ 2
  rownames(GCV) <- paste("d=", d, sep = "")
  colnames(GCV) <- c("GCV")
  if (length(GCV) > 0) {
    l <- seq_along(GCV)[GCV == min(GCV)]
    dGCV <- object$d[l]
  }
  # d_opt of Section 5.3.1
  dopt <- (sum((alphaols ^ 2 - sigma2) / (EVal + 1) ^ 2)) /
    (sum((sigma2 + EVal * alphaols ^ 2) / (EVal * (EVal + 1) ^ 2)))
  # dmm and dcl are kept exactly as printed in the listing; note that the
  # formulas in Section 5.3 read 1 - sigma2 * sum(.) / sum(.), whereas the
  # expressions below group the terms as (1 - sigma2) * sum(.) / sum(.)
  numdmm <- 1 / (EVal * (EVal + 1))
  dnumdmm <- (alphaols ^ 2) / ((EVal + 1) ^ 2)
  dmm <- (1 - sigma2) * sum(numdmm) / sum(dnumdmm)
  numdcl <- 1 / (EVal + 1)
  dnumdcl <- (EVal * alphaols ^ 2) / ((EVal + 1) ^ 2)
  dcl <- (1 - sigma2) * sum(numdcl) / sum(dnumdcl)
  desti <- list(dmm = dmm, dcl = dcl, GCV = GCV, dGCV = dGCV, dopt = dopt,
                sigma2 = sigma2)
  class(desti) <- "dliu"
  desti
}

print.dliu <- function(x, ...) {
  cat("Liu biasing parameter d\n")
  dest <- cbind(dmm = x$dmm, dcl = x$dcl, dopt = x$dopt, dGCV = x$dGCV)
  rownames(dest) <- "d values"
  colnames(dest) <- c("dmm", "dcl", "dopt", "min GCV at")
  print(t(round(dest, 5)), ...)
}
Following is the complete list of functions available in the liureg package for computation
of the LE related statistics and the optimal value of d from Liu (1993).
Table 5.1: Functions and methods available in liureg package.
Functions        Description
liuest()         The main model fitting function for implementation of the Liu regression models in R.
liu()            Generic function and default method that calls the liuest function and returns an object of S3 class "liu" with a different set of methods to standard generics. It has a print method to display de-scaled Liu coefficients.
summary()        Standard Liu regression output (coefficient estimates, scaled coefficient estimates, standard errors, t-values and p-values); returns an object of class "summaryliu" containing the relative summary statistics and has a print() method.
coef()           Computes re-scaled Liu coefficients, see Eq. (5.3).
vcov()           Displays the variance-covariance matrix associated with each value of the Liu parameter d.
predict()        Produces predicted value(s) obtained by evaluating the liuest function in the frame newdata.
fitted()         Displays the Liu fitted values for the observed data.
residuals()      Displays the Liu residual values.
dest()           Displays various d (biasing parameter) values from Liu (1993).
lstats()         Displays different statistics of the Liu regression such as mse, bias, etc.
plot()           Liu trace of the biasing parameter d.
plot.biasliu()   Bias, variance trade-off plot as a function of d.
plot.infoliu()   Plot of AIC and BIC against d.
hatl()           Displays the hat matrix from the Liu regression.
infoliu()        Computes the information criteria AIC and BIC.
For selection of an appropriate d, the bias-variance trade-off plot can be produced by
the following function
plot.biasliu <- function(x, abline = TRUE, ...) {
  # variance, squared bias and MSE for each value of d
  bias2 <- lstats(x)$bias2
  var <- lstats(x)$var
  mse <- lstats(x)$mse
  minmse <- min(mse)
  mind <- x$d[which.min(mse)]
  col <- c("black", "red", "green")
  liutrace <- cbind(var, bias2, mse)
  if (length(x$d) == 1) {
    # a single d: plot the three quantities as points
    plot(x = rep(x$d, length(liutrace)), y = liutrace,
         main = "Bias, Variance Trade-off",
         xlab = "Liu Biasing Parameter", ylab = "", col = col, lwd = 2,
         lty = c(1, 4, 5))
    legend("topright", legend = c("var", "bias^2", "mse"), col = col, lwd = 2,
           fill = 1:3, lty = c(1, 4, 5), cex = .6, pt.cex = .7, bty = "o",
           y.intersp = .7)
  } else {
    # several d values: plot the three curves against d
    matplot(x$d, liutrace, main = "Bias, Variance Trade-off",
            xlab = "Liu Biasing Parameter",
            col = col, lwd = 2, lty = c(1, 4, 5), type = 'l')
    legend("topright", legend = c("var", "bias^2", "mse"), col = col, lwd = 2,
           fill = 1:3, lty = c(1, 4, 5), cex = .6, pt.cex = .5, bty = "o",
           y.intersp = .7)
  }
  if (abline) {
    # mark the d with minimum MSE
    abline(v = mind, lty = 2)
    abline(h = minmse, lty = 2)
    text(mind, max(lstats(x)$mse), paste("mse=", round(minmse, 3)),
         col = "blue", pos = 1)
    text(mind, minmse, paste("d=", mind), pos = 4, col = "blue")
  }
}
5.5.2 The Liu Package Implementation
The liureg package must be installed and loaded into memory so that its
functions become accessible in the current R session, that is,
> library(liureg)
For a complete list of functions available in the liureg package, use the command,
> help("liu")
The liureg package contains a set of standard methods such as print(), summary(),
plot() and predict() etc., and its fitting function returns objects of class "liu". Inferences
can be made easily by using the summary() method for assessing the regression coefficients,
their standard errors, t-values and their respective p-values. The function liu() is
the default function that calls liuest(), which performs the computation for the given values
of the biasing parameter d. The syntax is,
liu(formula , data , scaling , d, ...)
The arguments for liu() are:
Table 5.2: Description of liureg Package's arguments
Argument   Description
formula    A symbolic representation of the linear Liu regression model, of the form response variable ∼ predictors.
data       A data frame that contains the variables to be used in the model.
d          Biasing parameter, may be a scalar or a vector. If a d value is not provided, d = 1 will be used as the default value, that is, linear regression (d = 1) results will be produced.
scaling    The method for scaling the predictors. The "centered" option centers the predictors and is the default scaling option; the "sc" option scales the predictors in correlation form as described in Belsley (1991) and Draper and Smith (1998); and the "scaled" option standardizes the predictors to have zero mean and unit variance. Note that the response variable is "centered" for all scaling options.
The liu() function returns an object of class "liu". The functions summary(), dest() and
lstats() etc. can be used to compute and print (display) a summary of the Liu
regression results, a list of biasing parameters and the optimal d by following Liu (1993),
respectively, after bias is introduced into the regression model. An object of class "liu"
is a list that contains the following components:
Table 5.3: Objects from "liu" class
Object    Description
coef      A vector of scaled fitted Liu coefficients.
xs        The scaled matrix of predictors according to the scaling option in the liu() function.
Inter     Whether an intercept is included in the model or not.
d         The Liu regression (biasing) parameter(s); it can be a scalar or a vector.
xm        A vector of means of the design matrix X.
ym        The mean of the response variable.
scaling   The method of scaling used to standardize the predictors.
5.5.2.1 Numerical Results
A numerical example is presented for Hald's data set to explain the use of liureg
to produce linear Liu regression results. The Hald data is already included in the package
and can be used with the code given below,
> data(Hald)
> mod <- liu(y~X1+X2+X3+X4 , data=as.data.frame(Hald), scaling="centered", d=seq(0, 1, 0.01) )
> mod
Call:
liu.default(formula = y~X1+X2+X3+X4, data=as.data.frame(Hald),
scaling = "centered", d = seq(0, 1, 0.01))
Intercept X1 X2 X3 X4
d=0 75.01755 1.41348 0.38190 -0.03582 -0.27032
d=0.01 74.89142 1.41486 0.38318 -0.03445 -0.26905
d=0.49 68.83758 1.48092 0.44475 0.03167 -0.20845
d=0.5 68.71146 1.48229 0.44603 0.03304 -0.20719
d=0.9 63.66659 1.53734 0.49734 0.08814 -0.15669
d=1 62.40537 1.55110 0.51017 0.10191 -0.14406
The output of the linear LR from the liureg package is assigned to the object mod. The first
argument of the function liu() is "formula", which is used to specify the required linear LR
model for the data provided as the second argument, while the regressors are scaled according
to the third argument. Typing "mod" at the R prompt yields a list of de-scaled Liu
coefficients for each biasing parameter provided in the fourth argument.
To get the scaled Liu coefficients, use mod$coef and the results will be displayed in the R
console. A few lines of the output are
> round(mod$coef , 5)
d=0 d=0.01 d=0.49 d=0.5 d=0.9 d=1
X1 1.41348 1.41486 1.48092 1.48229 1.53734 1.55110
X2 0.38190 0.38318 0.44475 0.44603 0.49734 0.51017
X3 -0.03582 -0.03445 0.03167 0.03304 0.08814 0.10191
X4 -0.27032 -0.26905 -0.20845 -0.20719 -0.15669 -0.14406
The object of class "liu" contains components such as lfit, d and coef etc. For a
fitted model, the generic method summary() is used to investigate the
Liu coefficients. The parameter estimates of the Liu model are summarized in a matrix
with 5 columns, namely Estimate, Estimate (Sc), StdErr (Sc),
t-val (Sc) and Pr(>|t|). The following results are shown only for the biasing parameter
d = −1.47218, a value at which minimum MSE occurs.
> summary(mod)
Call:
liu.default(formula = y ~ X1+X2+X3+X4, data= as.data.frame(Hald),
scaling="centered", d = -1.47218)
Coefficients for Liu parameter d= -1.47218
Estimate Estimate (Sc) StdErr (Sc) t-val (Sc) Pr(>|t|)
Intercept 93.5849 93.5849 17.5448 5.334 NA
X1 1.2109 1.2109 0.2711 4.466 7.97e-06 ***
X2 0.1931 0.1931 0.2595 0.744 0.4568
X3 -0.2386 -0.2386 0.2671 -0.893 0.3717
X4 -0.4562 -0.4562 0.2507 -1.820 0.0688 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Liu Summary
R2 adj -R2 F AIC BIC MSE
d= -1.47218 0.9819 0.8372 127.8 23.95 59.18 1.39
The function dest() computes the different biasing parameters suggested by Liu (1993)
as MSE minimizers of the LE. The biasing parameters by Liu include dCL, dmm and dopt.
For appropriate selection of d, GCV is also computed. The argument of dest() is an
object of class "liu", i.e., the object used to store the results from the liu() function,
which in this example is "mod". The output of the function for the Hald data set is,
> dest(mod)
Liu biasing parameter d
d values
dmm -5.61494
dcl -5.66240
dopt -1.47218
min GCV at 0.00000
The lstats() function computes different statistics for each biasing parameter of the LE, such
as MSE, squared bias, F-test, Liu variance, degrees of freedom following Hastie and Tibshirani
(1990), R2 etc. Following are the results for d = −0.06, 0, 0.1, 0.5 and 1.
> lstats(mod)
Liu Regression Statistics:
           EDF Sigma2     CL    VAR Bias^2    MSE        F     R2 adj-R2
d=-0.06 9.0760 5.2989 5.5077 1.0195 0.2050 1.2246 125.8693 0.9823 0.8406
d=0     9.0677 5.3010 5.5315 1.0625 0.1825 1.2450 125.8194 0.9823 0.8407
d=0.1   9.0548 5.3043 5.5722 1.1362 0.1478 1.2840 125.7427 0.9823 0.8408
d=0.5   9.0169 5.3139 5.7488 1.4561 0.0456 1.5018 125.5157 0.9824 0.8412
d=1     9.0000 5.3182 6.0000 1.9119 0.0000 1.9119 125.4141 0.9824 0.8414
minimum MSE occurred at d= -0.06
The residuals, fitted values from the Liu regression and predicted values of the response
variable y can be computed using the functions residuals(), fitted() and
predict(), respectively. The variance-covariance matrix and the hat matrix of the
Liu regression are available through vcov() and hatl(), respectively, and can be used for
computation of other statistics not available in this package. These functions compute the
corresponding statistics for each value of d and can be used as,
> hatl(mod) # hat matrix for each d
> hatl(mod)[[1]] # hat matrix for first value of d
> residuals(mod) # residuals for each d
> fitted(mod) # fitted value for each d
> vcov(mod) # Var -Cov matrix for each d
> diag(hatl(mod)[[1]]) # diagonal elements for first d
> predict(mod) # predicted values for each d
For given values of X, such as the first five rows of the X matrix, the predicted values
will be,
> predict(mod , newdata=as.data.frame(Hald [1:5 ,-1]))
d= -0.06 d=0 d=0.1 d=0.5 d=1
1 78.40208 78.40736 78.41615 78.45130 78.49524
2 72.91968 72.91227 72.89992 72.85053 72.78880
3 106.27656 106.25926 106.23043 106.11510 105.97094
4 89.41842 89.41325 89.40463 89.37017 89.32710
5 95.63443 95.63527 95.63667 95.64226 95.64924
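The prediction step can be sketched by hand: new rows are centered with the training means, multiplied by the Liu coefficients, and the mean of y is added back. This is a minimal illustration assuming centered (unscaled) regressors, with hypothetical object names (beta_liu, newX, pred); it is not the predict() method of the package. Because the Liu estimator equals OLS at d = 1, the d=1 column should agree with the output above whatever scaling is used, whereas the other columns depend on the scaling assumption.

```r
# Hald cement data (13 observations), entered by hand
hald <- data.frame(
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4),
  X1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  X2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  X3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  X4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12)
)
Xraw <- as.matrix(hald[, -1])
xbar <- colMeans(Xraw)
X    <- sweep(Xraw, 2, xbar)               # assumption: centered, not rescaled
y    <- hald$y - mean(hald$y)
p    <- ncol(X)
XtX  <- crossprod(X)
beta_ols <- solve(XtX, crossprod(X, y))

# Liu coefficients, one column per value of d
d_vals   <- c(-0.06, 0, 0.1, 0.5, 1)
beta_liu <- sapply(d_vals, function(d)
  solve(XtX + diag(p), (XtX + d * diag(p)) %*% beta_ols))
colnames(beta_liu) <- paste0("d=", d_vals)

# center the new rows with the training means, then add back the mean of y
newX <- sweep(Xraw[1:5, , drop = FALSE], 2, xbar)
pred <- mean(hald$y) + newX %*% beta_liu
round(pred, 5)
```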
The model selection criteria AIC and BIC can be computed with the infoliu()
function for each value of d passed to the liu() function, that is,
> infoliu(liu(y~ X1+X2+X3+X4, data=as.data.frame(Hald), d=c(-0.06,
0, 0.1, 0.5, 1)))
AIC BIC
d= -0.06 24.43818 59.88178
d=0 24.46352 59.91621
d=0.1 24.50663 59.97446
d=0.5 24.69007 60.21849
d=1 24.94429 60.54843
The model selection criteria are computed by the following function,
infoliu <- function(x, ...) UseMethod("infoliu")
infoliu.liu <- function(object, ...) {
  # residual sum of squares for each value of d
  SSER <- apply(resid(object), 2, function(x) sum(x^2))
  # trace of the hat matrix for each value of d
  df <- as.vector(lapply(hatl(object), function(x) sum(diag(x))))
  n <- nrow(object$xs)
  AIC <- mapply(function(x, y) n * log(x / n) + 2 * y, SSER, df,
                SIMPLIFY = FALSE)
  AIC <- do.call(cbind, AIC)
  BIC <- mapply(function(x, y) n * log(x) + y * log(n), SSER, df,
                SIMPLIFY = FALSE)
  BIC <- do.call(cbind, BIC)
  resinfo <- rbind(AIC, BIC)
  rownames(resinfo) <- c("AIC", "BIC")
  # one row per value of d, with columns AIC and BIC
  t(resinfo)
}
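The two formulas used above, AIC = n log(SSE/n) + 2 df and BIC = n log(SSE) + df log(n) with df the trace of the hat matrix, can be checked in base R without the package. The sketch below assumes centered (unscaled) regressors and builds the Liu hat matrix directly; the object name info is hypothetical. At d = 1 the Liu fit coincides with OLS, so that row should match the d=1 row of the infoliu() output above; agreement of the other rows depends on the scaling assumption.

```r
# Hald cement data (13 observations), entered by hand
hald <- data.frame(
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4),
  X1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  X2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  X3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  X4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12)
)
X <- scale(as.matrix(hald[, -1]), center = TRUE, scale = FALSE)  # assumption: centered only
y <- hald$y - mean(hald$y)
n <- nrow(X); p <- ncol(X)
XtX <- crossprod(X)

d_vals <- c(-0.06, 0, 0.1, 0.5, 1)
info <- t(sapply(d_vals, function(d) {
  # Liu hat matrix: X (X'X + I)^{-1} (X'X + dI) (X'X)^{-1} X'
  H   <- X %*% solve(XtX + diag(p), XtX + d * diag(p)) %*% solve(XtX, t(X))
  sse <- sum((y - H %*% y)^2)        # residual sum of squares
  df  <- sum(diag(H))                # trace of the hat matrix
  c(AIC = n * log(sse / n) + 2 * df,
    BIC = n * log(sse) + df * log(n))
}))
rownames(info) <- paste0("d=", d_vals)
info
```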
Other examples and uses of the different functions can be found in the R documentation
of the liureg package. The code of some functions (such as summary(), predict(),
fitted() and resid()) is not presented here, to avoid repeating content that differs
only in minor details and to keep the number of pages manageable.
5.5.2.2 Graphical Results
The effect of the biasing parameter d on (i) the Liu coefficients and (ii) the bias,
variance and MSE can be explored using the plot() and plot.biasliu()
functions, respectively. Similarly, plot.infoliu() can be used to choose a value of d
by plotting information criteria such as AIC and BIC against the values of d given
as the argument of the liu() function. The plot of the Liu coefficients against the
biasing parameter d can be obtained with,
> plot(mod)
Figure 5.1: Liu Trace: Liu Coefficient against Biasing Parameter d
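A trace of this kind can also be drawn with base graphics, which may help when the coefficient paths need to be customized. The sketch below is not the plot method of the package: it assumes centered (unscaled) regressors, an illustrative grid of d from -1 to 1, and hypothetical object names (d_grid, beta_path).

```r
# Hald cement data (13 observations), entered by hand
hald <- data.frame(
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4),
  X1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  X2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  X3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  X4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12)
)
X <- scale(as.matrix(hald[, -1]), center = TRUE, scale = FALSE)  # assumption: centered only
y <- hald$y - mean(hald$y)
p <- ncol(X)
XtX <- crossprod(X)
beta_ols <- solve(XtX, crossprod(X, y))

# Liu coefficients along a grid of d (rows: regressors, columns: d values)
d_grid    <- seq(-1, 1, by = 0.05)     # illustrative grid
beta_path <- sapply(d_grid, function(d)
  solve(XtX + diag(p), (XtX + d * diag(p)) %*% beta_ols))
rownames(beta_path) <- colnames(X)

# one line per regressor, coefficients plotted against d
matplot(d_grid, t(beta_path), type = "l", lty = 1, col = 1:p,
        xlab = "d", ylab = "Liu coefficients", main = "Liu trace")
legend("topleft", legend = rownames(beta_path), lty = 1, col = 1:p, bty = "n")
```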
To get the bias-variance trade-off plot, use the command,
> plot.biasliu(mod)
Figure 5.2: Bias Variance trade-off