
Computational Statistics and Data Analysis 57 (2013) 95–111


Exact methods for variable selection in principal component analysis: Guide functions and pre-selection

Joaquín Pacheco ∗, Silvia Casado, Santiago Porras
Departamento de Economía Aplicada, Universidad de Burgos, Spain

∗ Correspondence to: Fac C. Económicas y Empresariales, Plaza Infanta Elena s/n, BURGOS 09001, Spain. Tel.: +34 947 25 90 21; fax: +34 947 25 80 13. E-mail addresses: [email protected] (J. Pacheco), [email protected] (S. Casado), [email protected] (S. Porras).

0167-9473/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2012.06.014

Article info

Article history:
Received 16 November 2011
Received in revised form 5 June 2012
Accepted 13 June 2012
Available online 18 June 2012

Keywords:
PCA
Variable selection
Branch & Bound methods
Guide functions
Filters

Abstract

A variable selection problem is analysed for use in Principal Component Analysis (PCA). In this case, the set of original variables is divided into disjoint groups. The problem resides in the selection of variables, but with the restriction that the set of variables that is selected should contain at least one variable from each group. The objective function under consideration is the sum of the first eigenvalues of the correlation matrix of the subset of selected variables. This problem, with no known prior references, has two further difficulties, in addition to that of the variable selection problem: the evaluation of the objective function and the restriction that the subset of selected variables should also contain elements from all of the groups. Two Branch & Bound methods are proposed to obtain exact solutions that incorporate two strategies: the first one is the use of ‘‘fast’’ guide functions as alternatives to the objective function; the second one is the preselection of variables that help to comply with the latter restriction. From the computational tests, it is seen that both strategies are very efficient and achieve significant reductions in calculation times.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Data sets with large numbers of variables are processed in many disciplines such as economics, sociology, engineering, medicine and biology, among others. The researcher has to process a lot of data classified by a large number of variables, which are often difficult to summarize or interpret. One useful approach involves the reduction of data dimensionality, while trying to preserve as much of the original information as possible. A common way of doing this is through Principal Component Analysis (PCA).

PCA is widely applied in data mining to investigate data structures. Its purpose is to construct components, each of which contains a maximal amount of variation with respect to the data that are unexplained by any other components. Standard results guarantee that retaining the k Principal Components (PCs) with the largest associated variance produces the k-subset of linear combinations of the n original variables which, according to various criteria, represents the best approximation of the original variables (see, for example, Jolliffe, 2002). The user therefore expects that the information in the data can be summarized into a few principal components. Once the principal components have been determined, all further analysis can be carried out on them instead of on the original data, as they summarize the relevant information. Thus, PCA is frequently



considered the first step of a statistical data analysis that aims at data compression: decreasing the dimensionality of the data, but losing as little information as possible.

While PCA is highly effective at reduced-dimensional representation, it does not provide a real reduction of dimensionality in terms of the original variables, since all n original variables are required to define a single Principal Component (PC). As McCabe (1984) has stated, ‘‘interpretation of the results and possible subsequent data collection and analysis still involve all of the variables’’. Moreover, Cadima and Jolliffe (1995, 2001) have shown that a PC can provide a misleading measure of variable importance, in terms of preserving information, because it is based on the assumption that the resultant of a linear combination (PC) is dominated by the vectors (variables) with large magnitude coefficients in that linear combination (high PC loadings). This assumption ignores the influence of the magnitude of each vector (the standard deviation of each variable) and the relative positions of the vectors (the pattern of correlations between the variables). One way of achieving a simple interpretation is to reduce the number of variables, that is, to look for a subset of the n variables that approximate, as far as possible, the k retained PCs. We consider the combinatorial problem of identifying, for any arbitrary integer p (k ≤ p ≤ n), a p-variable subset which is optimal with respect to a given criterion.

In most cases, the inclusion of all the variables in a statistical analysis is, at best, unnecessary and, at worst, a serious impediment to the correct interpretation of the data. From a computational point of view, variable selection is a Nondeterministic Polynomial Time-Hard or NP-Hard problem (Kohavi, 1995; Cotta et al., 2004): there is therefore no guarantee of finding the optimum solution. This means that when the size of the problem is large, finding an optimum solution is, in practice, unfeasible. Two different methodological approaches have been developed for variable selection problems: optimal or exact techniques (enumerative techniques), which are able to guarantee an optimal solution, but which are only applicable to small-sized sets; and heuristic techniques, which are able to find good solutions (although unable to guarantee the optimum) within a reasonable amount of time.

The problem of selecting a subset of variables from a larger candidate pool abounds in areas such as multiple linear regression (Furnival and Wilson, 1974; Miller, 2002; Gatu and Kontoghiorghes, 2006; Hofmann et al., 2007; Gatu et al., 2007), logistic regression (Pacheco et al., 2009), polynomial regression (Peixoto, 1987; Brusco et al., 2009b; Brusco and Steinley, 2010), factor analysis (Kano and Harada, 2000; Hogarty et al., 2004), cluster analysis (Brusco and Cradit, 2001; Steinley and Brusco, 2008; Krzanowski and Hand, 2009), and discriminant analysis (McCabe, 1975; McKay and Campbell, 1982a,b; Pacheco et al., 2006).

Specifically, the problem of variable selection in PCA has been investigated by Jolliffe (1972), Jolliffe (1973), Robert and Escoufier (1976), McCabe (1984), Bonifas et al. (1984), Krzanowski (1987a), Falguerolles and Jmel (1993), Mori et al. (1994), Jolliffe (2002), Duarte Silva (2002), Cadima et al. (2004), Mori et al. (2007) and Brusco et al. (2009a), among others. These studies sought to obtain PCs based on a subset of variables, in such a way that they retain as much information as possible in comparison to PCs that are based on all the variables. To address this problem, it is necessary to tackle two secondary problems: (1) the establishment of an objective criterion that can measure the quality or fitness of every subset of variables; and (2) the development of solution procedures for finding optimal, or at least near-optimal, subsets based on these criteria. The methods proposed by Jolliffe (1972, 1973) consider PC loadings, and those of McCabe (1984) and Falguerolles and Jmel (1993) use a partial covariance matrix to select a subset of variables which, as far as possible, maintains information on all variables. Robert and Escoufier (1976) and Bonifas et al. (1984) used the RV-coefficient, and Krzanowski (1987a,b) used Procrustes analysis to evaluate closeness between the configuration of PC computations based on selected variables and the configuration based on all of the variables. Tanaka and Mori (1997) discussed a method called ‘‘modified PCA’’ (MPCA) to derive PCs which were computed by using only a selected subset of variables but which represented all of the variables, including those that were not selected. Since MPCA naturally includes variable selection procedures in its analysis, its criteria can be used directly to detect a reasonable subset of variables. Further criteria may be considered based, for example, on the influence analysis of variables and on predictive residuals, using the concepts reported in Tanaka and Mori (1997), and in Krzanowski (1987a,b), respectively; for more details see Iizuka et al. (2003).

Thus, the existence of several methods and criteria is one of the typical features of variable selection in multivariate methods without external variables such as PCA (here the term ‘‘external variable’’ is used as a variable to be predicted or explained using the information derived from other variables). Moreover, the existing methods and criteria often provide different results (selected subsets of variables), which is regarded as a further typical feature. This occurs because each criterion has its own reasonable variable selection method, despite its purpose of selecting variables; we cannot say, therefore, that one is better than any other. These features are not observed in multivariate methods with external variable(s), such as multiple regression analysis (Mori et al., 2007).

In practical applications of variable selection, it is desirable to have a computational environment where those who want to select variables can apply a suitable method for their own selection purposes without difficulties and/or can try various methods and choose the best method by comparing results.

In many studies, the initial variables are divided into previously selected groups. In these cases it is required, or at least recommended, to use variables from all of the groups considered. This happens, for example, in the construction of composite indicators that are used in several areas (economy, society, quality of life, nature, technology, etc.). They are used as measures of the evolution of regions or countries in such areas. The composite indicators should try to cover all points of view of


the analysed phenomenon (which may be identified with each of the different groups of variables) and should therefore contain at least one variable from each group (or other types of similar conditions), so that they encompass all the points of view.

The importance of composite indicators is explained in Nardo et al. (2005a,b) and Bandura (2008), among other references. The convenience of using variables from all of the groups considered is explicitly indicated at least in Nardo et al. (2005a), Ramajo-Hernández and Márquez-Paniagua (2001) and López-García and Castro-Núñez (2004). In several of the examples mentioned in these references and links (see http://composite-indicators.jrc.ec.europa.eu/articles_books.htm) there are predefined groups of variables and every group participates in the final composite indicator. For example, Tangian (2007) gives an example of building a composite indicator to measure working conditions; in this work 10 groups of variables are considered (Physical environment, Health, Time factors, etc.). In Chan et al. (2005) a composite indicator is built to be used as an analytic tool to examine the quality of life in Hong Kong. The Index is released annually. It consists of 20 variables that are grouped into three groups: Social (10 variables), economic (7) and environmental (3). In Nardo et al. (2005a,b) a Technology Achievement Index is proposed, and the following groups are considered: creation of technology (2 variables), diffusion of recent innovations (2), diffusion of old innovations (2) and human skills (2). In López-García and Castro-Núñez (2004) an indicator for regional economic activity in Spain is built. The following groups are considered: Agriculture (2 variables), Construction (1), Industry (4), Merchandise Services (9) and No-Merchandise Services (2). Several more examples can be found in the literature; some of them can be taken from Bandura (2008), an annual survey covering about 170 international composite indicators.

Several restrictions can be considered in order to ensure the ‘‘participation’’ of every group in the final solution. In this work we consider the simplest one: ‘‘to take at least one variable from each group’’. This constraint has been considered in Calzada et al. (2011). Nevertheless, other harder and/or more sophisticated restrictions can be considered depending on different questions (initial size of every group, opinion of experts, etc.). In any case, this simplest constraint can illustrate the difficulty of these kinds of problems.

A variable selection problem is studied in this work for subsequent use in PCA, but with the restriction that the set of variables that are selected contains at least one variable from each of the previously selected groups. The objective function under consideration is the percentage of the total variation explained by the first k Principal Components of the selected variables, which is to say, the sum of the first k eigenvalues divided by the sum of all eigenvalues. As is said in Cadima and Jolliffe (2001), the first few principal components of a set of variables are derived variables with optimal properties in terms of approximating the original variables. Although other choices are possible (see Jolliffe, 2002, Section 10.1, and Jolliffe, 1989), the principal components of interest are conventionally those with the largest variance. How many of these to keep can be determined in a variety of ways (Jolliffe, 2002, Section 6.1, and Richman, 1992, review many of the methods that have been suggested). This paper deals with the k principal components simultaneously because, if we retain k PCs, we are interested in interpreting the space spanned by those PCs rather than the individual PCs considered separately. The restriction that there should be at least one variable from each group in the selected set ensures that all viewpoints are considered. In addition, it helps to avoid the selection of variables that have high correlations between each other. In general, the variables of the same group are usually more highly correlated with each other than with variables from other groups. There are no references in the literature for this specific problem.

One difficulty in this problem is as follows: the objective function under consideration (the percentage of the total variation explained by the first k Principal Components) requires a high number of calculations every time it is evaluated (calculation and sum of eigenvalues). If an exploration is done in which many solutions are analysed, the total computation time of the search might be excessive, due to the evaluation time of the objective function for each solution. This difficulty can be overcome with the design of methods where the search is not directly guided by the objective function, but by other alternative functions that are related to it and require fewer calculations. The other main difficulty comes from the restriction that selected subsets should have at least one element from each group. The process of constructing solutions (in the context of both exact and heuristic methods) should take this restriction into account, as otherwise the constructed solutions might not comply with it and might, therefore, be rejected. The inclusion of a strategy which will, as far as possible, prevent the generation of unfeasible solutions, and in consequence the performance of unnecessary calculations, therefore appears necessary.

In this paper, two Branch and Bound-based methods are proposed. Two alternative guide functions that differ from the objective function will be used in these exact methods; specifically, they will be used in the branch process. In addition, a pre-selection strategy will also be used in the branch process, which will favour the generation of feasible solutions. As will be confirmed later on, the alternative guide functions and the preselection manage to reduce total computing time considerably, thereby achieving faster and more efficient explorations.

The remainder of this work is organized as follows: Section 2 is devoted to the definition of the problem and to some theoretical results that will be used later on. Section 3 describes the proposed guide functions. Section 4 describes the two Branch and Bound methods. The computational experiences are shown in Section 5; they analyse, above all, the effect of the use of guide functions and the pre-selection strategy. Finally, Section 6 presents the final conclusions of the study.


2. Definition of the problem

2.1. Prior definitions

Consider a data matrix, X, corresponding to m cases and characterized by n variables. We shall label the set of variables V = {1, 2, . . . , n} (the variables are identified by their indices for the sake of simplicity).

Let xij be the value of variable j in case i, i = 1, . . . , m; j = 1, . . . , n.

Let xj be the column vector with the values of variable j, in other words xj = (x1j, x2j, . . . , xmj)′, j = 1, . . . , n.

It is known that X = (xij)i=1..m; j=1..n = (x1 x2 · · · xn).

Consider the correlation matrix Σ = (ρij)i,j=1..n; in other words, ρij is the correlation between variables i and j.

For any subset of variables S ⊂ V, the correlation submatrix of the variables of S is denoted by Σ(S). Given a small fixed positive integer k (for example k = 2, 3, or 4), then

f(S) = the sum of the largest k eigenvalues of the matrix Σ(S).

Logically, |S| ≥ k; if |S| = k, obviously f(S) = k.
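As a concrete illustration of this definition, a minimal sketch of how f(S) could be evaluated with standard numerical tools is given below; NumPy is assumed, and the function name eval_f and the 0-based indexing are illustrative choices, not from the paper.

```python
import numpy as np

def eval_f(Sigma, S, k):
    """f(S): sum of the k largest eigenvalues of the correlation submatrix Sigma(S).

    Sigma : (n, n) correlation matrix of all variables.
    S     : list of selected variable indices (0-based).
    k     : number of leading eigenvalues to add up.
    """
    sub = Sigma[np.ix_(S, S)]           # correlation submatrix of the selected variables
    eigvals = np.linalg.eigvalsh(sub)   # eigenvalues of a symmetric matrix, in ascending order
    return float(np.sum(eigvals[-k:]))  # sum of the k largest
```

For instance, when |S| = k the trace of Σ(S) equals k, so the call returns exactly k, as noted above.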

2.2. Formulation of the problem

Consider a partition of the set of variables V into q groups, that is, V = G1 ∪ G2 ∪ · · · ∪ Gq, where Gr represents each one of the groups of variables into which V is divided. Let p ∈ N, with p ≥ k, p ≥ q and p ≤ n, so that the problem may be defined as

Maximize f(S)    (1)

subject to

|S| = p    (2)
S ∩ Gr ≠ ∅, r = 1..q    (3)
S ⊂ V.    (4)

The optimum solution and the value corresponding to the problem defined by (1)–(4) are respectively denoted by S∗p and g(p), that is

g(p) = f(S∗p) = max{f(S) : |S| = p, S ⊂ V, S ∩ Gr ≠ ∅ ∀r = 1..q}.

2.3. Observations and theoretical results

Let Σ be the correlation matrix defined in Section 2.1 (and in general any correlation matrix) and let λ1, λ2, . . . , λk be its largest k eigenvalues in order (λ1 > λ2 > · · · > λk). It is well known that

λr = u′r · Σ · ur, ∀r = 1..k,

where {u1, u2, . . . , uk} is the subset of orthonormal eigenvectors respectively associated with λ1, λ2, . . . , λk.

Theorem 1. Let Σ be the correlation matrix defined in Section 2.1 and λ1, λ2, . . . , λk its largest k eigenvalues in decreasing order (λ1 > λ2 > · · · > λk); then

$$\sum_{r=1}^{k} \lambda_r \;=\; \max\left\{ \sum_{r=1}^{k} v_r' \,\Sigma\, v_r \;:\; \|v_r\| = 1,\; v_s'\, v_r = 0,\; r, s = 1..k,\; r \neq s \right\}.$$

In order to demonstrate this result, it may be confirmed that the earlier set of orthonormal eigenvectors {u1, u2, . . . , uk} verifies the conditions of Lagrange optimality. Two corollaries arise as a consequence of this theorem. The mathematical proofs of these corollaries are explained in the Appendix and their results will be used in the design of the Branch and Bound algorithms described in Section 4.


3. Search for values and guide functions

The problem proposed by (1)–(4) involves an important difficulty, due to the computation time needed to calculate the function f(S) for any S ⊂ V. The methods to determine the largest k eigenvalues of a positive-definite symmetric matrix of order p and, as a result, their sum, are iterative methods based on the QR decomposition (Francis, 1961–1962; Kublanovskaya, 1961). These methods usually need several iterations to converge, and θ(p³) operations are required for each iteration. If the size of the problem (value of n) is not small, the number of solutions to evaluate will be large, and the required computation time could therefore be high.

Moreover, when a set of variables S changes, even though only slightly (when a new variable is incorporated, an existing one eliminated, or both simultaneously), the eigenvalues of the corresponding correlation matrix change and there is no way (to the best of the authors’ knowledge) of calculating the eigenvalues of the new matrix from the eigenvalues of the former matrix that might involve a reduction in the calculation time. In other words, although the set of variables S′ may have been obtained with a small modification made to S, calculation of f(S′) has to be done by applying the methods mentioned earlier, and the information relating to S and f(S) cannot be used for a faster calculation of f(S′). However, such information can be used as a guide or approximation, in some cases.

More specifically, for the sake of simplicity, let us assume that S = {1, . . . , p} and let Σ(S) be its correlation matrix, λ1, λ2, . . . , λp its eigenvalues ordered from large to small, and u1, u2, . . . , up the corresponding eigenvectors, verifying ∥ui∥ = 1 and u′i · uj = 0, ∀i, j ∈ {1, . . . , p}, i ≠ j. Also:

– U = (u1 u2 · · · up) ∈ Rp×p
– x∗i = the column vector of standardized variable i, i = 1..p
– X∗(p) = the data matrix corresponding to the standardized variables of S, i.e. X∗(p) = (x∗1 x∗2 · · · x∗p)
– zi = X∗(p) · ui, i.e. the column vector of the values of the ith component.

Consider now the matrix

$$U^{*} = \begin{pmatrix} U & 0 \\ 0' & 1 \end{pmatrix} \in \mathbb{R}^{(p+1)\times(p+1)},$$

where 0 = (0, 0, . . . , 0)′ ∈ Rp, and consider any t ∈ V − S (in this case any t ∈ {p + 1, . . . , n}). It may easily be verified that (z1 z2 · · · zp x∗t) = U∗ · (x∗1 x∗2 · · · x∗p x∗t). In other words, the set of columns (z1 z2 · · · zp x∗t) may be obtained by an orthonormal transformation from the set of columns (x∗1 x∗2 · · · x∗p x∗t), and the variance–covariance matrices of both sets of columns will therefore have the same eigenvalues. We refer to both matrices as C and D, respectively. Thus:

$$D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 & \sigma_1 \\ 0 & \lambda_2 & \cdots & 0 & \sigma_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \lambda_p & \sigma_p \\ \sigma_1 & \sigma_2 & \cdots & \sigma_p & 1 \end{pmatrix}$$

in which σi is the covariance between the columns zi and x∗t, i = 1, . . . , p.

Both matrices have the same eigenvalues, and although the second has a simpler appearance, the calculation of its eigenvalues continues to require (according to our knowledge) the use of the aforementioned methods. Now, if we consider the columns z∗i = zi/√λi, i = 1, . . . , p, then the variance–covariance matrix E of the columns (z∗1 z∗2 · · · z∗p x∗t) is as follows:

$$E = \begin{pmatrix} 1 & 0 & \cdots & 0 & \sigma^{*}_1 \\ 0 & 1 & \cdots & 0 & \sigma^{*}_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & \sigma^{*}_p \\ \sigma^{*}_1 & \sigma^{*}_2 & \cdots & \sigma^{*}_p & 1 \end{pmatrix}$$

in which σ∗i is the covariance between the columns z∗i and x∗t, i = 1, . . . , p (and also their correlation, as they are standardized variables). The eigenvalues λ∗i of matrix E may be easily calculated:

λ∗1 = 1 + √(σ∗1² + σ∗2² + · · · + σ∗p²),
λ∗i = 1, i = 2, . . . , p;
λ∗p+1 = 1 − √(σ∗1² + σ∗2² + · · · + σ∗p²).

These eigenvalues λ∗i are obviously not the same as those of D (and therefore of C), as the columns zi have been replaced by z∗i (that is, they have been standardized); however, they can give us an idea of the contribution of each column x∗t to the sum of eigenvalues of C. As it is a question of maximizing the sum of the largest eigenvalues of C, it could be thought


that this could be achieved by maximizing the values of the largest λ∗i and, more specifically, the expression of the radicand that appears in λ∗1.

We first calculate σ∗i, i = 1, . . . , p, in order to evaluate the following expression:

$$\sigma^{*}_i = \mathrm{cov}(x^{*}_t, z^{*}_i) = \frac{1}{\sqrt{\lambda_i}} \sum_{j=1}^{p} u_{ij}\,\rho_{jt},$$

in which uij represents the components of vector ui. Thus

$$\sum_{i=1}^{p} \sigma^{*2}_i \;=\; \sum_{i=1}^{p} \frac{1}{\lambda_i} \left( \sum_{j=1}^{p} u_{ij}\,\rho_{jt} \right)^{2} \qquad (5)$$

The value of (5), the radicand that appears in λ∗1, requires a calculation time of θ(p²) once the eigenvectors ui are obtained. The sum of the squared covariances of column x∗t with the columns z∗i can give us a relative measure of the contribution to the objective function of variable t when it is added to S (in comparison with the other variables that may be added). In the same way, a further measure relating to that contribution could be given by the sum of the squared covariances of column x∗t with the original columns zi (that is, not standardized). If σi is the covariance between x∗t and zi, then the sum will take the following form:

$$\sum_{i=1}^{p} \sigma_i^{2} = \sum_{i=1}^{p}\left(\sum_{j=1}^{p} u_{ij}\,\rho_{jt}\right)^{2} = \sum_{i=1}^{p} (\rho_{1t}\ \cdots\ \rho_{pt})\, u_i\, u_i'\,(\rho_{1t}\ \cdots\ \rho_{pt})' = (\rho_{1t}\ \cdots\ \rho_{pt})\left(\sum_{i=1}^{p} u_i\, u_i'\right)(\rho_{1t}\ \cdots\ \rho_{pt})' = (\rho_{1t}\ \cdots\ \rho_{pt})\, I_{pp}\,(\rho_{1t}\ \cdots\ \rho_{pt})';$$

where Ipp is the identity matrix of order p. The last equality is a consequence of the orthonormality of the columns ui. Thus:

$$\sum_{i=1}^{p} \sigma_i^{2} = \sum_{i=1}^{p} \rho_{it}^{2}. \qquad (6)$$

This expression requires a calculation time of θ(p) and, in addition, requires no prior calculation of the eigenvectors ui.

Let S be any set of variables, S ⊂ V; then for each t ∈ V − S let us define h0(S, t) = f(S ∪ {t}), that is, the value of the objective function when t is added to S. Similarly, we shall denote expressions (5) and (6) by h1(S, t) and h2(S, t), respectively. In Section 5, we shall examine the efficacy of both functions as ‘‘guides’’ to determine the variables t ∈ V − S with larger values of h0(S, t).
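The two guide functions follow directly from expressions (5) and (6). The NumPy-based sketch below is illustrative only: it assumes Σ(S) is nonsingular and reuses the indexing conventions of the eval_f sketch in Section 2; the names h1 and h2 come from the text, everything else is an assumption.

```python
import numpy as np

def h2(Sigma, S, t):
    """Expression (6): sum of squared correlations of candidate t with the variables in S."""
    S = np.asarray(S)
    return float(np.sum(Sigma[S, t] ** 2))

def h1(Sigma, S, t):
    """Expression (5): the radicand of lambda*_1, using the spectral decomposition of Sigma(S)."""
    S = np.asarray(S)
    lam, U = np.linalg.eigh(Sigma[np.ix_(S, S)])  # columns of U are the eigenvectors u_i
    proj = U.T @ Sigma[S, t]                      # inner products sum_j u_ij * rho_jt
    return float(np.sum(proj ** 2 / lam))         # assumes Sigma(S) has no zero eigenvalues
```

As discussed above, h2 needs only the correlations of t with the variables of S (time θ(p)), while h1 additionally needs the eigendecomposition of Σ(S).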

4. Description of the methodology

4.1. First Branch and Bound method

Corollaries 1 and 2 (in the Appendix) allow the design of an exact Branch and Bound (BnB) based method, similar to some that can be found in the literature for several variable selection problems (Brusco et al., 2009a,b; Brusco and Steinley, 2011). This method makes it possible to find the optimal solution, S∗p and g(p), for all values of p (verifying p ≥ k, p ≥ q, p ≤ n), when the value of n is moderate.

The BnB algorithm performs a recursive analysis of the set of solutions. This analysis is performed by means of a search tree. Each node of the tree corresponds to a set of solutions. When a node J is explored, the question of whether the set of associated solutions can improve some of the values g(p) found up until that moment is determined. If it is determined that no associated solution to node J can improve any value of g, the exploration of that node is ended. Otherwise, the associated set is divided into two subsets that are associated with the two nodes K and L that emerge from node J. Subsequently, nodes K and L are explored. Initially the origin node, the one that corresponds to all the solutions, is explored.

More specifically, the solutions associated with each node J are determined by two subsets A and B ⊂ V, such that A ∩ B = ∅. In this way, the set of solutions for J is formed by all subsets S ⊂ V that contain A (A ⊂ S) and that do not contain elements of B (B ∩ S = ∅). (Put another way, the elements of A would be ‘‘fixed’’, and those of B ‘‘forbidden’’.) Division of node J into nodes K and L entails the determination of an element a ∈ V − A − B. Subsequently, the sets A′ = A ∪ {a}, B′ = B, A′′ = A and B′′ = B ∪ {a} are defined, and then nodes L and K are respectively associated with the solutions determined by A′′ and B′′ (L) and by A′ and B′ (K). Fig. 1 illustrates the way in which the recursive division functions.


Fig. 1. Branch process in the first Branch and Bound method.

Let A and B be two subsets A, B ⊂ V, such that A ∩ B = ∅. The exploration of the node associated with them (the procedure Exploration_node) checks whether the node can still improve the incumbent values of g and, if so, selects the branching element a; the complete BnB method simply applies this exploration recursively, starting from the origin node.
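A rough sketch of this recursive scheme is given below. It is consistent with the description in this section but is not the authors' original pseudocode; it reuses the eval_f and h2 sketches from earlier sections, and all names, data structures and the choice of h2 as the default guide are illustrative assumptions.

```python
def branch_and_bound(Sigma, groups, k):
    """Illustrative sketch of the BnB scheme described above (not the authors' pseudocode).

    Sigma  : full correlation matrix of the n variables.
    groups : partition G_1..G_q as a list of lists of variable indices.
    k      : number of leading eigenvalues used in f.
    Returns a dictionary best[p] = (g(p), S*_p) for the sizes p reached by the search.
    """
    n = Sigma.shape[0]
    V = frozenset(range(n))
    best = {}                                      # incumbent values g(p) and solutions

    def record(S):
        """Update the incumbents with a feasible solution S (one variable per group)."""
        if len(S) >= k and all(S & set(G) for G in groups):
            val = eval_f(Sigma, sorted(S), k)
            if val > best.get(len(S), (float('-inf'), None))[0]:
                best[len(S)] = (val, set(S))

    def explore(A, B):
        record(A)
        free = V - A - B
        if not free:
            return
        # Bounding test (the restriction referred to as line (7) in the text):
        # no S with A <= S <= V - B can then improve the incumbent values of g.
        if len(A) >= k and eval_f(Sigma, sorted(V - B), k) <= best.get(len(A), (float('-inf'),))[0]:
            return
        # Branching choice (line (8)): candidate maximizing the guide function;
        # h2 is used here, but f or h1 could be used instead (variants 0-2 of Section 5).
        a = max(free, key=lambda v: h2(Sigma, sorted(A), v) if A else 0.0)
        explore(A | {a}, B)                        # node K: a becomes "fixed"
        explore(A, B | {a})                        # node L: a becomes "forbidden"

    explore(frozenset(), frozenset())
    return best
```

The bounding and branching steps marked in the comments correspond to the points discussed next.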

Finally, the following points should be made:

– The restriction that is asked for in line (7) of the Exploration_node procedure ensures that there are no solutions in that node that will improve the values of g, and the exploration should therefore be ended. This is based on Corollaries 1 and 2 from the Appendix. In fact, it follows that any solution S in that node verifies A ⊂ S ⊂ V − B. Therefore, if the restriction in line (7) is fulfilled, then f(S) ≤ f(V − B) ≤ g(|A|) ≤ g(|S|); so, S will not improve g(|S|).

– The way variable a is chosen in line (8) will obviously not change the final result, but it can be important for the total calculation time. In our case, the a ∈ V − A − B is chosen which most increases the value of the objective function when added to A. In other words, a is chosen such that f(A ∪ {a}) = max{f(A ∪ {v})/v ∈ V − A − B}. As commented on earlier, the calculation of f can require a long calculation time; it is thus proposed, as an alternative, to choose the a ∈ V − A − B that maximizes either h1(A, a) or h2(A, a).


– Furthermore, the restriction of having one element of each group in the solutions that are obtained can be taken into account when considering the element a to enter. Regardless of the guide function in use (f, h1 or h2), unnecessary explorations may be avoided if preference is given to the variables of those groups that have no element in A. More formally, a subset of variables Pre_Sel ⊂ V − A − B may be formed and a can be chosen in this subset. The subset Pre_Sel (pre-selected variables) is defined as follows:

If A ∩ Gr ≠ ∅, ∀r ∈ {1..q}, then make
Pre_Sel = V − A − B;
otherwise, make
Pre_Sel = {v ∈ (V − A − B) ∩ Gr : r ∈ {1..q}, A ∩ Gr = ∅}.

The effect of using this alternative form of selecting a will be analysed in the computational tests.
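A small sketch of this pre-selection rule is shown below; the names are illustrative, groups is the partition G1..Gq, and V, A, B are sets of variable indices.

```python
def pre_sel(V, A, B, groups):
    """Pre_Sel: restrict candidates to groups not yet represented in A, if any such group exists."""
    free = V - A - B
    uncovered = [set(G) for G in groups if not (A & set(G))]
    if not uncovered:                        # A already contains an element of every group
        return free
    return free & set().union(*uncovered)    # only variables from the uncovered groups
```

The branching element a is then chosen as the maximizer of the guide function over pre_sel(V, A, B, groups) instead of over V − A − B (variants 3–5 of Section 5).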

4.2. Second Branch and Bound method

A second Branch and Bound method (BnB2) will now be explained. The difference with the first one is that two or more branches can emerge from each node of the search tree. The idea of this new method is to try to obtain the greatest advantage from the earlier information. In consequence, unnecessary divisions (with solutions that do not improve the current ones) are identified in the exploration of each node as soon as possible.

As before, each node has two subsets A and B ⊂ V, such that A ∩ B = ∅. In that way, as in the BnB method, the solutions S associated with the aforementioned node are all the subsets S ⊂ V that contain A (A ⊂ S) and that do not contain elements of B (B ∩ S = ∅). The exploration of each node in this case follows a similar scheme, except that a subset of elements {a1, a2, . . . , at}, rather than a single element, is determined at each node.

The size t of the subset {a1, a2, . . . , at} is not fixed. For each node defined by A and B, the way in which this subset is determined in line (10) is described as follows:


Fig. 2. Branch process in the second Branch and Bound method.

In this procedure, the ar values selected are those that cause the largest decrease in the value of f(Lt) when that element is eliminated from Lt−1 (in other words, from V − B − {a1, a2, . . . , at−1}). In order to determine those at elements, as the calculation time of the corresponding f values can be high, the sum of their squared correlations with the other elements of Lt−1 is used as the guide function. Initially, the a1 ∈ V − B − A with the largest value of the sum of squared correlations is selected; subsequently a2, a3 and so on are determined successively in the same way. The process continues until a set Lt is formed such that either g(|A|) ≥ f(Lt) or Lt does not contain elements from all of the groups.

Note that, in the first iteration, the value of the sum of the squared correlations of each element v ∈ L0 − A with the other elements of L0 that appears in (11) requires θ(|L0 − A| − 1) operations. However, this sum may be very easily updated in the other iterations, in the following way:

$$\sum_{\substack{b \in L_{r-1} \\ b \neq v}} \mathrm{corr}^{2}(v, b) \;=\; \sum_{\substack{b \in L_{r-2} \\ b \neq v}} \mathrm{corr}^{2}(v, b) \;-\; \mathrm{corr}^{2}(v, a_{r-1}).$$

As in the BnB method, in order to avoid unnecessary explorations, preference may be given to the choice of groups of variables that have no element in A. Thus, instead of considering Lr−1 − A, line (11) proposes, as an alternative, that the elements v ∈ Pre_Sel ⊂ Lr−1 − A should be considered, where the subset Pre_Sel is defined in the following way:

If A ∩ Gr ≠ ∅, ∀r ∈ {1..q}, then make
Pre_Sel = Lt−1 − A;
otherwise, make
Pre_Sel = {v ∈ (Lt−1 − A) ∩ Gr : r ∈ {1..q}, (A ∪ {a1, a2, . . . , at−1}) ∩ Gr = ∅}.

Fig. 2 illustrates the way in which the recursive division functions. Finally, as in the earlier case, the overall BnB2 method is very simple.

5. Computational experiences

5.1. Design of correlation matrix samples

A series of correlation matrix samples will be generated for the different computational tests. The process of generating these matrices, similar to that used in Brusco et al. (2009a,b), consists of designing population correlation matrices L; a set of vectors following the normal distribution with correlation matrix L is generated from each population correlation matrix L, and finally the corresponding sample correlation matrix Σ is obtained.


The method of generating vectors from a multivariate normal distribution of order n, with 0 means and correlation matrix L, that is N(0, L), is as follows. A lower triangular matrix T, of order n, is determined such that T · T′ = L; subsequently, vectors z with distribution N(0, In) are generated, and y = T · z is calculated. The y vectors calculated in this way follow the distribution N(0, L).

There are several ways of obtaining the lower triangular matrix T such that T · T′ = L. In our case, the square root method was used, which can be found in the works of Naylor (1977) and Rubinstein (1981). Different methods to generate the values of a normal distribution N(0, 1) may also be found in these texts.
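A sketch of this generation scheme is shown below; the Cholesky factor plays the role of T, and the NumPy routines and names are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def sample_mvn(L_pop, m, rng=None):
    """Draw m vectors from N(0, L_pop) using a lower triangular T with T T' = L_pop."""
    rng = np.random.default_rng() if rng is None else rng
    T = np.linalg.cholesky(L_pop)                  # lower triangular factor of the population matrix
    Z = rng.standard_normal((m, L_pop.shape[0]))   # rows are z ~ N(0, I_n)
    return Z @ T.T                                 # each row y = T z follows N(0, L_pop)
```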

The population correlation matrices L are designed according to a simple pattern: the correlations between the different variables can take two values: a high value if they belong to the same group, or otherwise a low one. The population matrices therefore depend on the following parameters:

– n: the number of variables and therefore the sample size
– nblocks: number of groups
– sizeblock: size of each group (to simplify, let us suppose groups of the same size, and then sizeblock = n/nblocks)
– wcor: correlation between variables of the same group (let us suppose that this correlation is higher than between the variables of different groups)
– lcor: correlation between variables of different groups.

A population matrix L is shown below with the following values: n = 12, nblocks = 4, sizeblock = 3, wcor = 0.7 and lcor = 0.2.

1    0.7  0.7  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2
0.7  1    0.7  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2
0.7  0.7  1    0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2
0.2  0.2  0.2  1    0.7  0.7  0.2  0.2  0.2  0.2  0.2  0.2
0.2  0.2  0.2  0.7  1    0.7  0.2  0.2  0.2  0.2  0.2  0.2
0.2  0.2  0.2  0.7  0.7  1    0.2  0.2  0.2  0.2  0.2  0.2
0.2  0.2  0.2  0.2  0.2  0.2  1    0.7  0.7  0.2  0.2  0.2
0.2  0.2  0.2  0.2  0.2  0.2  0.7  1    0.7  0.2  0.2  0.2
0.2  0.2  0.2  0.2  0.2  0.2  0.7  0.7  1    0.2  0.2  0.2
0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  1    0.7  0.7
0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.7  1    0.7
0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.7  0.7  1

As may be seen, in order to simplify the design, the variables of the same group follow each other, starting with variable 1. In this case, the first block is formed by variables 1, 2 and 3, the second by variables 4, 5 and 6, and so on.

Finally, to remain consistent with previous definitions, m is the number of vectors or cases that are generated following the distribution N(0, L), from which Σ is subsequently obtained. A value of m = 1000 was used.
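Under these parameters, the whole generation pipeline can be sketched as follows; it reuses the sample_mvn sketch above, and all names are illustrative.

```python
import numpy as np

def block_population_matrix(nblocks, sizeblock, wcor, lcor):
    """Population correlation matrix: wcor within blocks, lcor between blocks, 1 on the diagonal."""
    n = nblocks * sizeblock
    L_pop = np.full((n, n), lcor)
    for r in range(nblocks):
        idx = slice(r * sizeblock, (r + 1) * sizeblock)
        L_pop[idx, idx] = wcor
    np.fill_diagonal(L_pop, 1.0)
    return L_pop

# Example: the 12-variable matrix shown above, and the corresponding sample correlation matrix
L_pop = block_population_matrix(nblocks=4, sizeblock=3, wcor=0.7, lcor=0.2)
Y = sample_mvn(L_pop, m=1000)             # m = 1000 cases, as in the paper
Sigma = np.corrcoef(Y, rowvar=False)      # sample correlation matrix of the n = 12 variables
```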

5.2. Use of the guide functions h1 and h2 to add elements to solutions

In this section, a set of computational tests is described to check the efficiency and efficacy of the two functions h1 and h2 when they are used, instead of the objective function f, as a guide to determine the best elements to add to the solution S. More specifically, suppose that we have a solution S, and we wish to add an element v ∈ V − S that increases the value of f(S ∪ {v}). The value of f(S ∪ {v}) may be calculated for each v ∈ V − S, although, as commented on earlier, this might require a lengthy calculation time. An alternative, mentioned in Section 3, applies the functions h1 and h2, which use less calculation time. In this sub-section the quality of these approximations is tested.

A set of tests was therefore conducted: in each one, a set V with its corresponding correlation matrix was considered, as well as a subset S ⊂ V obtained at random and a preset value for k. The following were calculated in every one of these tests:

– fmax = max{f(S ∪ {v})/v ∈ V − S}
– fmin = min{f(S ∪ {v})/v ∈ V − S}
– v1 = argmax{h1(S, v)/v ∈ V − S}
– v2 = argmax{h2(S, v)/v ∈ V − S}
– r1 = (f(S ∪ {v1}) − fmin)/(fmax − fmin)
– r2 = (f(S ∪ {v2}) − fmin)/(fmax − fmin).

The values r1 and r2 can serve as quality indicators of the candidate variables to add to S that were suggested by h1 and h2. Moreover, v3 ∈ V − S was chosen at random and r3 = (f(S ∪ {v3}) − fmin)/(fmax − fmin) was calculated.
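One such test replication can be sketched as follows, reusing the eval_f, h1 and h2 sketches from earlier sections; the names are illustrative and the handling of ties or of a zero range fmax − fmin is omitted.

```python
import numpy as np

def one_test(Sigma, p, k, rng=None):
    """One replication: draw S at random and score the candidates proposed by h1, h2 and at random."""
    rng = np.random.default_rng() if rng is None else rng
    n = Sigma.shape[0]
    S = list(rng.choice(n, size=p, replace=False))
    cand = [v for v in range(n) if v not in S]
    f_vals = {v: eval_f(Sigma, S + [v], k) for v in cand}     # f(S U {v}) for every candidate
    f_max, f_min = max(f_vals.values()), min(f_vals.values())
    v1 = max(cand, key=lambda v: h1(Sigma, S, v))             # candidate suggested by h1
    v2 = max(cand, key=lambda v: h2(Sigma, S, v))             # candidate suggested by h2
    v3 = int(rng.choice(cand))                                # random candidate, for reference

    def ratio(v):
        return (f_vals[v] - f_min) / (f_max - f_min)

    return ratio(v1), ratio(v2), ratio(v3)                    # r1, r2, r3
```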

These tests have been divided into sets according to the values of n (= |V|), p (= |S|) and k (∈ {2, 3}). In all, 20 combinations of these values were considered, in other words 20 sets, and 100 tests were conducted for each one. The sets were named by the values of the parameters; so, 20-6-2 indicates the set in which n = 20, p = 6, and k = 2. The 20 sets under consideration are shown in Table 1.


Table 1
Division into sets.

#    Set
1    20-6-2
2    20-6-3
3    20-9-2
4    20-9-3
5    25-8-2
6    25-8-3
7    25-12-2
8    25-12-3
9    50-12-2
10   50-12-3
11   50-18-2
12   50-18-3
13   100-12-2
14   100-12-3
15   100-18-2
16   100-18-3
17   100-24-2
18   100-24-3
19   100-30-2
20   100-30-3

Table 2
Summary of results.¹

Set        m1       m2       m3       Y1    Y2    Y3   T1        T2        T0
20-6-2     0.99101  0.98185  0.39025   999   998  203  0.000061  0.000000  0.000500
20-6-3     0.98737  0.97902  0.48697  1000  1000  487  0.000047  0.000000  0.000501
20-9-2     0.95115  0.93721  0.35196   969   956  204  0.000140  0.000000  0.000747
20-9-3     0.98828  0.97478  0.45421   997   997  284  0.000094  0.000000  0.000748
25-8-2     0.97395  0.96831  0.32228   989   995  142  0.000047  0.000015  0.000935
25-8-3     0.99472  0.98838  0.46189  1000  1000  382  0.000096  0.000000  0.000907
25-12-2    0.93317  0.90631  0.36185   948   915  220  0.000157  0.000000  0.001580
25-12-3    0.95742  0.93010  0.36159   993   982  243  0.000247  0.000000  0.001699
50-12-2    0.95063  0.95421  0.22190   963   982   72  0.000300  0.000000  0.004160
50-12-3    0.98021  0.98152  0.28572   988   999  120  0.000219  0.000000  0.004070
50-18-2    0.91520  0.89546  0.27501   910   881  115  0.000530  0.000000  0.007427
50-18-3    0.93772  0.92604  0.25350   981   978  125  0.000486  0.000015  0.006760
100-12-2   0.96283  0.94783  0.19646   965   982   77  0.000345  0.000000  0.009166
100-12-3   0.98690  0.96361  0.23799   998   995  143  0.000267  0.000016  0.009298
100-18-2   0.94968  0.92309  0.18365   958   941   42  0.000688  0.000015  0.017865
100-18-3   0.97772  0.95779  0.21339   986   988   68  0.000537  0.000000  0.018336
100-24-2   0.92245  0.85352  0.21521   916   790   62  0.001341  0.000000  0.030685
100-24-3   0.95992  0.89764  0.20109   991   925   66  0.001079  0.000016  0.029221
100-30-2   0.87862  0.80894  0.25126   819   667   61  0.001872  0.000015  0.042016
100-30-3   0.93784  0.82939  0.20992   968   779   72  0.002096  0.000015  0.044597

¹ Values in Table 2 that are 0.000000 in the T columns indicate values lower than 10⁻⁶.

Note that 4 values for n were considered: 20, 25, 50 and 100. A single correlation matrix was considered for each value of n. Thus, 4 sample correlation matrices were generated from 4 population matrices, as explained in Section 5.1, with sizes n = 20, 25, 50 and 100. The values of the other parameters of the population matrices are: nblocks = n/5, sizeblock = 5, wcor = 0.7 and lcor = 0.2.

A summary of the test results by sets is shown in Table 2. Specifically:

– The mi columns indicate the average of the values ri, i = 1..3.
– The Yi columns indicate the number of times in which ri obtained a value higher than 0.75, i = 1..3.
– The Ti columns indicate the time, measured in seconds, used to obtain vi in each test, i = 1..3.
– The T0 column indicates the average time used to obtain fmax.

The following conclusions may be drawn from the results shown in Table 2:

– Guide functions h1 and h2 yield results that are very close to those that would be obtained using the objective function, but with a much lower computation time.

– As can be observed, in the case of h1 the corresponding m1 values are, in all cases, over 0.90, except for the set 100-30-2, which dropped to 0.878. In 14 sets, 0.95 is exceeded and there are even two sets in which 0.99 is exceeded. If we observe column Y1, it is clear that in all the sets, except for set 100-30-2, the number of tests where r1 is greater than 0.75 is largely in excess of 900. Finally, the computation time used in the calculation of v1 (T1) is very much lower than that used to calculate fmax (T0).





– With regard to the analysis of the use of h2, the values for m2 are also over 0.90 in 15 of the 20 sets. In the other 5 sets, it is over 0.80, and 4 of these correspond to larger p values. Similarly, the values of column Y2 are over 900 in 16 of the 20 cases. Finally, the calculation time used by h2 to calculate v2 is very much lower than that used to calculate fmax. In 13 of the 20 sets, the average time T2 is less than 10⁻⁶ s.

– In general, the use of h1 yields better results than h2. Nevertheless, the calculation time in the second case is clearly less than in the first one. Moreover, in both cases a certain deterioration of the results for high values of p (p = 24, 30) is evident, although they are still good.

– From observation of the values of m3 and Y3, it may be concluded that the random selection of elements of V − S obtains results that are much worse than those obtained with h1 or h2. In fact, the values obtained for m3 are lower than 0.5. There is therefore no distribution of the values of f(S ∪ {v}) that favours the production of high values; instead, the contrary appears to be true. In consequence, the high values obtained when h1 or h2 are used must be attributed to the quality of these functions as guides.

In Section 5.3, the effect of the use of these guide functions in the different Branch and Bound algorithms will be analysed.

5.3. Analysis of the first Branch and Bound method with different guide functions

In this sub-section and in Sections 5.4 and 5.5, different variants of both Branch and Bound methods are described and analysed. A table summarizing the features of every variant is shown at the end of Section 5.5.

As commented on in Section 4.1, the way of selecting the element a ∈ V − A − B in line (8) of the Exploration_node procedure will obviously not change the final result of the BnB method, but it may affect its computation time. In this sub-section, three variant methods of selecting this element a will be compared:

– a = argmax{f(A ∪ {v})/v ∈ V − A − B} (Variant 0)
– a = argmax{h1(A, v)/v ∈ V − A − B} (Variant 1)
– a = argmax{h2(A, v)/v ∈ V − A − B} (Variant 2).

In all, 20 correlation matrix samples were generated from 20 population matrices that were obtained by combining the following parameter values: n = 12, 15, 18, 21, 24; nblocks = n/3; sizeblock = 3; wcor = 0.5, 0.7 and lcor = 0.1, 0.2. Thus five different sizes of n were considered, and four matrices per size. The three variants of the BnB algorithm were executed for each matrix and each value of k (k = 2, 3). Obviously, each variant obtains the optimum S∗p and its values g(p), ∀p ∈ {k + 1, . . . , n}. We are now interested in analysing the difference in calculation times used by each variant. Table 3 is shown below with the average times of each variant for each value of n. The average number of explored nodes is also shown (in other words, the number of times the procedure Exploration_node is executed).

Table 3
Average calculation times and explorations of each BnB variant.

      Computation time                     Number of explored nodes
n     Variant 0   Variant 1   Variant 2    Variant 0    Variant 1    Variant 2
12       0.2495       0.291       0.183      3662.00      3748.75      3744.50
15        2.636       2.936       1.840     21491.50     20922.50     20989.75
18       23.687      25.707      15.510    126838.25    127766.25    127576.00
21      206.474     207.667     121.340    938034.75    947198.25    952928.25
24     1682.827    1644.170     936.152   5892405.50   5987401.75   5985164.50

The following may be observed in Table 3:

– Variant 2 uses less computation time than variants 0 and 1. This reduction is more easily appreciated as the size of n increases and is close to 45% for n = 24. Likewise, variants 0 and 1 use similar computation times; perhaps variant 1 uses somewhat less time for matrices of greater size (n = 24). The significance of these differences will be analysed later.

– The three variants present a very similar evolution of the number of explorations. If we compare the values of variants 0 and 2, we observe that the latter needs a slightly higher number of explorations (except for n = 15). Thus, the explorations of variant 0 appear to be more effective but, taking the computation time into account, the explorations of variant 2 are more efficient.

Table 4 shows the results of several t-tests of differences of means (Means) in calculation time for each pair of variants under consideration and for each value of n.

It can be seen from Table 4 that variant 1 only uses less calculation time than variant 0 in the larger-sized problems, but this difference is not significant (tail probability of 0.475). Variant 2 has significantly shorter calculation times than the other two variants for all values of n. The tail probabilities are all equal to or less than 0.05, except in two cases (var. 0–var. 1 with n = 21 and n = 24).


Table 4
Differences in average tests of calculation times shown in Table 3.

n    Pairs            Means
12   Var. 0–Var. 1     −0.041125
     Var. 0–Var. 2      0.066125
     Var. 1–Var. 2      0.107250
15   Var. 0–Var. 1     −0.300000
     Var. 0–Var. 2      0.795750
     Var. 1–Var. 2      1.095750
18   Var. 0–Var. 1     −2.020125
     Var. 0–Var. 2      8.176375
     Var. 1–Var. 2     10.196500
21   Var. 0–Var. 1     −1.193125
     Var. 0–Var. 2     85.133375
     Var. 1–Var. 2     86.326500
24   Var. 0–Var. 1     38.656500
     Var. 0–Var. 2    746.674625
     Var. 1–Var. 2    708.018125

Table 5
Average calculation times for each variant of BnB.

n    Variant 0   Variant 1   Variant 2   Variant 3   Variant 4   Variant 5
12      0.2495       0.291       0.183       0.213       0.251       0.152
15       2.636       2.936       1.840       2.065       2.280       1.433
18      23.687      25.707      15.510      17.710      18.451      11.059
21     206.474     207.667     121.340     145.495     140.413      85.938
24    1682.827    1644.170     936.152    1068.740    1009.731     589.427

5.4. Analysis of the preference by groups without elements in A

As commented in Section 4.1 (in addition to the choice of suitable guide functions), the efficiency of the Branch and Bound procedure may be improved if preference is given, in each exploration of nodes, to the choice of elements from those groups which have no elements in A in line (7). In this way, the intention is to avoid unnecessary explorations. The following three new variants are proposed to analyse the efficiency of this strategy, i.e. ways of selecting a in line (7):

– a = argmax{f(A ∪ {v})/v ∈ Pre_Sel} (Variant 3)
– a = argmax{h1(A, v)/v ∈ Pre_Sel} (Variant 4)
– a = argmax{h2(A, v)/v ∈ Pre_Sel} (Variant 5)

where Pre_Sel was defined in Section 4.1. Note that these variants correspond to variants 0, 1 and 2, respectively, but changing the set V − B − A for its corresponding subset Pre_Sel. Tests were conducted to determine the computation time used by these 3 new variants with the same former values of k. Table 5 shows the average results for each value of n. The average results of variants 0, 1 and 2 are added to the table for a better comparison.

As can be seen in Table 5, the variants that use the subset Pre_Sel reduce computation time with regard to the corresponding variants that do not use this pre-selection. In other words, variant 3 uses less computation time than variant 0, variant 4 less than variant 1, and variant 5 less than variant 2 (more than 30% of the calculation time for large-sized matrices). Besides, a comparison of variants 3, 4 and 5 with each other and with variants 0, 1 and 2 leads to similar conclusions: the computation times of variants 3 and 4 are similar, although for larger-sized problems variant 4 is slightly quicker. Variant 5 uses less computation time than both of them.

Tables 6 and 7 show the results of several t-tests of means applied to the calculation times for different pairs of variants under consideration. The objective of these tests is to check whether these differences in calculation time are significant. Specifically, Table 6 contrasts the difference in calculation times between variants 0 and 3, variants 1 and 4, and variants 2 and 5. In other words, the intention here is to contrast the pre-selection effect.

As can be seen from Table 6, regardless of the guide function in use, the use of preselection significantly improves the calculation times. The tail probabilities are all equal to or less than 0.001.

Finally, Table 7 contrasts the differences in calculation times of variants 3, 4 and 5 taken in pairs. In other words, the idea is to compare the effect of the different guide functions when the preselection of elements is applied.

As can be seen from Table 7, the results are similar to those shown in Table 4. Variant 4 uses less computation time than variant 3 only for larger-sized problems (n = 21 and 24); however, this difference is significant only for n = 24. Moreover, variant 5 uses a significantly shorter computation time than the other 2 for all values of n. The tail probabilities are all equal to or less than 0.05, except in two cases (var. 3–var. 4 with n = 18 and n = 21).

In short, the use of suitable guide functions (especially h2) in the explorations of each node, combined with a simple variable preselection strategy (that avoids unnecessary explorations), considerably reduces the computation time. In the case of larger-sized problems, if variant 0 (average time 1682.827 s) is compared with variant 5 (589.427 s), the reduction is around 66% of the calculation time.


Table 6
Mean t-tests in calculation time highlighting the pre-selection effect.

n    Pairs            Means
12   Var. 0–Var. 3      0.036625
     Var. 1–Var. 4      0.039250
     Var. 2–Var. 5      0.031500
15   Var. 0–Var. 3      0.571625
     Var. 1–Var. 4      0.656750
     Var. 2–Var. 5      0.407500
18   Var. 0–Var. 3      5.976750
     Var. 1–Var. 4      7.255875
     Var. 2–Var. 5      4.451625
21   Var. 0–Var. 3     60.978625
     Var. 1–Var. 4     67.253500
     Var. 2–Var. 5     35.402000
24   Var. 0–Var. 3    614.086250
     Var. 1–Var. 4    634.438625
     Var. 2–Var. 5    346.725250

Table 7
Means t-tests in calculation time between variants 3, 4 and 5.

n    Pairs            Means
12   Var. 3–Var. 4     −0.038500
     Var. 3–Var. 5      0.061000
     Var. 4–Var. 5      0.099500
15   Var. 3–Var. 4     −0.214875
     Var. 3–Var. 5      0.631625
     Var. 4–Var. 5      0.846500
18   Var. 3–Var. 4     −0.741000
     Var. 3–Var. 5      6.651250
     Var. 4–Var. 5      7.392250
21   Var. 3–Var. 4      5.081750
     Var. 3–Var. 5     59.556750
     Var. 4–Var. 5     54.475000
24   Var. 3–Var. 4     59.008875
     Var. 3–Var. 5    479.313625
     Var. 4–Var. 5    420.304750


5.5. Comparison of the 2 Branch and Bound methods (BnB and BnB2)

In this sub-section, the functioning of the second Branch and Bound (BnB2) method, described in Section 4.2, is analysed. A first set of tests has the aim of examining the effect of using variable pre-selection during the selection of the set of variables {a1, a2, . . . , at} in the exploration of each node. So, two variants of the method are considered, according to the set of variables from which element at in line (11) is selected:

– Variant 0: v ∈ Lr−1 − A
– Variant 1: v ∈ Pre_Sel ⊂ Lr−1 − A

where Pre_Sel is defined in Section 4.2. The results obtained for both variants are shown below, with the same matrices used in Section 5.4. Specifically, in Table 8 the average results are shown (time and number of explorations). The results obtained by variant 5 of the BnB method are added, in order to have a better comparison of both Branch and Bound methods.

It may be noted from Table 8 that fewer explorations are conducted with variant 1 of BnB2 and that it uses less computation time than variant 0. Therefore, the use of the preselection of elements in BnB2 is also advantageous. However, variant 1 of the BnB2 method does not outperform the computation time of the best variant (variant 5) of the BnB method, although it does involve a smaller number of explorations. Table 9 shows the results of several means t-tests, for each value of n, of the earlier calculation times. Specifically, the calculation times of variants 0 and 1 of BnB2 (BnB2.v0 and BnB2.v1) are contrasted, as well as the calculation times of variant 1 of BnB2 and variant 5 of BnB (BnB.v5). Table 9 shows the difference of means (Means).

The results of Table 9 show that the differences between the calculation times are clearly significant; moreover, the larger the size of n, the more significant they are. The tail probabilities are all equal to or less than 0.02.


Table 8
Comparison of the 2 Branch and Bound methods.

        Computation time (s)                    Number of explored nodes
n       BnB2 v.0    BnB2 v.1    BnB v.5         BnB2 v.0       BnB2 v.1       BnB v.5
12         0.199      0.16575     0.152           3757.38        2482.00        4715.75
15         2.131      1.59525     1.433          28139.50       16387.13       29181.50
18        20.409     13.90925    11.059         205088.38      104086.13      167619.25
21       188.393    115.76575    85.938        1480388.38      651873.75      988215.25
24      1673.336    894.73213   589.427       10458303.00     3862981.88     5344688.75

Table 9
Means t-tests of the calculation times shown in Table 8 (differences of means).

n    Pairs                 Means
12   BnB2.v0–BnB2.v1         0.033125
     BnB2.v1–BnB.v5          0.013875
15   BnB2.v0–BnB2.v1         0.536125
     BnB2.v1–BnB.v5          0.162000
18   BnB2.v0–BnB2.v1         6.499625
     BnB2.v1–BnB.v5          2.850625
21   BnB2.v0–BnB2.v1        72.627625
     BnB2.v1–BnB.v5         29.827250
24   BnB2.v0–BnB2.v1       778.604125
     BnB2.v1–BnB.v5        305.305375

Table 10
Summary of the features of all variants.

Method                            Variant     Guide function    Filters (pre-selection)
First Branch and Bound (BnB)      BnB.v0      f                 No
                                  BnB.v1      h1                No
                                  BnB.v2      h2                No
                                  BnB.v3      f                 Yes
                                  BnB.v4      h1                Yes
                                  BnB.v5      h2                Yes
Second Branch and Bound (BnB2)    BnB2.v0     h2                No
                                  BnB2.v1     h2                Yes

Table 10 summarizes the features of all variants analysed in this section. Finally, we must point out that similar computational experiments have been performed with different correlation matrix structures, and similar results have been obtained.

6. Conclusions

A variable selection problem has been analysed in this work for use in Principal Component Analysis (PCA). In this case, the set of original variables is divided (partitioned) into disjoint groups, each group corresponding to a variable type. The problem consists in the selection of variables, but with the restriction that the set of variables that is selected should contain at least one variable of each type or group.

This problem, as previously explained, has a wide scope of application, above all in the design of composite indicators for sociological and economic studies, among others. The different groups of variables correspond to different viewpoints or aspects of the problem under analysis, and these synthetic indicators should encompass all these aspects. Therefore, the indicators should contain variables from all of the groups. The objective function under consideration is the sum of the first k (k = 2, 3) eigenvalues of the correlation matrix of the subset of selected variables.

The problem expressed in this way has no earlier known references. In addition to the complexity associated with the selection of variables, the problem has two added difficulties: the evaluation of the objective function and the aforementioned restriction that the subset of selected variables should contain elements from all of the groups.

In this work, two Branch and Bound methods have been proposed to obtain exact solutions. Two strategies were proposed in order to overcome, or at least alleviate or ‘‘soften'', the aforementioned difficulties: the first is the use of fast ‘‘guide'' functions for evaluation in the division processes, as alternatives to the objective function; the second strategy proposes ‘‘filtering'' or pre-selecting those variables, at each node of the search tree, that ‘‘help'' to fulfil the restriction that there should be elements from each group.


In the computational tests, it was noted that each of the two strategies was highly efficient by itself; important and significant reductions in calculation time were achieved. Their joint use led to reductions of more than 65% in large-sized problems.

In any case, exact methods can only be applied to problems of moderate size, owing to the calculation time that they require. In this work, problems with 24 original variables have been solved and, even though immediate solutions are not required in the majority of applications, it might not be viable to use exact methods for problems involving more than 40 variables. In future work, the design of heuristic methods that may be used in larger-sized problems will be analysed.

Finally, we have to point out that the problem addressed in this work can be considered or formulated as a Multi-objective Optimization problem. From our point of view there are three objectives: (1) minimizing k (in order to have the simplest possible representation of the data); (2) minimizing p (in order to reduce the cost of collecting the data, together with the other benefits of variable selection explained in this work and in the works referred to); and (3) maximizing the sum of the k highest eigenvalues (in order to maximize the variance of the components). These objectives are in ‘‘conflict'', as explained in this work. When the Branch and Bound methods are executed with different input values, the set of non-dominated solutions is obtained. Then, from this set, the best solution (combination) must be selected, depending on the point of view and/or preferences of the analyst. Very often this selection is performed by building mixed objective functions in which these preferences are incorporated. In any case, it is always recommended to perform a deep analysis of the set of non-dominated solutions.
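To make the multi-objective reading concrete, the following sketch, which is only an illustration and not part of the proposed algorithms, filters the non-dominated (k, p, explained variance) combinations collected from repeated runs with different input values; the sample tuples are invented.

# Sketch: keep the non-dominated (k, p, variance) combinations, where k and p
# are to be minimized and the sum of the k largest eigenvalues is maximized.
def dominates(a, b):
    """True if solution a = (k, p, var) dominates solution b."""
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better

def non_dominated(solutions):
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Invented example tuples (k, p, sum of the k largest eigenvalues):
runs = [(2, 6, 4.1), (2, 8, 4.6), (3, 6, 5.0), (3, 8, 5.9), (2, 8, 4.2)]
print(non_dominated(runs))   # drops (2, 8, 4.2), which is dominated by (2, 8, 4.6)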

Acknowledgements

This work was partially supported by FEDER and the Spanish Ministry of Science and Innovation (Project ECO2008-06159/ECON), the Regional Government of Castile-Leon (Spain) (Project BU008A10-2), the University of Burgos and ‘‘Caja Burgos'' (Spain, Internal Projects). Their support is gratefully acknowledged.

Appendix

Corollary 1. If S ⊂ S′, then f(S) ≤ f(S′).

With no loss of generality, and to simplify the demonstration, let us assume that k = 2 and that S′ contains just one element more than S, for example S = {1, 2, . . . , t} and S′ = {1, 2, . . . , t, t + 1}. Moreover, consider the vectors v1, v2 ∈ R^t such that

$$ f(S) = v_1' \Sigma(S)\, v_1 + v_2' \Sigma(S)\, v_2, \qquad \lVert v_1 \rVert = \lVert v_2 \rVert = 1 \ \text{ and } \ v_1' v_2 = 0. $$

Subsequently, v1*, v2* ∈ R^(t+1) are defined as v1* = (v1, 0) and v2* = (v2, 0); that is, v1* and v2* have the same components as v1 and v2, with a 0 added in position t + 1. Therefore ‖v1*‖ = ‖v2*‖ = 1, v1*′ · v2* = 0 and, moreover,

$$ f(S) = v_1' \Sigma(S)\, v_1 + v_2' \Sigma(S)\, v_2 = v_1^{*\prime} \Sigma(S')\, v_1^{*} + v_2^{*\prime} \Sigma(S')\, v_2^{*} \leq f(S'). $$

It may be confirmed, for example, that

$$ v_1^{*\prime}\, \Sigma(S')\, v_1^{*} = \begin{pmatrix} v_1' & 0 \end{pmatrix} \begin{pmatrix} \Sigma(S) & a \\ a' & \sigma_{t+1,\,t+1} \end{pmatrix} \begin{pmatrix} v_1 \\ 0 \end{pmatrix} = \begin{pmatrix} v_1' \Sigma(S) & v_1' a \end{pmatrix} \begin{pmatrix} v_1 \\ 0 \end{pmatrix} = v_1' \Sigma(S)\, v_1, $$

where a = (σ_{1,t+1}, σ_{2,t+1}, . . . , σ_{t,t+1})′ ∈ R^t.
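As a numerical illustration of Corollary 1 (not part of the paper), the following Python sketch checks, for a randomly generated correlation matrix, that the sum of the two largest eigenvalues of Σ(S) never exceeds that of Σ(S′); the data are invented and the function name f simply mirrors the notation above.

# Numerical check of Corollary 1 for k = 2: f(S) <= f(S') when S is contained
# in S', with f the sum of the k largest eigenvalues of the correlation
# submatrix. Randomly generated data, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))             # 200 observations, 8 variables
Sigma = np.corrcoef(X, rowvar=False)      # 8 x 8 correlation matrix

def f(subset, k=2):
    sub = Sigma[np.ix_(subset, subset)]
    return np.linalg.eigvalsh(sub)[-k:].sum()   # eigvalsh returns ascending eigenvalues

S = [0, 2, 5]
S_prime = [0, 2, 5, 7]                    # S plus one extra variable
assert f(S) <= f(S_prime) + 1e-12
print(f(S), f(S_prime))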

The second corollary is a consequence of the first one.

Corollary 2. For all p, p′ ∈ {k + 1, . . . , n − 1}, if p < p′ then g(p) ≤ g(p′).

In effect, to simplify, let us define p′ = p + 1, let S*_p = {1, . . . , p} denote (without loss of generality) a subset attaining g(p), and let S′ = {1, . . . , p, p + 1}. Obviously, S*_p ⊂ S′. In addition, S′ ∩ Gr ≠ ∅ because S*_p ∩ Gr ≠ ∅, r = 1, . . . , q. Therefore

$$ g(p) = f(S^{*}_{p}) \leq f(S') \leq \max\{\, f(S) : |S| = p + 1,\ S \subset V,\ S \cap G_r \neq \emptyset,\ r = 1, \ldots, q \,\} = g(p + 1). $$
