Automated Induction of Heterogeneous Proximity Measures for Supervised Spectral Embedding


Transcript of Automated Induction of Heterogeneous Proximity Measures for Supervised Spectral Embedding

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 10, OCTOBER 2013 1575

Automated Induction of Heterogeneous Proximity Measures for Supervised Spectral Embedding

Eduardo Rodriguez-Martinez, Tingting Mu, Member, IEEE, Jianmin Jiang, and John Yannis Goulermas, Senior Member, IEEE

Abstract— Spectral embedding methods have played a very important role in dimensionality reduction and feature generation in machine learning. Supervised spectral embedding methods additionally improve the classification of labeled data, using proximity information that considers both features and class labels. However, these calculate the proximity information by treating all intraclass similarities homogeneously for all classes, and similarly for all interclass samples. In this paper, we propose a very novel and generic method which can treat all the intra- and interclass sample similarities heterogeneously by potentially using a different proximity function for each class and each class pair. To handle the complexity of selecting these functions, we employ evolutionary programming as an automated powerful formula induction engine. In addition, for computational efficiency and expressive power, we use a compact matrix tree representation equipped with a broad set of functions that can build most currently used similarity functions as well as new ones. Model selection is data driven, because the entire model is symbolically instantiated using only problem training data, and no user-selected functions or parameters are required. We perform thorough comparative experimentations with multiple classification datasets and many existing state-of-the-art embedding methods, which show that the proposed algorithm is very competitive in terms of classification accuracy and generalization ability.

Index Terms— Distance metric learning, evolutionary optimization, heterogeneous proximity information, spectral dimensionality reduction.

Manuscript received July 10, 2012; revised November 10, 2012 and February 24, 2013; accepted April 30, 2013. Date of publication June 12, 2013; date of current version September 27, 2013. This work was supported by CONACyT under Scholarship 19629.

E. Rodriguez-Martinez, T. Mu, and J. Y. Goulermas are with the Department of Electrical Engineering and Electronics, The University of Liverpool, Liverpool L69 3GJ, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

J. Jiang is with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, China, and also with the Department of Computing, University of Surrey, Guildford GU2 7XH, U.K. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2013.2261613

I. INTRODUCTION

CLASSIC feature extraction frequently relies on linear techniques, such as principal component analysis [1] and Fisher discriminant analysis (FDA) [2], to provide low-dimensional projections at low computational costs. However, recent evidence suggests that the use of nonlinear embeddings of data originally lying in low-dimensional manifolds can be more effective [3]. A number of unsupervised spectral embedding methods have been proposed [4]–[6], along with their linear out-of-sample extensions [7]–[10]. These preserve certain characteristics of the original high-dimensional space. For instance, the locality preserving projections (LPP) method [8] and the orthogonal LPP (OLPP) [7] retain the aggregate pairwise proximity information based on local neighborhood graphs, while the orthogonal neighborhood preserving projections (ONPP) [7] keep the reconstruction error from neighboring samples. Nevertheless, in an unsupervised setup, neighboring points near the class boundaries may get projected to the wrong class, and this often increases misclassification rates. As such, several supervised spectral embedding (SSE) alternatives are proposed to alleviate this problem. From these, the methods closely related to FDA [11]–[16] use the between- and within-class information to restrict the embeddings, whereas another class modifies the proximity definition to incorporate the label information through various other formulations [17]–[24].

Despite the existence of a wide range of current SSE methods, according to no free lunch analyses [25] there is no single method that can be optimal for all classification problems. In practice, different alternatives exist for choosing an SSE method for a given classification task. One possibility is the selection of the best performing SSE method among currently existing ones. Such a selection process could be modeled as a computationally expensive grid search over all existing models, involving parameter training for each individual model. Although better search techniques exist to ease the computational burden of direct search [26], [27], the selected model may be suboptimal because of likely assumptions it makes regarding the characteristics of the problem and data. Another alternative is to specifically design a suitable SSE method for the problem at hand. This task may need the involvement of human experts to analyze the data, characterize the problem and eventually propose a new mathematical model.

In this paper, we provide a radically different human-competitive alternative to the design of bespoke SSE methods, which can also be envisaged as a systemic distance metric learning approach. We pose the design process as a complex inference task, in which the optimal model is learnt solely from the dataset at hand via the means of a genetic programming (GP)-based model search. This GP search relies on a very novel encoding scheme expressing each potential model as a set of similarity functions, which are then used to construct the characteristic weighting matrix for the classic embedding optimization problem. To maintain a very broad solution space of possible models that our algorithm can create, we employ very generic and adequately expressive function and terminal sets, which can express not only a large number of existing SSE models, but can also create completely new models that happen to be optimal for the problem at hand.

Fig. 1. Example of a possible SSE configuration of the original space (left) to the embedding one (right), where the three friends of x1 are pulled closer, while its four enemies are pushed far away.

The fundamental difference between the proposed work and existing SSE methods is that our algorithm is a generator of models, as opposed to existing methods that individually define fixed and specific models. Given a classification task, our method can automatically discover the embedding model that optimally suits the user's dataset. More importantly, as a classification problem can have multiple classes, the proposed algorithm can discover models that adapt to distinct intra- and interclass types of similarity information, whereas current methods do not differentiate the manner in which class or class-pair similarities are established. The objective of the proposed method is to be used for the automatic discovery of the SSE model that optimally suits the user's problem, instead of requiring the user to evaluate each existing method individually. This not only accelerates model selection, but can also make it more robust by discovering a new bespoke SSE model that fits the dataset optimally.

The organization of this paper is as follows. Section II-A succinctly revisits previous SSE methods by uniformly formulating their proximity weighting measures, and also provides the rationale for the proposed method. Section II-B describes the proposed underlying model, the composition scheme of the proximity matrices and the embedding generation procedure. Section II-C introduces the designed evolutionary learning framework as a robust model selection procedure, and highlights the expressive power of the employed function and terminal sets. The experimental setup and results are included in Section III, and the conclusions in Section IV.

II. PROPOSED LEARNING FRAMEWORK

A. Motivation of Heterogeneous Proximity Measures

If we assume a set {x_i ∈ R^m}_{i=1}^n of training samples corresponding to discrete labels y_i ∈ {1, . . . , c}, an SSE algorithm generates n embeddings {z_i ∈ R^b}_{i=1}^n of b ≪ m dimensions each. These two sets can also be conveniently denoted through the original n × m feature matrix X = [x_ij] and the n × b embedding matrix Z = [z_ij], with row vectors the original samples x_i and the embeddings z_i, respectively.

The supervised character of an SSE method implies that its objective is the creation of a new configuration in the embedding space, where the class structures and separabilities existing in the original space are not only maintained, but also reinforced. As it is shown in Fig. 1, this is achievable by increasing the similarities between friends (samples from the same class), but making enemies (samples from different classes) more distant. A straightforward way of implementing this behavior is to minimize the weighted sum of all the pairwise embedding distances, as in the standard unsupervised embedding, according to the following:

\min_{\{z_i \in \mathbb{R}^b\}_{i=1}^{n}} \; \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \, \| z_i - z_j \|_2^2    (1)

or equivalently

\min_{Z \in \mathbb{R}^{n \times b}} \; \mathrm{trace}\left[ Z^T L Z \right]    (2)

with the similarity weights w_ij corresponding to enemy pairs (i, j) treated differently from the friend ones. In (2), W = [w_ij] is the overall similarity matrix. L = D(W) − W is the standard Laplacian one, where D(W) returns a diagonal matrix with the ith element of the diagonal equal to ∑_{k=1}^n w_ik.
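For concreteness, the following is a minimal numerical sketch (not from the paper) of the generic objective in (1) and (2), assuming a dense symmetric similarity matrix W with nonnegative weights: it forms the Laplacian L = D(W) − W and keeps the eigenvectors of L with the smallest nontrivial eigenvalues as the embedding Z.

import numpy as np
from scipy.linalg import eigh

def laplacian_embedding(W, b):
    """Minimize trace(Z^T L Z) of (2) with L = D(W) - W, by keeping the
    eigenvectors of L associated with the smallest nonzero eigenvalues
    (the constant eigenvector for eigenvalue 0 is discarded)."""
    D = np.diag(W.sum(axis=1))       # D(W): row sums on the diagonal
    L = D - W                        # standard graph Laplacian
    vals, vecs = eigh(L)             # eigenvalues returned in ascending order
    return vecs[:, 1:b + 1]          # n x b embedding matrix Z

# toy usage with Gaussian similarities
X = np.random.rand(20, 3)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Z = laplacian_embedding(np.exp(-d2 / 1.5), b=2)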

We succinctly give examples of how different SSE methods use W to directly control and differentiate the friend and enemy vicinities. To facilitate comparison, we re-express all methods in terms of their weights w_ij.

The first of these methods, the discriminant neighborhood embedding (DNE) [17], is based on a sample weighting defined as follows:

w_{ij} = \begin{cases} +1, & \text{if } x_j \in N_F(x_i, k) \,\vee\, x_i \in N_F(x_j, k) \\ -1, & \text{if } x_j \in N_E(x_i, k) \,\vee\, x_i \in N_E(x_j, k) \\ 0, & \text{otherwise} \end{cases}    (3)

where N_F(x_i, k) is the k-nearest friends of x_i, and N_E(x_i, k) is its k-nearest enemies.
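As an illustration only (the helper below is not from the paper), the DNE weighting of (3) can be computed from squared Euclidean distances by marking k-nearest friends with +1 and k-nearest enemies with −1, in either direction of the neighbor relation:

import numpy as np

def dne_weights(X, y, k):
    """Sketch of the DNE weighting (3): +1 for pairs where either sample is
    among the other's k-nearest friends, -1 for k-nearest enemies, 0 otherwise."""
    n = X.shape[0]
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    np.fill_diagonal(d, np.inf)                          # a sample is never its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        friends = same[np.argsort(d[i, same])[:k]]
        enemies = diff[np.argsort(d[i, diff])[:k]]
        W[i, friends] = W[friends, i] = +1.0             # the "or" in (3): one direction suffices
        W[i, enemies] = W[enemies, i] = -1.0
    return W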

The supervised optimal LPP (SOLPP) method [21] integrates class prior probabilities into the weighting matrix via

w_{ij} = \begin{cases} p_{y_i}^2 \, s_{ij} (1 + s_{ij}), & \text{if } x_j \in N_F(x_i, k) \,\vee\, x_i \in N_F(x_j, k) \\ p_{y_i} p_{y_j} \, s_{ij} (1 - s_{ij}), & \text{if } x_j \in N_E(x_i, k) \,\vee\, x_i \in N_E(x_j, k) \\ 0, & \text{otherwise.} \end{cases}    (4)

The term s_ij = exp(−‖x_i − x_j‖_2^2 / τ) is a Gaussian-based similarity controlled by the parameter τ, and p_k (k = 1, 2, . . . , c) is the kth class prior probability.

Another method of interest is the repulsion OLPP (OLPP-R) [18], based on a Laplacian L = L_c − βL_r, defined as the linear combination of a class Laplacian L_c and a repulsion Laplacian L_r, for some user-defined parameter β > 0. This is equivalent to the calculation of the weights according to the following:

w_{ij} = \begin{cases} \frac{1}{n_l}, & \text{if } y_i = y_j = l \\ -\beta s_{ij}, & \text{if } \left[ x_j \in N(x_i, k) \,\vee\, x_i \in N(x_j, k) \right] \wedge y_i \neq y_j \\ 0, & \text{otherwise.} \end{cases}    (5)

The alternative weight s_ij = (σ + ‖x_i − x_j‖_2^2 / (‖x_i‖_2^2 + ‖x_j‖_2^2))^{-1} is used to define the repulsion Laplacian. n_l is the number of samples in class l, and N(x_i, k) is the k-nearest neighbors of pattern x_i.



Fig. 2. Synthetic 3-D dataset (based on [28]) composed of three classes (the "|" and "S" parts in the dollar shape and the Swiss roll) with samples lying on different manifolds.

Supervised ONPP (SONPP) [7] is the method that attempts to reconstruct each sample by a linear combination of its friends. Thus, it only considers similarities between friends based on the following:

w_{ij} = \begin{cases} \tilde{w}_{ij}, & \text{if } y_i = y_j \\ 0, & \text{otherwise} \end{cases}    (6)

where \tilde{w}_{ij} = m_{ij} + m_{ji} − ∑_{k=1}^n m_{ki} m_{kj} is computed from the reconstruction coefficient matrix M = [m_ij]. M is obtained by minimizing the reconstruction error as follows:

\min_{M \in \mathbb{R}^{n \times n}} \; \sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{n} m_{ij} x_j \Big\|_2^2 \quad \text{subject to} \quad m_{ij} = 0 \text{ if } y_i \neq y_j, \qquad \sum_{j=1}^{n} m_{ij} = 1.    (7)

The matrix M has a block diagonal form given by M = diag(M_1, M_2, . . . , M_c), after a simultaneous re-ordering of the rows and columns to keep intraclass samples together. In this arrangement, the lth block corresponds to the lth class.
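The per-class reconstruction coefficients of (7) can be obtained, per sample, by solving the usual LLE-style constrained least-squares system. The sketch below is an assumption about how such an M could be computed (a regularized local Gram solve), not necessarily the exact solver used by SONPP or by this paper:

import numpy as np

def class_reconstruction_coeffs(X, y):
    """Sketch of the block-diagonal matrix M of (7): each sample is reconstructed
    only from same-class samples, with coefficients that sum to one."""
    n = X.shape[0]
    M = np.zeros((n, n))
    for i in range(n):
        idx = np.where(y == y[i])[0]
        idx = idx[idx != i]                          # same-class samples, excluding x_i
        Dif = X[idx] - X[i]                          # differences to the reconstructed sample
        G = Dif @ Dif.T                              # local Gram matrix
        G += 1e-3 * np.trace(G) * np.eye(len(idx))   # small regularizer for stability
        w = np.linalg.solve(G, np.ones(len(idx)))
        M[i, idx] = w / w.sum()                      # enforce sum_j m_ij = 1
    return M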

Another method is the repulsion ONPP (ONPP-R) [18], which amplifies neighboring enemy dissimilarities by incorporating the repulsion Laplacian into SONPP using the following:

w_{ij} = \begin{cases} \tilde{w}_{ij}, & \text{if } y_i = y_j \\ -\beta s_{ij}, & \text{if } \left[ x_j \in N(x_i, k) \,\vee\, x_i \in N(x_j, k) \right] \wedge y_i \neq y_j \\ 0, & \text{otherwise} \end{cases}    (8)

where s_ij is the alternative weight as used by OLPP-R.

A final method we analyze here is the discriminant ONPP (DONPP) [20]. This is very similar to ONPP-R, but it defines the neighboring enemies and their corresponding weights differently as follows:

w_{ij} = \begin{cases} \tilde{w}_{ij}, & \text{if } y_i = y_j \\ -\beta, & \text{if } x_j \in N_E(x_i, k) \,\vee (\text{or } \wedge)\, x_i \in N_E(x_j, k) \\ 0, & \text{otherwise.} \end{cases}    (9)

Although the aforementioned SSE methods are based on different ways of controlling the friend and enemy vicinities, they all rely on a uniform treatment of the weights between friends, and between enemies. In this way, all friends, regardless of the class they belong to, are assigned weights that are calculated via a common measure. Similarly, the weights for the enemy pairs are homogeneously calculated, regardless of the class pairs containing them. The underlying assumption is that samples from each class lie on manifolds having similar geometrical configurations and densities. However, real world datasets may contain multiple, intersecting, or partially overlapped manifolds with different orientations and densities [28], [29]. Additionally, the interclass configurations may also vary, so that the enemy dissimilarities cannot be treated in a homogeneous manner for all class pairs. Fig. 2 shows such a possible scenario.

An SSE algorithm based on standard weight calculations may display poor class separability and fail to generate embeddings that appropriately maintain or reinforce friend proximity and enemy remoteness. Fig. 3 shows a representative demonstration of this situation for the dataset of Fig. 2. In Fig. 3(a), the entire similarity matrix W is calculated using the same measure. The three friend (diagonal) blocks contain Gaussian kernel similarities, whereas the enemy (off-diagonal) blocks are based on negated Gaussian similarities. The corresponding embeddings in Fig. 3(e) are shown to have mixed samples from the "|" and "S" shape classes and also relatively wide scattering of the "|" and roll classes. Similar observations hold for the embeddings of Fig. 3(f) and (g), which rely on the weight matrices W of Fig. 3(b) and (c), calculated with the cosine similarity and the alternative weight measures, respectively. The problem in all three cases in Fig. 3(a), (b), and (c), is that some intraclass samples fail to stay close enough in the embedding space, while some interclass samples fail to keep adequately apart. This is because all of them use a single measure to estimate the weights w_ij, and this measure may not be suitable for all classes. Nevertheless, although none of the three alone can produce good embeddings, their combination can. This is shown in Fig. 3(d), which contains a weight matrix W heterogeneously composed of different blocks W_ij directly taken from Fig. 3(a)–(c). The generated embeddings in Fig. 3(h) are shown to be well separated and fairly compact. A closer look at Fig. 3(d) reveals more interblock contrast, which is expected to be due to the calculation of blocks by different similarity measures. This is not the case for the previous weight images. For example, Fig. 3(c) shows W_12 (corresponding to the negative similarities between "|" and "S"), W_13 (corresponding to "|" and the roll) and W_23 (corresponding to "S" and the roll) to have similar contrast and alike similarity ranges of around [−2, −1]. The blocks W_13 and W_23 in Fig. 3(d) also show similar contrast, which is consistent with the two classes "|" and "S" being on average equivalently dissimilar to the roll class because of their distance from it. W_12 is, however, shown to have a different range of around [−0.5, 0], which is higher than the range [−1.5, −0.5] of W_13 and W_23. This reflects the proximal positioning of "|" and "S" in the original space of Fig. 2. Although this is an artificially constructed example, it demonstrates that the homogeneous treatment of all friend and enemy sample pairs cannot always drive satisfactorily the friend proximity and enemy remoteness in the embedding space. This is because the underlying optimization of (1) and (2) implies multiple objectives involving the friend and enemy embedding space proximity goals for all the different classes. These objectives often need different treatment depending on the data configurations, and this can be accomplished through the use of weights w_ij estimated from different sources.

Fig. 3. Proximity matrices W in (a)–(d) and their corresponding 2-D embeddings in (e)–(h), from processing the dataset of Fig. 2. A fully connected graph is used to calculate the w_ij weights using: (a) Gaussian kernel (τ = 1.5), (b) cosine similarity, (c) alternative weight [18] (σ = 0.35), and (d) matrix W composed of all previous three measures. For display purposes, all matrices are re-ordered to keep together samples from the same class. This gives rise to a 3 × 3 block partitioning, with the three blocks from top to bottom corresponding to the "|", "S" and roll shapes of Fig. 2, respectively. Friend sample pair measurements are plotted within the diagonal blocks, while enemy ones within the off-diagonal blocks. The matrix W of (d) is assembled from blocks taken from the matrices of (a)–(c) as follows: W_21 = W_12^T and W_22 are Gaussian similarities taken from (a), W_11 and W_33 are cosine similarities from (b), and W_31 = W_13^T and W_32 = W_23^T are alternative weights from (c). Embedding generation was based on a supervised version of LPP with all edges connecting enemies assigned negative similarities.

B. Model Definition and Embedding Generation Procedure

The objective of this paper is to address the above issues, by allowing the proximity information stored in W to adapt to the various geometric characteristics of the manifolds, the differing class distributions and their interrelationships. This is a type of generic multilevel distance metric learning applied to all levels of within- and between-class sample (dis)similarity evaluations. To incorporate heterogeneous proximity information and create a composite weight matrix W, a fairly general model could be based on weights redefined as follows:

w_{ij} = \begin{cases} f_l(x_i, x_j), & \text{if } x_j \in N_F(x_i, k_l) \,\wedge\, x_i \in N_F(x_j, k_l), \text{ with } y_i = y_j = l \\ -g_{pq}(x_i, x_j), & \text{if } x_j \in N_E(x_i, k_{pq}, q) \,\wedge\, x_i \in N_E(x_j, k_{pq}, p), \text{ with } y_i = p,\, y_j = q \\ 0, & \text{otherwise.} \end{cases}    (10)

In (10), the weight w_ij for friends is controlled independently for each lth class and diagonal block by a different similarity measure f_l(x_i, x_j). To selectively enforce neighborhood localization, friendship is restricted to pairs of x_i and x_j that are mutually the k_l-nearest friends of each other. The model allows for a different f_l and k_l for each class l ∈ {1, . . . , c}. For the enemies, we also define an individual similarity measure g_pq for each class pair (p, q) ∈ {1, . . . , c}^2 with p ≠ q. For simplicity, here we assume symmetry, such that g_pq = g_qp. For neighborhood control, pairwise enmity between x_i and x_j is defined only for samples mutually appearing within the k_pq-nearest enemies of each other. The set N_E(x_i, k_pq, q) is the k_pq-nearest enemies of x_i from the class q, and its formation is based on a search with the defined proximity function g_pq.

Using this model definition, the similarity matrix W can now be expressed (assuming a simultaneous re-ordering of its rows and columns, to keep intraclass samples together in block form) as follows:

W = F - G = \begin{bmatrix} F_1 & -G_{12} & -G_{13} & \cdots & -G_{1c} \\ -G_{12}^T & F_2 & -G_{23} & \cdots & -G_{2c} \\ -G_{13}^T & -G_{23}^T & F_3 & \cdots & -G_{3c} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -G_{1c}^T & -G_{2c}^T & -G_{3c}^T & \cdots & F_c \end{bmatrix}.    (11)

Each block F_l ∈ R^{n_l × n_l} contains the pairwise similarities between samples from the lth class, and each off-diagonal block G_pq ∈ R^{n_p × n_q} holds the pairwise similarities of samples between classes p and q. The F and G above are the overall friend and enemy matrices. All F_l and G_pq blocks can be sparse depending on k_l and k_pq, and are symmetric as we consider mutual friends and enemies.
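To make the block composition of (10) and (11) concrete, the sketch below (illustrative helper names, not the authors' code) assembles the overall friend and enemy matrices F and G from one matrix function per class and per class pair; the k_l / k_pq mutual-neighbor sparsification of (10) is omitted for brevity:

import numpy as np

def assemble_heterogeneous_W(X, y, F_funcs, G_funcs):
    """Build F, G and W = F - G as in (11).
    F_funcs: dict mapping class label l to a function X_l -> F_l (n_l x n_l).
    G_funcs: dict mapping class pair (p, q), p < q, to (X_p, X_q) -> G_pq."""
    classes = np.unique(y)
    idx = {l: np.where(y == l)[0] for l in classes}
    n = X.shape[0]
    F = np.zeros((n, n))
    G = np.zeros((n, n))
    for l in classes:                                # diagonal friend blocks F_l
        I = idx[l]
        F[np.ix_(I, I)] = F_funcs[l](X[I])
    for p in classes:                                # off-diagonal enemy blocks G_pq
        for q in classes:
            if p < q:
                Ip, Iq = idx[p], idx[q]
                Gpq = G_funcs[(p, q)](X[Ip], X[Iq])
                G[np.ix_(Ip, Iq)] = Gpq
                G[np.ix_(Iq, Ip)] = Gpq.T            # symmetry g_pq = g_qp
    return F, G, F - G

For instance, F_funcs could map one class to a Gaussian-kernel block and another to a cosine block, while a different G_funcs entry handles each class pair, mirroring the heterogeneous composition illustrated in Fig. 3(d).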


The identification of this model requires the fine-tuning of the positive integer parameters k_l, k_pq, and the similarity functions f_l and g_pq. In the form proposed by (10), there are c parameters k_l and f_l for the friend weights, and c(c − 1)/2 parameters k_pq and g_pq for the enemy weights. For problems with a very large number of classes, the model can be restricted to the use of c enemy blocks, by having g_pq = g_pr and k_pq = k_pr, ∀r ≠ q (and dropping block symmetry). In this case, only c enemy weight parameters k_pq and g_pq are needed, one for each class against all others.

The form of (10) and (11) is generic and can represent a wide variety of heterogeneous proximity models. For example, assuming a single k_l and f_l for all c classes and a single k_pq and g_pq for all enemy pairs, we can obtain (with comparable neighborhood control) other existing homogeneous proximity models, such as DNE, OLPP-R, SONPP, and so on. The mechanism we propose to use for the identification of the model of (10) is described in detail in Section II-C.

To generate the optimal embeddings Z, we can solve (2), subject to different orthogonality constraints on Z, to keep the embedding co-ordinates different. In our case, to facilitate out-of-sample learning, we follow [8] and express the embeddings as linear combinations of the input features X, through the transformation Z = XP, where P is the m × b projection matrix. Finally, to flexibly utilize the proximity information stored in W = F − G, we minimize the distances between friends and maximize the distances between enemies in the embedding space. This is done by obtaining the optimal projections as follows:

P^* = \arg\min_{\substack{P \in \mathbb{R}^{m \times b} \\ P^T (X^T \tilde{G} X + \lambda I_{m \times m}) P = I_{b \times b}}} \; \mathrm{trace}\left[ P^T X^T \tilde{F} X P \right]    (12)

where F̃ = D(F) − F and G̃ = D(G) − G are the Laplacian forms of the friend and enemy proximity matrices F and G, respectively. The parameter λ is a regularization parameter acting as a mechanism for controlling the degree of importance of the enemy information. For example, for λ = 0, the constraint of (12) assumes the standard form Z^T G̃ Z = I_{b×b} [12], which enforces orthogonality of Z with respect to G̃. When λ → ∞ the constraint becomes P^T P = I_{b×b} [7], which enforces orthogonality of the projections. Assuming all f_l and g_pq are known, F̃ and G̃ can then be calculated, and the above optimization can be directly solved using a generalized eigen-decomposition involving the two matrix expressions X^T F̃ X from the objective and X^T G̃ X + λ I_{m×m} from the constraint of (12).
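Assuming F̃ and G̃ have been computed, a minimal sketch of this solver is a symmetric generalized eigendecomposition (here via scipy.linalg.eigh, which requires the right-hand matrix to be positive definite, something the λ-regularization helps ensure); the b eigenvectors with the smallest eigenvalues are retained, in line with the minimization in (12). The helper names are illustrative:

import numpy as np
from scipy.linalg import eigh

def laplacian(W):
    """L = D(W) - W, with D(W) the diagonal matrix of row sums."""
    return np.diag(W.sum(axis=1)) - W

def solve_projections(X, F_tilde, G_tilde, lam, b):
    """Solve (X^T F~ X) P = (X^T G~ X + lambda I) P Lambda and keep the b
    eigenvectors with the smallest eigenvalues, as the minimizer of (12)."""
    m = X.shape[1]
    A = X.T @ F_tilde @ X                      # objective matrix of (12)
    B = X.T @ G_tilde @ X + lam * np.eye(m)    # constraint matrix of (12)
    vals, vecs = eigh(A, B)                    # generalized symmetric eigenproblem
    return vecs[:, :b]                         # m x b projection matrix P*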

Because the above setup is linear, it could be restrictive to the classification performance of some problems. To incorporate the handling of data nonlinearities into the embedding generation, we make use of relation features that can capture the nonlinear interactions between samples [22], [30], [31]. This is typically achieved by replacing X with a matrix R, calculated using either a (dis)similarity measure, such as the Euclidean distance, Gaussian or cosine similarity between all samples x_i and x_j, or employing a statistical distance analysis scheme, such as multidimensional scaling. In such a case, the transformation Z = RP is used in (12). A further advantage is that when the original data is of high dimensionality (m ≫ n), it becomes more efficient to process R rather than X, because R can only have up to n columns.
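A rough sketch of such relation features is shown below; it computes only the raw pairwise relation matrices, while the multidimensional scaling step mentioned above is omitted:

import numpy as np

def relation_features(X, kind="euclidean"):
    """Replace the n x m matrix X by an n x n matrix R of pairwise relations."""
    if kind == "euclidean":
        sq = (X ** 2).sum(axis=1)
        return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    if kind == "cosine":
        U = X / np.linalg.norm(X, axis=1, keepdims=True)
        return U @ U.T
    raise ValueError(kind)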

C. Model Identification and Evolutionary Setup

A straightforward way for identifying the numerical parameters k_l and k_pq, and the proximity functions f_l and g_pq of the model defined in (10), is to impose a specific set of ζ possible functionals (for example, all the ones used in existing SSE methods) for the f_l and g_pq, and employ a discrete, enumerative optimization or a grid search based on validation data and a classifier. However, this approach would bear certain disadvantages. First, having c different k_l and f_l, and c(c − 1)/2 (or just c for the g_pq = g_pr case) different k_pq and g_pq, would make the combinatorial requirements very high, because the search would be over ζ^{c(c+1)/2} (or ζ^{2c}) configurations. Second, apart from the fine-tuning of the numerical parameters k_l and k_pq, the parameters of each of the similarity functions f_l and g_pq would also need to be searched. Third, as we target a data-driven model identification approach, the most suitable proximity functions may not be included within the ζ available ones, unless prior information is specified, which is not always the case.

To tackle the above problems, we follow an evolutionary approach, and specifically we employ GP [32], [33]. GP is a stochastic, combinatorial search technique, used to evolve mathematical formulae or computer programs, and is much more efficient and effective than grid-like search procedures. Its search capability is based on an initial population of possible candidate functions, which, through the systematic use of bioinspired analogs of crossover, mutation, and reproduction, improve the solution quality of the population members over repetitively evolved generations. Evolutionary optimization has been successfully used for the automatic design of classification systems in [34]–[36]. GP specifically has recently been shown to derive decision trees (DT) [36], evolve classifiers [37], [38], or perform feature extraction [39], [40].

In this paper, we make use of GP to search for the optimal k_l, k_pq, f_l, and g_pq, as well as the regularizer λ, the embedding dimensionality b, and the various parameters required by the similarity functions f_l and g_pq. The whole process constitutes a model induction, that is, a model selection procedure, but in the symbolic sense, as the algebraic forms of the similarity functions are unknown. The following sections describe in detail how we design the optimization objective that drives the search for the optimum model and the proposed mechanism that encodes the optimizing parameters into a compact genetic representation.

1) Overall Training Procedure: The actual model identification is dispatched by the GP using an objective function that measures the fitness of each chromosome, given training data X and a classifier ψ. Each chromosome encodes a complete potential solution (model) defined as the set {k_l, k_pq, F_l, G_pq, λ, b}, with l, p, q ∈ {1, . . . , c} and p < q. F_l and G_pq are matrix functions that calculate the entire blocks F_l and G_pq of the weight matrix W in (11), and correspond to the vector similarity functions f_l and g_pq of (10) (this is described in detail in the next Section II-C.2).


To achieve good generalization in model selection, we use an out-of-sample error estimation based on h-fold cross-validation (CV). In each ith fold indexed by i = 1, . . . , h, the training data matrix X is randomly partitioned into the training row submatrix T_i and the validation one V_i. Only T_i is used to find the optimal projections P^* by solving (12) using λ, and the friend Laplacian F̃ and enemy Laplacian G̃ calculated from all the k_l, k_pq, F_l, and G_pq contained in the chromosome under evaluation. Then, V_i is used to measure the performance of the classifier ψ trained with T_i. The overall model selection procedure is described by the bilevel optimization problem as follows:

\min_{\substack{k_l \in \{1,\ldots,n_l-1\},\; k_{pq} \in \{1,\ldots,\min(n_p,n_q)\},\\ F_l,\, G_{pq} \in \mathcal{F}\;\; \forall\, l,p,q \in \{1,\ldots,c\}: p<q,\\ \log_{10}(\lambda) \in [-2,4],\; b \in \{1,\ldots,m\}}} \; \sum_{i=1}^{h} E\left(\psi,\, T_i P^*,\, V_i P^*,\, y\right) \quad \text{s.t.} \quad \left(T_i^T \tilde{F} T_i\right) P^* = \left(T_i^T \tilde{G} T_i + \lambda I_{m\times m}\right) P^* \Lambda^*.    (13)

E(·) is the error of the classifier ψ trained with embeddings T_i P^* and tested with the validation embeddings V_i P^*, given the label vector y of all the samples in the rows of X. F is the set of possible functions that can be expressed by our GP module and its function set (discussed in Section II-C.3). In the first level optimization of (13), the GP searches for the best model and its parameters {k_l, k_pq, F_l, G_pq, λ, b}. The second level optimization, as defined by the equality constraint, is responsible for finding the optimum projection matrix P^* given a complete model, and is realized as a generalized eigenvalue problem producing the modal matrix P^* and its associated spectral matrix Λ^*.

We refer to the proposed method as the supervised embedding with heterogeneous proximities (SEHP). The main steps for estimating the classification error of a potential model are summarized in Table I. This error represents the inverse of the chromosome fitness Q(·) for the GP search. The sequencing of the main GP operations is outlined in Table II. Details regarding the GP configuration and the various operators we use are provided in Sections II-C.2 and II-C.3, while details on the classifier ψ, partitioning h, and other experimental parameters are given in Section III.
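Under the assumption that helpers such as solve_projections and laplacian from the earlier sketches are available, the fitness computation of Table I can be outlined as follows; build_W stands in for the chromosome's matrix functions, LDA plays the role of ψ, and all names here are illustrative rather than taken from the paper:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def chromosome_fitness(X, y, build_W, lam, b, h=3):
    """h-fold CV estimate of the average validation error Q for one chromosome."""
    errors = []
    for tr, va in StratifiedKFold(n_splits=h, shuffle=True).split(X, y):
        F, G = build_W(X[tr], y[tr])                   # friend/enemy matrices from the chromosome
        P = solve_projections(X[tr], laplacian(F), laplacian(G), lam, b)
        clf = LinearDiscriminantAnalysis().fit(X[tr] @ P, y[tr])
        errors.append(1.0 - clf.score(X[va] @ P, y[va]))   # validation error E_i
    return float(np.mean(errors))                      # average validation error Q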

TABLE I
MAIN STEPS OF SEHP FOR THE COMPUTATION OF THE MODEL FITNESS, GIVEN A CHROMOSOME AND TRAINING SAMPLES

1) Input: Training samples X, their labels y, CV partitioning h, and chromosome {k_l, k_pq, F_l, G_pq, λ, b}.
2) Initialization:
   a) Partition X into h different folds, with each containing a row submatrix T_j for training and its complementary V_j for validation, for j = 1, . . . , h.
   b) Set i = 1.
3) Main loop: While i ≤ h do:
   a) Calculate blocks F_l and G_pq for all l, p, q ∈ {1, . . . , c} with p < q, using the provided k_l, k_pq, F_l, and G_pq, as well as the training samples T_i and their corresponding labels from y.
   b) Form the overall friend F and enemy G matrices, according to (11) and Section II-C.2.
   c) Calculate their Laplacian versions F̃ and G̃ as in (12).
   d) Determine the optimal projection matrix P^*, by solving the generalized eigenvalue problem of (13), using the given λ and retaining b eigenvectors.
   e) Train the classifier ψ using the training embeddings T_i P^* and their corresponding labels from y.
   f) Test the classifier ψ using the validation embeddings V_i P^* and their corresponding labels from y.
   g) Record this classification error as E_i ≡ E(ψ, T_i P^*, V_i P^*, y).
   h) Set i = i + 1.
4) Output: Return the fitness for the given chromosome, defined through the average validation error Q({k_l, k_pq, F_l, G_pq, λ, b}) = (1/h) ∑_{i=1}^h E_i.

TABLE II
DESCRIPTION OF THE MAIN OPERATIONS OF THE GP SEARCH MODULE USED IN SEHP FOR MODEL SELECTION

1) Input: Training samples X, their labels y, number of generations GPgen, population size GPpop, as well as other GP parameters described in Table V.
2) Initialization:
   a) Create randomized and seeded individuals encoded as chromosomes ξ_i ≡ {k_l, k_pq, F_l, G_pq, λ, b}_i, for i = 1, . . . , GPpop.
   b) Compute the fitness value Q(ξ_i) for each chromosome.
   c) Set t = 1.
3) Main loop: While t ≤ GPgen do:
   a) Select parents stochastically, and apply crossover and mutation to reproduce children ξ_i^c, for i = 1, . . . , GPpop − GPelite.
   b) Compute the fitness value Q(ξ_i^c) for each child, and insert it into the population.
   c) Keep the best GPpop − GPelite individuals from the newly formed population, and also transfer the top GPelite ones from the old population.
   d) Set t = t + 1.
4) Output: Locate the fittest individual ξ^* in the final population, and extract its corresponding parameters k_l, k_pq, F_l, G_pq, λ, and b to constitute the optimum model.

2) Embedding Model Encoding: A critical part of a GP-based search is the way potential solutions are represented as members of the population. As discussed above, each solution in our system is fully represented as a composite chromosome of the form {k_l, k_pq, F_l, G_pq, λ, b} with l, p, q ∈ {1, . . . , c} and p < q. The numerical parameters k_l, k_pq, λ, and b are represented in the chromosome using standard integer and real-valued encoding following their numerical ranges in (13). The similarity functions f_l, g_pq : R^m × R^m → R from (10) can be encoded using syntax tree structures, with each individual function represented by a separate tree within the chromosome. The simplest way for encoding each tree is to mix certain basic operations (from a function set that includes scalar multiplication, addition, power, and so on) and operands (from a terminal set that includes variables and numerical constants). Each tree can then correspond to a potential function that receives an input of two m-length vectors and returns a scalar measure of their similarity.

However, here we adopt a different design for the trees, by employing a very compact matrix representation that processes entire sets of samples and generates the entire blocks F_l and G_pq of (11) directly. Specifically, instead of representing as trees each of the functions f_l and g_pq, and then applying them to all sample pairs that belong to each block F_l and G_pq, we encode matrix functions of broader syntax. These are the F_l : R^{n_l × m} → R^{n_l × n_l} for each friend block F_l, and the G_pq : R^{n_p × m} × R^{n_q × m} → R^{n_p × n_q} for each enemy block G_pq. Both can be used by splitting the n × m data matrix X into c different n_l × m matrices X_l, each with rows the samples from the lth class, and then directly calculating all blocks of (11) via F_l = F_l(X_l) and G_pq = G_pq(X_p, X_q).

TABLE III
EXAMPLE OF THE VECTOR AND MATRIX FORMS FOR VARIOUS EXISTING METRICS. THE t_vec AND t_mat COLUMNS ARE THE RESPECTIVE TIMES (s) TAKEN TO CALCULATE ALL n × n VALUES OF ALL SAMPLE PAIRS FROM A DATASET WITH n = 1000 SAMPLES AND m = 20 FEATURES

Function | Vector Notation | Matrix Notation | t_vec | t_mat
Cosine | x_i^T x_j / (‖x_i‖_2 ‖x_j‖_2) | A(XX^T)A, A = (I ∘ XX^T)^{-1/2} | 17.60 | 0.06
Correlation | x_i^T C_m x_j / (‖C_m x_i‖_2 ‖C_m x_j‖_2) | A(X C_m X^T)A, A = (I ∘ X C_m X^T)^{-1/2}, C_m = I − (1/m) 1_m 1_m^T | 21.51 | 0.05
Kullback-Leibler divergence | (x_i − x_j)^T log_2(x_i) + (x_j − x_i)^T log_2(x_j) | A + A^T − (B + B^T), A = (B ∘ I) 1_n 1_n^T, B = X log_2(X^T) | 39.59 | 0.15
Euclidean squared | ‖x_i − x_j‖_2^2 | A + A^T − 2XX^T, A = (XX^T ∘ I) 1_n 1_n^T | 12.28 | 0.08

Although the functions f_l and F_l are algebraically equivalent, and similarly for g_pq and G_pq, they have substantially different syntactic forms. The advantages gained by the proposed matrix form are, initially, the compactness and parsimony of representation. This allows the genetic operators to create more complex expressions more easily and with fewer primitives. Table III contrasts the two forms for some well known (dis)similarity functions, where the vector form processes sample pairs (x_i, x_j), while the matrix one processes all the samples corresponding to the rows of a given X. Another advantage is the computational efficiency, as the calculations of the vector form are repeated for all sample pairs independently, and this incurs redundant operations. An example of the speed differences is shown in Table III. To demonstrate the expressive power of this representation, we also show in Table IV how it can be used to express various existing methods reviewed in Section II-A. As these methods are homogeneous, there is only one F_l and only one G_pq for all blocks to be defined.
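The speed gap in Table III comes from vectorization. As a small illustration (not the paper's benchmark code, and timings will vary), the cosine similarity can be computed with the explicit vector form or with its matrix form A(XX^T)A:

import numpy as np

def cosine_vector_form(X):
    """Pairwise cosine similarity via the vector form of Table III (explicit loop)."""
    n = X.shape[0]
    S = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]))
    return S

def cosine_matrix_form(X):
    """Same similarity via the matrix form A (X X^T) A with A = (I o X X^T)^{-1/2}."""
    K = X @ X.T
    A = np.diag(1.0 / np.sqrt(np.diag(K)))
    return A @ K @ A

The loop-free form computes each norm once rather than for every pair, which is the redundancy of the vector form referred to above.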

To benefit from this representation and allow the construction of meaningful models, we employ a variation of GP, referred to as strongly typed GP (STGP) [41], where each terminal is assigned a specific data type and each function has a return type determined by its arguments. STGP imposes strict syntactic constraints and rules that govern the construction of potential models from all genetic operators. To exploit the genetic information in the population and generate new chromosomes from fitter parents, we use two types of crossover: 1) a one-point crossover that swaps the trees representing each matrix function F_l or G_pq on either side of the break-point; and 2) a selective crossover that randomly selects a tree in the same category from both parents, and swaps selected branches of the same return type. The corresponding numerical parameters k_l and k_pq for each tree are swapped accordingly for both crossovers. To explore new areas in the solution space and to maintain genetic diversity, we use mutation that randomly selects tree branches from any category and replaces them with newly generated ones of the same return type. For the numerical parameters, random integers and real values are selected from their corresponding ranges k_l ∈ {1, . . . , n_l − 1}, k_pq ∈ {1, . . . , min(n_p, n_q)}, λ ∈ [10^{-2}, 10^4], and b ∈ {1, . . . , m}. The starting population is initialized using a ramped half-and-half setting, where half of the trees are of full depth GPdepth and half of varying depth up to GPdepth. To speed up the search we also hybridize a fraction GPhyb of the GPpop chromosomes in the starting population, by randomly seeding their trees with existing similarity functions, in addition to the completely randomly generated ones. The main parameters of the GP implementation we employ are summarized in Table V.

TABLE IV
EXISTING HOMOGENEOUS SSE METHODS EXPRESSED USING THE EMPLOYED MATRIX NOTATION, WITH SINGLE F_l AND SINGLE G_pq. N_l AND N_pq CORRESPOND TO THE n_l × n_l AND n_p × n_q BINARY NEIGHBOR MATRIX INDICATORS BETWEEN THE INTRA- AND INTERCLASS SAMPLES, RESPECTIVELY, WITH ENTRIES OF ONE DENOTING NEIGHBORING SAMPLES. M_l IS THE lth BLOCK OF THE RECONSTRUCTION COEFFICIENT MATRIX M IN (7), A = 1_{n_p} 1_{n_q}^T (I ∘ X_q X_q^T), B = (I ∘ X_p X_p^T) 1_{n_p} 1_{n_q}^T, AND C = X_p X_q^T

Method | F_l | G_pq
DNE [17] | 1_{n_l} 1_{n_l}^T ∘ N_l | 1_{n_p} 1_{n_q}^T ∘ N_pq
MMC [11] | (2/n_l) 1_{n_l} 1_{n_l}^T − (1/n) 1_{n_l} 1_{n_l}^T | (1/n) 1_{n_p} 1_{n_q}^T
OLPP-R [18] | (1/n_l) 1_{n_l} 1_{n_l}^T | β 1_{n_p} 1_{n_q}^T ⊘ (σ + (A + B − 2C) ⊘ (A + B)) ∘ N_pq
SONPP [7] | M_l + M_l^T − M_l^T M_l | 0
DONPP [20] | M_l + M_l^T − M_l^T M_l | β 1_{n_p} 1_{n_q}^T ∘ N_pq
ONPP-R [18] | M_l + M_l^T − M_l^T M_l | β 1_{n_p} 1_{n_q}^T ⊘ (σ + (A + B − 2C) ⊘ (A + B)) ∘ N_pq

TABLE V
MAIN PARAMETERS OF THE GP MODULE, ALONG WITH THEIR DESCRIPTIONS AND EXPERIMENTAL SETTING

Description | Parameter | Value
Maximum number of generations | GPgen | 100
Number of individuals in the population | GPpop | 20
Ratio of seeded individuals in the initial population | GPhyb | 0.7
Crossover rate | GPxov | 0.9
Mutation rate | GPmut | 0.1
Tournament size | GPtour | 7
Elite members copied intact to the next generation | GPelite | 1
Maximum depth of each tree | GPdepth | 8
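The generation loop of Table II, with the tournament selection, crossover/mutation rates and elitism of Table V, can be summarized schematically as below. The chromosome structure and the typed crossover and mutation operators are abstracted into callbacks; this is a sketch, not the GPTIPS-based implementation used by the authors:

import random

def tournament(scored, tour):
    """Pick the fittest of `tour` randomly sampled (error, chromosome) pairs."""
    return min(random.sample(scored, tour), key=lambda t: t[0])[1]

def evolve(population, fitness, crossover, mutate,
           generations=100, elite=1, tour=7, p_xov=0.9, p_mut=0.1):
    scored = [(fitness(c), c) for c in population]
    for _ in range(generations):
        scored.sort(key=lambda t: t[0])                  # lower CV error = fitter
        elites = scored[:elite]                          # GPelite members copied intact
        children = []
        while len(children) < len(population) - elite:
            pa, pb = tournament(scored, tour), tournament(scored, tour)
            child = crossover(pa, pb) if random.random() < p_xov else pa
            if random.random() < p_mut:
                child = mutate(child)
            children.append(child)
        scored = elites + [(fitness(c), c) for c in children]
    return min(scored, key=lambda t: t[0])[1]            # fittest model found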

3) Function and Terminal Primitives: The nodes of the trees encoding F_l and G_pq store different function or terminal primitives. The function and terminal sets have to be adequately rich to maximize the expressive power of the representation and allow the potential formation of diverse weight matrices W. The compactness of the proposed matrix encoding facilitates this because only a few basic function primitives are adequate for expressing a very wide set of block similarities.

Table VI summarizes the set of primitives comprising our function set. These include matrix addition (+), subtraction (−) and multiplication (×), Hadamard product (∘) and division (⊘), and other specialized operations, such as the generation of a centering matrix (center), and vectors or matrices of equal elements (ones), processing of the diagonal of a matrix (pdiag, invsqrt), and conversion of an arbitrary vector to a vector of probabilities (prob). We also include left (\) and right (/) matrix division implemented with (pseudo) inversion, transposition (trans), matrix row-sum generation (sumdiag), generation of LLE style reconstruction coefficient matrices (recoef) and element-wise matrix exponential (exp) for Gaussian similarities or base two logarithm (log2) for the divergence. The terminal set is indirectly defined by the above function set, and consists of matrices and submatrices involving the original data as well as various numerical constants and parameters.

TABLE VI
SET OF FUNCTION PRIMITIVES USED TO COMPOSE THE SYNTAX TREES FOR F_l AND G_pq. C IS THE OUTPUT MATRIX EVALUATED FROM INPUT A (FOR ARITY 1) OR INPUTS A AND B (FOR ARITY 2)

Symbol | Arity | Description
+, −, × | 2 | Matrix addition, subtraction, and multiplication.
∘, ⊘ | 2 | Hadamard product and division.
/ | 2 | Right matrix division. If B is square, C = A/B = AB^{-1}; C = AB^†, otherwise.
\ | 2 | Left matrix division. If A is square, C = A\B = A^{-1}B; C = A^†B, otherwise.
trans | 1 | Matrix transpose.
ones | 2 | Vector or matrix of ones given by C = 1_n 1_m^T ∈ R^{n×m}.
sumdiag | 1 | If A ∈ R^{n×m}, C = I ∘ (A 1_m 1_n^T); if A ∈ R^n, C = I ∘ (A 1_n^T).
pdiag | 1 | Extracts the diagonal of a given square matrix as C = (I ∘ A) 1_n.
recoef | 1 | Reconstruction coefficient matrix [5].
invsqrt | 1 | Diagonal matrix C = (I ∘ A)^{-1/2}.
center | 1 | Centering matrix C = I − (1/m) 1_m 1_m^T.
exp, log2 | 1 | Element-wise exponential c_ij = e^{a_ij} and logarithm c_ij = log_2(a_ij).
prob | 1 | c_ij = (a_ij − min(a_i)) / ((a_i^T − min(a_i) 1_m^T) 1_m).
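To make the primitives of Table VI concrete, a few of them can be realized as simple NumPy functions. These are reconstructions from the table's descriptions, offered as an illustration rather than the authors' implementation:

import numpy as np

def sumdiag(A):
    """Row sums of A placed on a diagonal: I o (A 1 1^T)."""
    return np.diag(np.atleast_2d(A).sum(axis=1))

def pdiag(A):
    """Diagonal of a square matrix extracted as a vector: (I o A) 1."""
    return np.diag(A).copy()

def invsqrt(A):
    """Diagonal matrix (I o A)^{-1/2}."""
    return np.diag(1.0 / np.sqrt(np.diag(A)))

def center(m):
    """Centering matrix I - (1/m) 1 1^T."""
    return np.eye(m) - np.ones((m, m)) / m

def prob(a):
    """Convert a vector to a probability-like vector, following Table VI."""
    shifted = np.asarray(a, dtype=float) - np.min(a)
    return shifted / shifted.sum()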

TABLE VII
DATASET SUMMARY SHOWING THE TOTAL NUMBER OF CLASSES (c), TOTAL NUMBER OF SAMPLES (n), DIMENSIONALITY (m), CLASS IMBALANCE RATIO (γ), AS WELL AS THE CLASSIFICATION ERRORS OBTAINED BY THE TWO REGULAR CLASSIFIERS SVM AND DT

Dataset c n m γ SVM DT

dna 3 3186 180 0.31 19.10 13.29

usps 10 9298 256 0.08 13.50 20.09

mfeat 10 2000 649 0.11 15.49 24.93

mnist 10 10 000 778 0.10 19.25 28.75

reuters 11 2959 69 593 0.09 12.98 24.13

thyroid 2 215 5 0.43 10.71 7.92

diabetes 2 768 8 1.87 23.30 28.90

flare-solar 2 1066 9 1.23 55.25 37.53

heart 2 270 13 1.25 15.56 25.19

german 2 1000 20 0.43 30.20 29.00

cancer 2 569 30 1.68 2.46 7.55

musk1 2 476 166 1.30 22.89 24.83

musk2 2 2000 166 0.20 17.13 16.82

secom 2 1567 341 0.07 15.38 11.00

gisette 2 7000 5000 1.00 15.49 24.93

arcene 2 200 10 000 0.79 15.00 27.00

dexter 2 600 20 000 1.00 6.83 12.17

dorothea 2 1150 100 000 0.11 9.74 11.86

III. EXPERIMENTATIONS AND RESULTS

For performance evaluation we use the 18 labeled datasets summarized in Table VII. The reuters data is extracted from the Reuters-21578 Text Categorization Test Collection, whereas the remaining datasets are downloaded from the UCI Machine Learning and the LIBSVM repositories. The table also includes individual dataset characteristics, such as the class imbalance ratio γ that is the average ratio of the number of in-class to the number of out-class samples. Datasets with γ ≈ 1 are perfectly balanced, while those with large or small γ values are imbalanced. To show the potential class separabilities of the original features for each of these datasets, we list their classification error rates obtained with two regular classifiers based on linear support vector machines (SVM) and DT.

The proposed algorithm SEHP¹ is compared with seven existing methods, including MMC, DNE, SONPP, DONPP, OLPP-R, ONPP-R, and a kernel extension of FDA (KFDA). Relation features are used as the input to these embedding methods, for datasets with features-to-samples ratios m/n higher than 8%, to reduce the computational complexity. The procedure for generating the relation features is discussed in Section II-B, and is based on processing Euclidean distances with multidimensional scaling computed according to [30]. It is applied to all datasets apart from the reuters one, for which we use the cosine similarity. This is because it is common in text analysis to capture similarities between documents with scaled word co-occurrences calculated with cosine relation features.

¹Our code is publicly available at http://pcwww.liv.ac.uk/~goulerma/software/sehp.zip.


TABLE VIII
MEDIAN CLASSIFICATION ERRORS USING LDA AND 10-FOLD CV, WITH THE BEST PERFORMANCES SHOWN BOLDFACED AND THE SECOND BEST UNDERLINED. THE MEDIAN VALUES OF THE DIMENSIONALITIES b ARE SHOWN IN PARENTHESES. THE BOTTOM THREE ROWS CORRESPOND TO THE STATISTICAL TESTS

Dataset SEHP KFDA MMC DNE SONPP DONPP OLPP-R ONPP-R

dna 10.60 (179) 36.51 (2) 10.62 (8) 11.37 (16) 17.43 (180) 17.43 (128) 24.11 (164) 11.56 (38)

usps 10.20 (9) 27.59 (9) 12.47 (8) 14.21 (54) 10.71 (45) 13.04 (184) 15.95 (54) 10.76 (54)

mfeat 12.29 (36) 61.01 (9) 13.13 (141) 13.17 (103) 12.90 (140) 12.99 (41) 14.51 (121) 12.61 (101)

mnist 13.00 (286) 30.48 (9) 13.88 (71) 14.15 (65) 13.48 (63) 21.25 (122) 14.20 (63) 13.98 (38)

reuters 5.12 (138) 5.43 (9) 6.30 (111) 5.71 (113) 8.20 (141) 6.39 (161) 8.20 (61) 5.55 (61)

thyroid 10.22 (4) 13.74 (1) 12.56 (1) 12.09 (4) 12.56 (4) 25.58 (4) 13.02 (4) 21.86 (2)

diabetes 23.84 (8) 32.29 (1) 24.48 (5) 23.57 (6) 24.09 (8) 27.49 (7) 25.13 (7) 27.86 (6)

flare-solar 31.95 (2) 42.46 (1) 33.70 (9) 34.62 (2) 34.32 (9) 31.46 (8) 37.34 (7) 38.56 (2)

heart 13.33 (9) 22.90 (1) 15.62 (5) 14.81 (11) 16.30 (13) 16.67 (10) 17.41 (9) 14.44 (10)

german 28.10 (2) 36.20 (1) 33.50 (11) 29.87 (13) 27.40 (19) 31.66 (19) 38.00 (19) 31.40 (15)

cancer 1.94 (29) 3.33 (1) 3.69 (29) 3.51 (23) 3.69 (30) 4.56 (10) 4.75 (22) 4.27 (12)

musk1 20.61 (24) 26.33 (1) 22.48 (4) 20.80 (15) 22.06 (26) 30.25 (15) 23.11 (140) 23.74 (6)

musk2 11.81 (14) 19.03 (1) 13.25 (16) 12.60 (19) 12.64 (131) 13.65 (307) 13.77 (131) 13.45 (19)

secom 19.47 (3) 45.88 (1) 34.91 (12) 27.19 (86) 22.02 (243) 40.08 (73) 21.70 (241) 26.55 (73)

gisette 4.23 (197) 31.68 (1) 4.78 (698) 4.47 (532) 4.85 (698) 4.65 (351) 4.76 (421) 4.65 (281)

arcene 12.50 (51) 14.92 (1) 15.50 (1) 13.50 (17) 12.00 (175) 13.50 (2) 14.75 (25) 12.50 (40)

dexter 5.17 (125) 8.64 (1) 5.67 (537) 7.32 (36) 6.00 (537) 5.17 (487) 5.83 (163) 5.33 (433)

dorothea 9.56 (14) 24.28 (1) 10.73 (1) 10.60 (1) 10.57 (16) 10.03 (17) 10.61 (9) 10.46 (5)

wins n/a 17 15 15 12 13 18 15

losses n/a 0 0 0 1 1 0 0

draws n/a 1 3 3 5 4 0 3

TABLE IX
MEDIAN CLASSIFICATION ERRORS USING SVM AND DT, FOR SELECTED DATASETS AND THE BEST METHODS FROM TABLE VIII. MEDIAN VALUES OF THE DIMENSIONALITIES b ARE SHOWN IN PARENTHESES

Classifier Method DNA Reuters Thyroid Flare-Solar Secom

SVM SEHP 20.04 (169) 9.83 (108) 10.26 (4) 36.7 (5) 12.08 (36)

SVM SONPP 23.45 (179) 9.94 (201) 10.97 (4) 53.01 (3) 26.47 (49)

SVM DONPP 24.58 (2) 17.30 (201) 10.32 (5) 52.53 (3) 12.94 (44)

DT SEHP 11.36 (127) 33.9 (112) 2.36 (12) 35.17 (5) 9.06 (15)

DT SONPP 50.10 (177) 34.41 (141) 2.36 (5) 35.46 (8) 18.63 (2)

DT DONPP 25.12 (2) 38.25 (201) 4.22 (5) 36.21 (1) 10.31 (1)

For all the methods, a 10-fold CV with random folds is employed for model assessment. For each of the ten training-test partitions, model selection is conducted as discussed in Section II-C.1, by applying a 3-fold CV (h = 3) on the training instances. Regarding model selection for the competing methods, a grid search is used to fine-tune their model parameters with resolutions that produce good performances in reasonable computation times. To measure the quality of a given model we implement the classifier ψ using linear discriminant analysis (LDA) for efficiency, and for simplicity since we use complex features, but we also report results using other classifiers, such as SVM and DT.

The GP part of SEHP is implemented using a modified version of GPTIPS.² The various genetic operators and encoding are implemented as explained in Section II-C. The descriptions of the main parameters used to configure the GP search module are given in Table V, along with their numerical settings that remained fixed for all reported experiments. Overall, we did not observe high sensitivity in the search with respect to most of these variables, apart from the parameter GPhyb of the proposed hybridization scheme. A rather high ratio of 0.7 for the seeded members in the initial population allows for an effective search to be rapidly deployed. This also supports a relatively small number of generations GPgen = 100 and population size GPpop = 20, and significantly accelerates the search. Lowering GPhyb leads to longer evolution times and requires larger populations. In addition, a much higher crossover probability GPxov = 0.9 is required than the mutation GPmut = 0.1. This is because the starting population contains both completely random and seeded models, and because of this initial building block diversity, the exploitation aspect of the search sustained by the crossover becomes relatively more important than the mutation-based exploration aspect.

²http://sites.google.com/site/gptips4matlab.

Table VIII reports the median values of the LDA model assessment classification errors over the 10-fold CV for all datasets and methods. The performances of the various competing methods rank differently for different datasets. The proposed SEHP provides the best performance for 14 out of the 18 datasets, whereas for the remaining four it has the second smallest error. To further analyze the resulting rates, we use the Wilcoxon signed-rank test to assess the statistical significance of the performance differences, by considering the null hypothesis that the methods perform similarly at a 5% level.
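For reference, a signed-rank comparison of this kind can be carried out with scipy.stats.wilcoxon on the paired fold-wise errors of two methods for a given dataset; the numbers below are placeholders, not values from Table VIII:

import numpy as np
from scipy.stats import wilcoxon

# placeholder fold-wise error rates for SEHP and one competing method on one dataset
sehp_err  = np.array([10.1, 10.6, 9.8, 11.0, 10.4, 10.9, 10.2, 10.7, 10.5, 10.3])
other_err = np.array([11.4, 12.0, 11.1, 12.3, 11.6, 12.2, 11.5, 11.9, 11.8, 11.7])

stat, p = wilcoxon(sehp_err, other_err)
significant = p < 0.05   # reject the null "the methods perform similarly" at the 5% level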


TABLE X
EXAMPLES OF EVOLVED SIMILARITY MEASURES FOR VARIOUS TWO-CLASS DATASETS, ALONG WITH THEIR CORRESPONDING HYPERPARAMETERS, WHERE C = I − (1/m) 1_m 1_m^T AND X̃^T = [X_1^T X_2^T . . . X_c^T]

Dataset | Friend Blocks F_1, F_2 | Enemy Block G_12 | Parameters [λ, b]
thyroid | F_1 = invsqrt(A) A invsqrt(A), A = X_1 X_1^T, k_1 = 58; F_2 = (1/n_2) 1_{n_2} 1_{n_2}^T, k_2 = 24 | G_12 = A_12, k_12 = 41, A = B + B^T − B^T B − sumdiag((1/n) 1_n 1_n^T), B = recoef(X̃) | [1.32, 4]
diabetes | F_1 = sumdiag(X_1), k_1 = 436; F_2 = sumdiag(X_2), k_2 = 436 | G_12 = A_12, k_12 = 436, A = sumdiag(X̃) | [1.03, 8]
flare-solar | F_1 = A C A^T, A = log2(X_1) + log2(prob(X_1^T))^T, k_1 = 76; F_2 = sumdiag(X_2), k_2 = 27 | G_12 = A_12, k_12 = 181, A = (1/n) 1_n 1_n^T | [0.18, 2]
heart | F_1 = A + A^T − 2 X_1 X_1^T, A = (X_1 ∘ X_1) 1_m 1_{n_1}^T, k_1 = 127; F_2 = recoef(X_2) X_2 C X_2^T, k_2 = 47 | G_12 = A_12, k_12 = 102, A = B + B^T − (D + D^T), B = pdiag(D) 1_n^T, D = X̃ log2(prob(X̃)^T) | [0.31, 9]
german | F_1 = 0_{n_1 × n_1}; F_2 = sumdiag(X_2), k_2 = 15 | G_12 = A_12, k_12 = 24, A = sumdiag(X̃) | [15.00, 15]
cancer | F_1 = sumdiag(X_1), k_1 = 77; F_2 = sumdiag(X_2), k_2 = 18 | G_12 = A_12, k_12 = 161, A = exp(2 X̃ log2(X̃) ⊘ sumdiag(X̃ C X̃^T 1_n^T) ⊘ B), B = I − (1/n) 1_n 1_n^T | [1.47, 27]
musk1 | F_1 = sumdiag(X_1), k_1 = 19; F_2 = invsqrt(1_{n_2} 1_{n_2}^T X_2 C X_2^T), k_2 = 4 | G_12 = A_12, k_12 = 19, A = exp(1_n 1_n^T) | [0.69, 24]
musk2 | F_1 = A + A^T − (B + B^T), A = pdiag(B) 1_n^T, B = X_1 log2(prob(X_1)^T), k_1 = 484; F_2 = D H D − sumdiag((1/n_2) 1_{n_2} 1_{n_2}^T), D = invsqrt(H), H = X_2 X_2^T, k_2 = 53 | G_12 = A_12, k_12 = 110, A = X̃ C X̃^T | [0.23, 14]
secom | F_1 = (1_{n_1} 1_{n_1}^T)/(1_{n_1} 1_{n_1}^T), k_1 = 1; F_2 = (1_{n_2} 1_{n_2}^T)/(1_{n_2} 1_{n_2}^T), k_2 = 1 | G_12 = 1_{n_1} 1_{n_2}^T, k_12 = 1 | [1.00, 3]
gisette | F_1 = sumdiag(X_1), k_1 = 147; F_2 = recoef(X̃), k_2 = 91 | G_12 = A_12, k_12 = 19, A = recoef(X̃) + recoef(X̃) | [2.01, 146]
arcene | F_1 = sumdiag(2 X_1) ⊘ recoef(X_1), k_1 = 18; F_2 = sumdiag(X_2), k_2 = 65 | G_12 = A_12, k_12 = 45, A = 1_n 1_n^T | [0.01, 51]
dexter | F_1 = invsqrt(I − (1/n_1) 1_{n_1} 1_{n_1}^T), k_1 = 254; F_2 = 0_{n_2 × n_2} | G_12 = A_12, k_12 = 197, A = 1_n 1_n^T | [2.09, 125]
dorothea | F_1 = 1_{n_1} 1_{n_1}^T, k_1 = 3; F_2 = 1_{n_2} 1_{n_2}^T, k_2 = 37 | G_12 = 1_{n_1} 1_{n_2}^T, k_12 = 3 | [2.00, 14]

the null hypothesis that the methods perform similarly at a 5% level. The last three rows of Table VIII show the dataset frequency for SEHP being better, worse or equivalent to each of the competing methods. Wins/losses count the number of datasets for which SEHP is significantly better/worse, over those cases where its error rate is lower or higher than the competing method. Draws count the number of datasets for which SEHP cannot be deemed significantly better or worse than the competing method. Overall, SEHP holds an 83% fraction of wins over the remaining losses and draws. Table VIII also shows that although some competing methods find a smaller embedding dimensionality b than SEHP, those yield worse classification performances. Although all methods have an equal chance of searching and adapting b to

improve classification rates, SEHP seems to have a much better capability in making use of the extra embedding dimensions. This is owed to the heterogeneous proximity information it can employ to generate discriminating embeddings that better preserve the intra- and interclass relationships.
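For concreteness, a minimal sketch of how such a per-dataset win/loss/draw decision could be computed with a Wilcoxon signed-rank test at the 5% level is given below; the two error arrays are placeholder values, not results from the paper.

# Hedged illustration of the significance test described above, on paired per-fold errors.
import numpy as np
from scipy.stats import wilcoxon

err_sehp  = np.array([0.112, 0.120, 0.108, 0.115, 0.119, 0.110, 0.113, 0.117, 0.109, 0.114])
err_other = np.array([0.131, 0.128, 0.125, 0.140, 0.122, 0.135, 0.129, 0.133, 0.126, 0.138])

stat, p = wilcoxon(err_sehp, err_other)   # null hypothesis: the two methods perform similarly
if p >= 0.05:
    outcome = "draw"                      # difference not significant at the 5% level
elif np.median(err_sehp) < np.median(err_other):
    outcome = "win"                       # SEHP significantly better on this dataset
else:
    outcome = "loss"                      # SEHP significantly worse on this dataset
print(outcome, round(p, 4))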

We also report experiments realizing the classifier ψ in (13) as an SVM or DT, instead of LDA. Table IX contains the 10-fold CV classification errors for five representative datasets, for SEHP and the best two methods in terms of wins from Table VIII. Changing the classifier does not affect the competitiveness of SEHP over the other methods. In terms of which classifier the user should choose for SEHP, it should be noted that the GP search is only aware of the error E(·), and thus, regardless of how the classifier forms


its classification boundaries, the search will try to discover a model that generates embeddings that facilitate the classifier. This flexibility in the choice of classifier is not available for the other SSE methods, because they rely on fixed models that cannot adapt to different classifiers. In terms of actual classification rates, for either SEHP or the other SSE methods, no specific classifier can claim superiority over all datasets. For example, as seen in Tables VIII and IX, LDA, SVM, and DT can rank differently on different datasets. Overall, a simple linear classifier, such as LDA, could be used with SEHP for efficiency, but also to force the search to create a model with more powerful embeddings, because a more sophisticated classifier would dilute the contribution of the embeddings to the final classification performance. Nevertheless, more complex classification boundaries can be used when the application requires it, and this applies to SEHP as well as to all other SSE methods. Comparing the embedding classification performance in Tables VIII and IX with the baseline rates of the original features in the last two columns of Table VII, it can be seen that for most of the datasets a small set of SEHP embeddings offers comparable or better classification performance than the original high-dimensional features, and sometimes much better performance (e.g., the mnist and gisette datasets).
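The point that the GP fitness only depends on the error E(·) can be illustrated with the hypothetical helper below, where ψ may be any estimator exposing a fit/predict interface; the embeddings Z are assumed to come from the learned model, and the SVM kernel choice is an assumption, not a setting from the paper.

# Sketch only: the error E(.) is classifier-agnostic, so psi can be swapped freely.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def embedding_error(Z_train, y_train, Z_test, y_test, psi):
    psi.fit(Z_train, y_train)
    return np.mean(psi.predict(Z_test) != y_test)

classifiers = {"LDA": LinearDiscriminantAnalysis(),
               "SVM": SVC(),                    # kernel left at its default as an assumption
               "DT": DecisionTreeClassifier()}
# errors = {name: embedding_error(Z_tr, y_tr, Z_te, y_te, clf)
#           for name, clf in classifiers.items()}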

Table X contains examples of actual similarity matrices evolved by SEHP for the binary classification datasets. SEHP generates a diverse set of models to characterize the intra- and interclass proximities. For example, the same function sumdiag is used for computing F1, F2 and G12 of the diabetes data. For the cancer data, sumdiag is used to obtain F1 and F2, whereas a more complex function is evolved for G12. For the remaining datasets, different functions are evolved to compute F1, F2 and G12. The proximity functions generated by SEHP, such as those shown in Table X, can be used to guide practitioners in the field in designing new SSE methods suitable for their own class of problems. Fig. 4 provides tree views of the resulting proximity formulae for selected datasets. Specifically, it displays the homogeneous similarity functions in Fig. 4(a), the simple interclass ones in Fig. 4(b), and the fully heterogeneous measures in Fig. 4(c), where all blocks are estimated differently.
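Since (11) is not reproduced here, the following is only a schematic sketch, under assumptions, of how per-class friend blocks F_l and per-pair enemy blocks G_pq such as those in Table X could be arranged into a single block-structured weighting matrix for class-ordered data; friend_fn and enemy_fn are hypothetical stand-ins for the evolved similarity functions.

# Schematic assembly of a block-structured proximity matrix from friend/enemy blocks.
import numpy as np

def assemble_blocks(class_data, friend_fn, enemy_fn):
    sizes = [X.shape[0] for X in class_data]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    n = offsets[-1]
    W = np.zeros((n, n))
    for l, Xl in enumerate(class_data):
        sl = slice(offsets[l], offsets[l + 1])
        W[sl, sl] = friend_fn(l, Xl)                         # intraclass block F_l
    for p in range(len(class_data)):
        for q in range(p + 1, len(class_data)):
            sp = slice(offsets[p], offsets[p + 1])
            sq = slice(offsets[q], offsets[q + 1])
            G = enemy_fn(p, q, class_data[p], class_data[q]) # interclass block G_pq
            W[sp, sq] = G
            W[sq, sp] = G.T                                  # keep the matrix symmetric
    return W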

The complexity of the proposed method is principally related to the function evaluations needed by the GP search to identify the optimal model ξ* = {k_l, k_pq, F_l, G_pq, λ, b}, as shown in Table II and the first level optimization of (13). The total number of evaluations depends on the evolved generations and the population size, and is equal to ĜPgen × (GPpop − GPelite) + GPpop, as the starting population requires a complete set of GPpop evaluations. By ĜPgen (≤ GPgen) we denote the actual number of generations the GP evolves, because the search can terminate earlier than the maximum of GPgen generations. This is due to population convergence, which is satisfied when the predictive performance of the best individual does not show improvement. Within each such function evaluation, the fitness of the given model is calculated as summarized in Table I and the second level optimization of (13). This involves three main substages. One is the computation of the matrix W in (11), which normally has

Fig. 4. Examples of tree representations of evolved proximity functions that correspond to different friend and enemy blocks for the two-class datasets: (a) diabetes, (b) cancer, and (c) arcene.

an overall complexity of O(n²), for n samples. In our case, this requires the evaluation of all F_l and G_pq blocks from the corresponding chromosome trees, with l, p, q ∈ {1, ..., c} and p < q. In the implementation, this is efficient because of the proposed matrix encoding scheme described in Section II-C.2. The number of actual arithmetic operations for each block depends on the number of samples n_l in each lth class for F_l, and on n_p and n_q for G_pq, as well as on the algebraic complexities of the trees encoding each F_l and G_pq. The second and most computationally expensive substage is the generalized eigen-decomposition, which is of complexity O(n³), whereas the final LDA training is of O(nb²). The complexity of the grid search for the competing methods depends on the number of model parameters to be trained and their discrete values, as these control the overall number of model evaluations. Each such evaluation involves the same substages as in the GP fitness evaluation above. Based on Table V, the GP requires fewer than 2000 evaluations in each experiment. For the grid-search-based methods, more than two parameters exceed this number dramatically even for moderate discretization; for example, just three parameters with 20 grid points each require 8000 evaluations. The number of GP evaluations required to reach the optimum model could be reduced further by adopting independent restart strategies, such as the ones proposed in [26].
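As a back-of-the-envelope check of these budgets, the small sketch below reproduces the two counts under the quoted settings GPgen = 100 and GPpop = 20, with GPelite assumed to be 1 (the actual elite count is listed in Table V).

# Comparison of the evaluation budgets discussed above, under the stated assumptions.
GPgen, GPpop, GPelite = 100, 20, 1
gp_evals = GPgen * (GPpop - GPelite) + GPpop   # 1920, i.e., fewer than 2000 evaluations
params, grid_points = 3, 20
grid_evals = grid_points ** params             # 8000 evaluations for a 3-parameter grid
print(gp_evals, grid_evals)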

Regarding the actual training times taken using our implementations of SEHP and of the competing methods, we observed the following. Depending on the dataset, the time for a single evaluation from SEHP varies between less than a second and 9 min, while for the competing methods it varies between less than a second and 30 min. Therefore, depending on the total number of evaluations required, the total model optimization


time varies between 20 s and 10 h for SEHP, whereas it varies between 2 s and over 200 h for the competing methods. In general, among the competing methods, MMC and SONPP require the least time, as they only require one parameter, that is, the dimensionality b. DNE follows with two parameters, b and the number of nearest neighbors k, whereas DONPP, OLPP-R, and ONPP-R are the slowest ones as they require more parameters, such as b, k, and the mixing coefficient β. The overall time analysis for all datasets ranks SEHP in the middle of the methods in terms of speed. For example, for the large-scale dataset reuters, it takes around 21.5 min to evolve an optimal embedding model and its parameters, whereas it takes similar times for MMC (19.7 min) and SONPP (21.3 min) and longer for DNE (36.1 min) to find their optimal parameters. The remaining three competing methods, DONPP (57.5 h), OLPP-R (45.5 h) and ONPP-R (49.1 h), are considerably more expensive. In addition, among the studied datasets, SEHP achieves the most satisfactory improvement for the cancer dataset in 33.8 min. This is longer than the model selection times of MMC (2.9 s), DNE (17.0 min) and SONPP (54.1 s), but much shorter than DONPP (4.3 h), OLPP-R (3.5 h) and ONPP-R (4.2 h). Of course, it is possible to reduce the model selection times for the latter methods by employing coarser grids, but experiments show that this often leads to unsatisfactory performances. A similar observation is made when using a genetic algorithm with a coarser encoding and/or smaller populations and fewer generations. This is because, in both cases, only a very small portion of the model space is scanned. The GP search of SEHP is much less affected by short evolution times, as the inclusion of symbolically defined heterogeneous measures and hybrid information provides it with more degrees of freedom and flexibility. The lack of exhaustive search over all numerical parameters is likely compensated by the ability to adapt the model shape, and to do so for each class and class pair independently. Another point worth mentioning is that, given a dataset, computing embeddings with good class separability may require trying several different embedding methods with validation sets and selecting the best one; this setup, however, may take much longer than SEHP.

IV. CONCLUSION

We introduced an original and very generic SSE algorithm, referred to as SEHP, for dimensionality reduction and feature generation in high-dimensional labeled data. SEHP is very different from previous methods because it is not based on fixed and homogeneous models for evaluating proximities between samples. Instead, it acts as an engine capable of automatically generating a complete model without any prior knowledge of the dataset. The final model controls all heterogeneous proximities between friend instances within each class, and also between enemy pairs from different classes. Thus, the model better captures the local geometry and class structure of the data, and robust embeddings with enhanced class separability can be achieved. The model construction relies on an efficient evolutionary optimization method that carries out a powerful, symbolic and data adaptive search. It is

based on a novel and very compact syntactic representation of the potential solutions, with a matrix encoding that brings notable computational and expressive advantages. SEHP is entirely data driven, does not require user parameterization, and serves as a very attractive alternative to standard methods when no prior information is available for the problem at hand. Experimental comparisons with multiple datasets and existing methods showed that it is very competitive at reasonable computation times.

ACKNOWLEDGMENT

The authors would like to thank the three anonymous reviewers for their valuable comments and suggestions.

REFERENCES

[1] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer-Verlag, 2002.

[2] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Ann. Eugenics, vol. 7, no. 2, pp. 179–188, Sep. 1936.

[3] H. S. Seung and D. D. Lee, “The manifold ways of perception,” Science, vol. 290, no. 5500, pp. 2268–2269, Dec. 2000.

[4] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.

[5] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.

[6] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Comput., vol. 15, no. 6, pp. 1373–1396, Jun. 2003.

[7] E. Kokiopoulou and Y. Saad, “Orthogonal neighborhood preserving projections: A projection-based dimensionality reduction technique,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2143–2156, Dec. 2007.

[8] X. He and P. Niyogi, “Locality preserving projections,” in Proc. Conf. Adv. Neural Inf. Process. Syst., 2003, pp. 1–8.

[9] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, Mar. 2005.

[10] D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces for face recognition,” IEEE Trans. Image Process., vol. 15, no. 11, pp. 3608–3614, Nov. 2006.

[11] H. Li, T. Jiang, and K. Zhang, “Efficient and robust feature extraction by maximum margin criterion,” IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 157–165, Jan. 2006.

[12] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.

[13] Q.-X. Gao, H. Xu, Y.-Y. Li, and D.-Y. Xie, “Two-dimensional supervised local similarity and diversity projection,” Pattern Recognit., vol. 43, no. 10, pp. 3359–3363, Oct. 2010.

[14] M. Sugiyama, “Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis,” J. Mach. Learn. Res., vol. 8, no. 5, pp. 1027–1061, May 2007.

[15] H. Zhang, W. Deng, J. Guo, and J. Yang, “Locality preserving and global discriminant projection with prior information,” Mach. Vis. Appl., vol. 21, no. 4, pp. 577–585, Jun. 2010.

[16] T. Mu, J. Jiang, Y. Wang, and J. Y. Goulermas, “Adaptive data embedding framework for multi-class classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1291–1303, Aug. 2012.

[17] W. Zhang, X. Xue, Z. Sun, Y. Guo, and H. Lu, “Optimal dimensionality of metric space for classification,” in Proc. 24th Int. Conf. Mach. Learn., Jun. 2007, pp. 1135–1142.

[18] E. Kokiopoulou and Y. Saad, “Enhanced graph-based dimensionality reduction with repulsion Laplaceans,” Pattern Recognit., vol. 42, no. 11, pp. 2392–2402, Nov. 2009.

[19] S. Zhang, “Enhanced supervised locally linear embedding,” Pattern Recognit. Lett., vol. 30, no. 13, pp. 1208–1218, Oct. 2009.


[20] T. Zhang, K. Huang, X. Li, J. Yang, and D. Tao, “Discriminative orthogonal neighborhood-preserving projections for classification,” IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 40, no. 1, pp. 253–263, Feb. 2010.

[21] W. K. Wong and H. T. Zhao, “Supervised optimal locality preserving projection,” Pattern Recognit., vol. 45, no. 1, pp. 186–197, Jan. 2012.

[22] T. Mu, J. Goulermas, J. Tsujii, and S. Ananiadou, “Proximity-based frameworks for generating embeddings from multi-output data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2216–2232, Nov. 2012.

[23] S. Moon and H. Qi, “Hybrid dimensionality reduction method based on support vector machine and independent component analysis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 749–761, May 2012.

[24] C. Lin, I. W. Tsang, and X. Dong, “Laplacian embedded regression for scalable manifold regularization,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 902–915, Jun. 2012.

[25] D. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural Comput., vol. 8, no. 7, pp. 1341–1390, 1996.

[26] A. Auger and N. Hansen, “Performance evaluation of an advanced local search evolutionary algorithm,” in Proc. IEEE Congr. Evol. Comput., vol. 2, Sep. 2005, pp. 1777–1784.

[27] A. B. Jimenez, J. L. Lazaro, and J. R. Dorronsoro, “Finding optimal model parameters by deterministic and annealed focused grid search,” Neurocomputing, vol. 72, nos. 13–15, pp. 2824–2832, Aug. 2009.

[28] Y. Wang, Y. Jiang, Y. Wu, and Z. Zhou, “Spectral clustering on multiple manifolds,” IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1149–1161, Jul. 2011.

[29] A. Goldberg, X. Zhu, A. Singh, Z. Xu, and R. Nowak, “Multi-manifold semi-supervised learning,” J. Mach. Learn. Res., vol. 5, pp. 169–176, Jul. 2009.

[30] E. Pekalska, P. Paclik, and R. P. W. Duin, “A generalized kernel approach to dissimilarity-based classification,” J. Mach. Learn. Res., vol. 2, no. 3, pp. 175–211, Mar. 2002.

[31] E. Pekalska, R. Duin, and P. Paclik, “Prototype selection for dissimilarity-based classifiers,” Pattern Recognit., vol. 39, no. 2, pp. 189–208, Feb. 2006.

[32] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA, USA: MIT Press, 1992.

[33] J. R. Koza, D. Andre, F. H. Bennett, and M. Keane, Genetic Programming III: Darwinian Invention and Problem Solving. San Mateo, CA, USA: Morgan Kaufmann, Apr. 1999.

[34] P. Gutierrez, C. Hervas-Martinez, and F. Martinez-Estudillo, “Logistic regression by means of evolutionary radial basis function neural networks,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 246–263, Feb. 2011.

[35] C.-K. Goh, E.-J. Teoh, and K. C. Tan, “Hybrid multiobjective evolutionary design for artificial neural networks,” IEEE Trans. Neural Netw., vol. 19, no. 9, pp. 1531–1548, Sep. 2008.

[36] H. Zhao, “A multi-objective genetic programming approach to developing Pareto optimal decision trees,” Decision Support Syst., vol. 43, no. 3, pp. 809–826, Apr. 2007.

[37] P. Espejo, S. Ventura, and F. Herrera, “A survey on the application of genetic programming to classification,” IEEE Trans. Syst., Man, Cybern., Part C, Appl. Rev., vol. 40, no. 2, pp. 121–144, Mar. 2010.

[38] M. Zhang, X. Gao, and W. Lou, “A new crossover operator in genetic programming for object classification,” IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 37, no. 5, pp. 1332–1343, Oct. 2007.

[39] L. Guo, D. Rivero, J. Dorado, C. R. Munteanu, and A. Pazos, “Automatic feature extraction using genetic programming: An application to epileptic EEG classification,” Expert Syst. Appl., vol. 38, no. 8, pp. 10425–10436, Aug. 2011.

[40] E. Rodriguez-Martinez, J. Goulermas, T. Mu, and J. Ralph, “Automatic induction of projection pursuit indices,” IEEE Trans. Neural Netw., vol. 21, no. 8, pp. 1281–1295, Aug. 2010.

[41] D. J. Montana, “Strongly typed genetic programming,” Evol. Comput., vol. 3, no. 2, pp. 199–230, 1995.

Eduardo Rodriguez-Martinez received the M.Sc. degree in information and intelligence engineering and the Ph.D. degree from the University of Liverpool, Liverpool, U.K., in 2008 and 2012, respectively.

He is currently with the Department of Electrical Engineering and Electronics, Universidad Autonoma Metropolitana, Mexico City, Mexico. His current research interests include the fields of evolutionary computation, feature extraction, pattern recognition, and robotics.

Tingting Mu (M’05) received the B.Eng. degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 2004, and the Ph.D. degree in electrical engineering and electronics from the University of Liverpool, Liverpool, U.K., in 2008.

She is currently a Lecturer with the School of Electrical Engineering, Electronics and Computer Science, University of Liverpool. Her current research interests include machine learning, data analysis and mathematical modeling, with applications to information retrieval, text mining, and bioinformatics.

Jianmin Jiang received the B.Sc. degree from the Shandong Mining Institute, Shandong, China, in 1982, the M.Sc. degree from the China University of Mining and Technology, Beijing, China, in 1984, and the Ph.D. degree from the University of Nottingham, Nottingham, U.K., in 1994.

He is currently a Professor of media computing with the University of Surrey, Surrey, U.K. His current research interests include image/video processing in compressed domain, digital video coding, stereo image coding, medical imaging, computer graphics, machine learning and AI applications in digital media processing, and retrieval and analysis.

Dr. Jiang is a Chartered Engineer, a fellow of IEE and RSA, a member of EPSRC College, an EU FP-6/7 Evaluator, and a Consulting Professor with the Chinese Academy of Sciences and Southwest University, Beijing, China. He received the Outstanding Overseas Chinese Young Scholarship Award (Type-B) from the Chinese National Sciences Foundation in 2000, and the Outstanding Overseas Academics Award from the Chinese Academy of Sciences in 2004.

John Yannis Goulermas (M’98–SM’10) received the B.Sc. degree (Hons.) in computation from the University of Manchester Institute of Science and Technology (UMIST), Manchester, U.K., in 1994, and the M.Sc. and Ph.D. degrees from the Control Systems Center, UMIST, in 1996 and 2000, respectively.

He is currently a Reader with the School of Electrical Engineering, Electronics and Computer Science, University of Liverpool, Liverpool, U.K. His current research interests include machine learning, combinatorial data analysis, data visualization and mathematical modeling, with applications to biomedical engineering, bioinformatics, industrial monitoring and security.