
J Sign Process Syst
DOI 10.1007/s11265-014-0937-5

Learning Incoherent Subspaces: Classification via Incoherent Dictionary Learning

Daniele Barchiesi · Mark D. Plumbley

Received: 15 January 2014 / Revised: 1 June 2014 / Accepted: 22 July 2014
© Springer Science+Business Media New York 2014

Abstract In this article we present the supervised iterative projections and rotations (S-IPR) algorithm, a method for learning discriminative incoherent subspaces from data. We derive S-IPR as a supervised extension of our previously proposed iterative projections and rotations (IPR) algorithm for incoherent dictionary learning, and we employ it to learn incoherent sub-spaces that model signals belonging to different classes. We test our method as a feature transform for supervised classification, first by visualising transformed features from a synthetic dataset and from the ‘iris’ dataset, then by using the resulting features in a classification experiment.

Keywords Subspace learning · Dictionary learning · Incoherent subspaces · Supervised classification · Feature transform · Sparse approximation

1 Introduction: Classification and Feature Transforms

Supervised classification is one of the classic problems in machine learning where a system is designed to discriminate the class of an observed signal, having previously observed representative examples from the considered classes [5].

This work has been supported by the Platform Grant EP/K009559/1 and the Leadership Fellowship EP/G007144/1, both from the UK Engineering and Physical Sciences Research Council (EPSRC).

D. Barchiesi (✉) · M. D. Plumbley
Centre for Digital Music, Queen Mary University of London, E1 4NS, London, UK
e-mail: [email protected]

Typically, a classification algorithm consists of a training phase where class-specific models are learned from labelled samples, followed by a testing phase where unlabelled data are classified by comparison with the learned models. Both training and testing comprise various stages. Firstly, we observe a signal that measures a process of interest, such as the recording of a sound or image, or a log of the temperatures in a particular geographic area. Then, a set of features is extracted from the raw signals using signal processing techniques. This step is performed in order to reduce the dimensionality of the data and provide a new signal that allows generalisation among examples of the same class, while retaining enough information to discriminate between different classes.

Following the feature extraction step, a feature transform can be employed to further reduce the dimensionality of the data and to enhance discrimination between classes. Thus, classification benefits from feature transforms especially when features are not separable, that is, when it is not possible to optimise a simple function that maps features belonging to signals of a given class to the corresponding class. A further dimensionality reduction may be performed when dealing with high-dimensional signals (such as audio or high-resolution images) by fitting the parameters of global statistical distributions with features learned on portions of the signal. Models learned on different classes are finally compared, using a distance metric, to the model learned from an unlabelled signal, which is typically assigned to the nearest class.

1.1 Traditional Algorithms for Feature Transform

Two of the main feature transform techniques include principal component analysis (PCA) [10] and Fisher’s linear discriminant analysis (LDA) [5].


1.1.1 PCA

Let $\{x_m \in \mathbb{R}^N\}_{m=1}^M$ be a set of vectors containing features extracted from $M$ training signals. The goal of PCA is to learn an orthonormal set of basis functions $\{\phi_k \in \mathbb{R}^N\}_{k=1}^N$, such that $\|\phi_k\|_2 = 1$ and $\langle \phi_i, \phi_j \rangle = 0 \;\forall i \neq j$, that are placed along the columns of a so-called dictionary $\Phi \in \mathbb{R}^{N \times N}$. The bases are optimised from the data to identify their principal components, that is, the sub-spaces that retain the maximum variance of the features.

To compute the dictionary, the eigenvalue decomposition of the outer product

$XX^T = Q \Lambda Q^T$    (1)

is first calculated, where $X$ contains the features $x_m$ along its columns. Then, the $L$ eigenvectors corresponding to the $L$ largest eigenvalues are selected from the matrix $Q$ and scaled to unit $\ell_2$ norm to form the dictionary $\Phi \in \mathbb{R}^{N \times L}$. A new set of transformed features $y_{\mathrm{PCA}} = \Phi \Phi^T x$ is computed by projecting the data onto the sub-space spanned by the columns of $\Phi$ (that is, onto the $L$-dimensional principal sub-space). This operation reduces the dimensionality of the features by projecting them onto a linear subspace embedded in $\mathbb{R}^N$. It is an unsupervised technique that does not exploit knowledge about the classes associated with the training set, but implicitly relies on the assumption that the principal component directions encode relevant differences between classes.
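As an illustration of the transform just described, the following is a minimal NumPy sketch (our own, not the authors' Matlab implementation): it computes the eigenvalue decomposition of Eq. 1, retains the $L$ leading eigenvectors and projects the features onto the principal sub-space.

```python
import numpy as np

def pca_transform(X, L):
    """Learn an L-dimensional principal sub-space from the columns of X (N x M)
    and return the dictionary Phi (N x L) and the projected features (N x M)."""
    # Eigenvalue decomposition of the outer product XX^T (Eq. 1)
    eigvals, Q = np.linalg.eigh(X @ X.T)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in descending order
    Phi = Q[:, order[:L]]                    # L eigenvectors with the largest eigenvalues
    Phi /= np.linalg.norm(Phi, axis=0)       # unit l2-norm columns (already unit for eigh)
    Y = Phi @ (Phi.T @ X)                    # project onto the principal sub-space
    return Phi, Y

# Example: 100 random 5-dimensional features projected onto a rank-2 sub-space
X = np.random.randn(5, 100)
Phi, Y = pca_transform(X, L=2)
```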

1.1.2 LDA

In contrast, LDA is a supervised method for feature transform whose objective is to explicitly maximise the separability of classes in the transformed domain.

Let $\Omega_p$ be a set indexing features extracted from data belonging to the $p$-th class, and $|\Omega_p|$ be its cardinality. Let

$\bar{x}_p \overset{\mathrm{def}}{=} \frac{1}{|\Omega_p|} \sum_{m \in \Omega_p} x_m$    (2)

be the $p$-th class feature centroid, and $\bar{x} \overset{\mathrm{def}}{=} \frac{1}{M} \sum_{m=1}^{M} x_m$ the centroid of the features extracted from the entire training dataset. The between-classes scatter matrix

$S_b \overset{\mathrm{def}}{=} \sum_{p=1}^{P} |\Omega_p| \, (\bar{x}_p - \bar{x})(\bar{x}_p - \bar{x})^T$    (3)

is defined to measure the mutual distances between the centroids of different classes, while the within-classes scatter matrix

$S_w \overset{\mathrm{def}}{=} \sum_{p=1}^{P} \sum_{m \in \Omega_p} (x_m - \bar{x}_p)(x_m - \bar{x}_p)^T$    (4)

quantifies the distances between features belonging to the same class. Let $W$ be a linear transform matrix. To maximise an objective function $J(W) \overset{\mathrm{def}}{=} \frac{|W^T S_b W|}{|W^T S_w W|}$ that promotes features belonging to the same class to be near each other and far away from features belonging to other classes, the eigenvalue decomposition of the matrix

$S_w^{\dagger} S_b = Q \Lambda Q^T$    (5)

is computed, and the features $x$ are projected onto the space spanned by its $(P-1)$ eigenvectors corresponding to the largest $(P-1)$ eigenvalues (that is, onto a space of dimensionality equal to the number of classes minus one).

LDA explicitly seeks to enhance the discriminative power of features by optimising the objective $J$.
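A minimal NumPy sketch of this LDA transform (again ours, for illustration only): it accumulates the scatter matrices of Eqs. 3 and 4 and projects onto the leading $P-1$ eigenvectors of $S_w^{\dagger} S_b$ from Eq. 5.

```python
import numpy as np

def lda_transform(X, labels):
    """X is N x M (features in columns), labels is a length-M array of class indices.
    Returns the projection matrix W (N x (P-1)) and the transformed features."""
    classes = np.unique(labels)
    P = len(classes)
    x_bar = X.mean(axis=1, keepdims=True)                     # global centroid
    Sb = np.zeros((X.shape[0], X.shape[0]))
    Sw = np.zeros_like(Sb)
    for p in classes:
        Xp = X[:, labels == p]
        xp_bar = Xp.mean(axis=1, keepdims=True)               # class centroid (Eq. 2)
        Sb += Xp.shape[1] * (xp_bar - x_bar) @ (xp_bar - x_bar).T   # Eq. 3
        Sw += (Xp - xp_bar) @ (Xp - xp_bar).T                       # Eq. 4
    # Eigen-decomposition of Sw^dagger Sb (Eq. 5); pinv handles a singular Sw
    eigvals, Q = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    W = Q[:, order[:P - 1]].real                              # leading P-1 eigenvectors
    return W, W.T @ X
```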

1.2 Supervised PCA

Related works that extend PCA include the supervised PCA (S-PCA) proposed by Barshan et al. [4]. S-PCA is based on the theory of reproducing kernel Hilbert spaces (RKHS) (that are spaces of functions which satisfy certain properties and map elements from an arbitrary set to the set of complex numbers) [1], and on the so-called Hilbert-Schmidt independence criterion (HSIC) [8]. The HSIC is used to estimate the statistical dependence of two random variables, based on the fact that this quantity is related to the correlation of functions belonging to their respective RKHS. While HSIC is defined in terms of the probability density function of the two random variables, empirical estimates of HSIC can be obtained from finite sequences of their realisations. The empirical HSIC can be used in turn to construct an objective function that maximises the dependence between the two variables. Hence, this strategy is adopted within the context of classification to maximise the statistical dependence between a transformed feature $y_{\mathrm{S\text{-}PCA}}$ and its corresponding class $c$.

In practice, S-PCA differs from PCA in that it calculates the eigenvalue decomposition of a matrix $R$ defined as follows:

$R \overset{\mathrm{def}}{=} X H L H X^T$    (6)

where $H \overset{\mathrm{def}}{=} I - ee^T$ is a so-called centring matrix¹ and $L \overset{\mathrm{def}}{=} cc^T$ is the kernel matrix of the class variable, constructed by computing the outer product of the vectors resulting from assigning different numerical values to each class.

¹ Here $e$ is a vector of ones.
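In code, S-PCA differs from PCA only in the matrix that is eigen-decomposed. The sketch below is our own reading of Eq. 6, assuming numerical class labels for $c$ and the usual $1/M$ normalisation in the centring matrix:

```python
import numpy as np

def spca_transform(X, labels, L):
    """Supervised PCA sketch: X is N x M, labels is length-M, L is the target rank."""
    N, M = X.shape
    e = np.ones((M, 1))
    H = np.eye(M) - (e @ e.T) / M             # centring matrix (with a 1/M factor assumed)
    c = labels.reshape(-1, 1).astype(float)   # numerical class labels
    Lk = c @ c.T                              # kernel matrix of the class variable
    R = X @ H @ Lk @ H @ X.T                  # R = X H L H X^T (Eq. 6)
    eigvals, Q = np.linalg.eigh(R)
    Phi = Q[:, np.argsort(eigvals)[::-1][:L]] # L leading eigenvectors
    return Phi, Phi.T @ X
```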


1.3 Other Related Work

The union of incoherent sub-spaces model proposed by Schnass and Vandergheynst [12] employs a very similar intuition to the one that inspired our proposed method, and models features belonging to different classes using incoherent subspaces. Other methods for supervised dimensionality reduction include metric learning algorithms [14], sufficient dimensionality reduction [9] and Bair’s supervised principal components [2].

Manifold learning techniques, which are used to model non-linear data, are reviewed by Van Der Maaten et al. [13]. Finally, the sparse sub-space clustering technique developed by Elhamifar and Vidal [7] is aimed at identifying vectors that belong to a union of sub-spaces, and hence applies concepts from sparse approximation to clustering.

1.4 Paper Organisation

The method proposed in this paper is aimed at learning discriminative sub-spaces that allow dimensionality reduction, while at the same time enhancing the separability between classes. It is derived from our previous work on learning incoherent dictionaries for sparse approximation [3].

The incoherent dictionary learning problem will be introduced in Section 2, while Section 3 will contain the main contribution of this paper, which consists in learning incoherent subspaces for classification. Numerical experiments are presented in Section 4, and conclusions are drawn in Section 5.

2 Incoherent Dictionary Learning

A sparse approximation of a signal $x \in \mathbb{R}^N$ is a linear combination of $K \geq N$ basis functions $\{\phi_k \in \mathbb{R}^N\}_{k=1}^K$ called atoms, described by:

$x \approx \hat{x} = \sum_{k=1}^{K} \alpha_k \phi_k$    (7)

where the vector of coefficients $\alpha$ contains a small number of non-zero components, corresponding to a small number of atoms actively contributing to the approximation $\hat{x}$. Given a signal $x$ and a dictionary, various algorithms have been proposed to find a sparse approximation that minimises the residual error $\|x - \hat{x}\|_2$ [6].

Dictionary learning aims at optimising a dictionary $\Phi$ for sparse approximation given a set of training data. It is an unsupervised technique that can be thought of as a generalisation of PCA, as both methods learn linear subspaces that minimise the approximation error of the signals. Dictionary learning, however, is generally more flexible than PCA because it can be employed to learn more general non-orthogonal over-complete dictionaries [11].

2.1 The Incoherent Dictionary Learning Problem

Dictionaries for sparse approximation have important intrinsic properties that describe the relations between their atoms, like the mutual coherence $\mu(\Phi) = \max_{i \neq j} |\langle \phi_i, \phi_j \rangle|$, defined as the maximum inner product between any two different atoms. The goal of incoherent dictionary learning is to learn atoms that are well adapted to sparsely approximate a set of training signals, and that are at the same time mutually incoherent [3].
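For reference, the mutual coherence and the lower bound used later in this section can be computed as follows; this is a generic NumPy sketch, not code from [3]:

```python
import numpy as np

def mutual_coherence(Phi):
    """Maximum absolute inner product between any two distinct unit-norm atoms."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)   # normalise atoms to unit l2 norm
    G = np.abs(Phi.T @ Phi)                   # magnitudes of the Gram matrix
    np.fill_diagonal(G, 0.0)                  # ignore self inner products
    return G.max()

def coherence_lower_bound(N, K):
    """Lower bound on the mutual coherence of an N x K dictionary (K > N)."""
    return np.sqrt((K - N) / (N * (K - 1)))
```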

Given a set of $M$ training signals contained in the columns of the matrix $X \in \mathbb{R}^{N \times M}$, the incoherent dictionary learning problem can be expressed as:

$\Phi^\star, A^\star = \arg\min_{\Phi, A} \|X - \Phi A\|_F$    (8)
such that $\mu(\Phi) \leq \mu_0$
$\|\alpha_m\|_0 \leq S \;\forall m$

where $\mu_0$ is a fixed mutual coherence constraint, the $\ell_0$ pseudo-norm $\|\cdot\|_0$ counts the number of non-zero components of its argument, $S$ is a fixed number of active atoms, and $\alpha_m$ denotes the $m$-th column of $A$. Setting a specific value of the parameter $S$ indicates that signals live on (or near) a union of subspaces of dimension $S$, and hence depends on the type of observed signals and on the number of atoms $K$ (a large number of atoms that cover different directions in an $N$-dimensional space is likely to widen the set of signals that may be approximated using such a union of sub-spaces). To obtain a dictionary with minimal mutual coherence, the parameter $\mu_0$ can be set to the lower bound on the mutual coherence of an $N \times K$ dictionary, $\mu_0 = \sqrt{(K - N)/(N(K - 1))}$ [3]. This implies that less coherent dictionaries necessarily have a smaller number of atoms, determining a tradeoff between the parameters $S$ and $\mu_0$ which ultimately needs to be investigated according to the problem at hand.

Algorithms for (incoherent) dictionary learning generally follow an alternating optimisation heuristic, iteratively updating $\Phi$ and $A$ until a stopping criterion is met. In the case of the iterative projections and rotations (IPR) algorithm [3], a dictionary de-correlation step is added after updating the dictionary in order to satisfy the mutual coherence constraint.

Given $X$, fixed $\mu_0$ and $S$, and a stopping criterion (such as a maximum number of iterations), the optimisation of Eq. 8 is tackled by iteratively performing the following steps:

• Sparse coding: fix $\Phi$ and compute the matrix $A$ using a suitable sparse approximation method.


• Dictionary update: fix $A$ and update $\Phi$ using a suitable method for dictionary learning.

• Dictionary de-correlation: given $X$, $\Phi$ and $A$, update the dictionary $\Phi$ to reduce its mutual coherence below the level $\mu_0$.

2.2 The Iterative Projections And Rotations Algorithm

The IPR algorithm has been proposed in order to solve the dictionary de-correlation step, while ensuring that the updated dictionary provides a sparse approximation with low residual norm, as indicated by the objective function in Eq. 8 [3].

The IPR algorithm requires the calculation of the Gram matrix $G = \Phi^T \Phi$, which contains the inner products between any two atoms in the dictionary. $G$ is iteratively projected onto two constraint sets, namely the structural constraint set $\mathcal{K}_{\mu_0}$ and the spectral constraint set $\mathcal{F}$. The former is the set of symmetric square matrices with unit diagonal values and off-diagonal values with magnitude smaller than or equal to $\mu_0$:

$\mathcal{K}_{\mu_0} \overset{\mathrm{def}}{=} \{K \in \mathbb{R}^{K \times K} : K = K^T,\; k_{i,i} = 1,\; \max_{i > j} |k_{i,j}| \leq \mu_0\}.$

The latter is the set of symmetric positive semidefinite square matrices with rank smaller than or equal to $N$:

$\mathcal{F} \overset{\mathrm{def}}{=} \{F \in \mathbb{R}^{K \times K} : F = F^T,\; \mathrm{eig}(F) \geq 0,\; \mathrm{rank}(F) \leq N\}$

where the operator $\mathrm{eig}(\cdot)$ returns the vector of eigenvalues of its argument.

Starting from the Gram matrix of an initial dictionary $\Phi$, the IPR method iteratively performs the following operations.

• Projection onto the structural constraint set. The projection $K = P_{\mathcal{K}_{\mu_0}}(G)$ can be obtained by:

1. setting $k_{i,i} = 1$,
2. limiting the off-diagonal elements so that, for $i \neq j$,

$k_{i,j} = \mathrm{Limit}(g_{i,j}, \mu_0) = \begin{cases} g_{i,j} & \text{if } |g_{i,j}| \leq \mu_0 \\ \mathrm{sgn}(g_{i,j})\,\mu_0 & \text{if } |g_{i,j}| > \mu_0 \end{cases}$    (9)

• Projection onto the spectral constraint set and factorisation. The projection $F = P_{\mathcal{F}}(G)$ and subsequent factorisation are obtained by:

1. calculating the eigenvalue decomposition (EVD) $G = Q \Lambda Q^T$,
2. thresholding the eigenvalues by keeping only the $N$ largest positive ones:

$[\mathrm{Thresh}(\Lambda, N)]_{i,i} = \begin{cases} \lambda_{i,i} & \text{if } i \leq N \text{ and } \lambda_{i,i} > 0 \\ 0 & \text{if } i > N \text{ or } \lambda_{i,i} \leq 0 \end{cases}$

where the eigenvalues in $\Lambda$ are ordered from the largest to the smallest. Following this step, at most $N$ eigenvalues of the Gram matrix are different from zero,
3. factorising the projected Gram matrix into the product $G = \Phi^T \Phi$ by setting:

$\Phi = \Lambda^{1/2} Q^T$    (10)

where $\Lambda \in \mathbb{R}^{N \times K}$ is the eigenvalue matrix restricted to its first $N$ rows.

• Dictionary rotation. Rotate the dictionary $\Phi$ to align it to the training set by solving the problem:

$W^\star = \arg\min_{W W^T = I} \|X - W \Phi A\|_F.$    (11)

The optimal rotation matrix can be calculated by:

1. computing the sample covariance between the observed signals and their approximations $C \overset{\mathrm{def}}{=} (\Phi A) X^T$,
2. calculating the SVD of the covariance $C = U \Sigma V^T$,
3. setting the optimal rotation matrix to $W^\star = V U^T$,
4. rotating the dictionary $\Phi \leftarrow W^\star \Phi$.

More details about the IPR algorithm can be found in [3], including details of its computational cost.
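As a rough illustration, one de-correlation iteration (structural projection, spectral projection with factorisation, and rotation) can be sketched as follows; this is a simplified rendering of the steps above, not the reference implementation of [3]:

```python
import numpy as np

def ipr_iteration(Phi, X, A, mu0):
    """One simplified IPR de-correlation iteration for an N x K dictionary Phi,
    given training data X (N x M) and sparse coefficients A (K x M)."""
    N, K = Phi.shape
    G = Phi.T @ Phi
    # 1) Projection onto the structural constraint set (Eq. 9)
    np.fill_diagonal(G, 1.0)
    mask = np.abs(G) > mu0
    np.fill_diagonal(mask, False)
    G[mask] = np.sign(G[mask]) * mu0
    # 2) Projection onto the spectral constraint set and factorisation (Eq. 10)
    eigvals, Q = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1]
    eigvals, Q = eigvals[order], Q[:, order]
    lam = np.maximum(eigvals[:N], 0.0)            # keep at most N positive eigenvalues
    Phi = np.sqrt(lam)[:, None] * Q[:, :N].T      # Phi = Lambda^(1/2) Q^T, shape N x K
    # 3) Dictionary rotation (Eq. 11): orthogonal Procrustes alignment to the data
    C = (Phi @ A) @ X.T
    U, _, Vt = np.linalg.svd(C)
    return (Vt.T @ U.T) @ Phi                     # Phi <- W* Phi
```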

3 Learning Incoherent Subspaces

The IPR algorithm learns a dictionary where all the atoms are mutually incoherent. Therefore, given any two disjoint sets $\Omega \cap \Gamma = \emptyset$ that identify non-overlapping collections of atoms, the sub-dictionaries $\Phi_\Omega$, $\Phi_\Gamma$ are also mutually incoherent.

Starting from this observation, the main intuition driving the development of a supervised IPR (S-IPR) algorithm for classification is to learn mutually incoherent sub-dictionaries that approximate features from different classes of signals. The sub-dictionaries are in turn used to define incoherent sub-spaces, and features are projected onto these sub-spaces, yielding discriminative dimensionality reduction.

3.1 The Supervised IPR Algorithm

Let $\{c_m \in \mathcal{C}\}_{m=1}^M$, $\mathcal{C} = \{C_1, C_2, \ldots, C_P\}$, be a set of labels that identify the class of the vectors of features $x_m$, whose elements belong to a set $\mathcal{C}$ of $P$ possible classes. The columns of the matrix $X_p$ contain a selection of the features extracted from signals belonging to the $p$-th class.


To learn incoherent sub-dictionaries from the entire set of features, we must first cluster the atoms into different classes², and then only proceed with their de-correlation if they are assigned to different classes (while allowing coherent atoms to approximate features from the same class). To this aim, we employ the matrix $A$ to measure the contribution of every atom to the approximation of features belonging to each class.

Let $\alpha_p^k$ indicate the $k$-th row of the matrix $A_p$ containing the coefficients that contribute to the approximation of $X_p$, and $N_p$ indicate the number of signals belonging to the $p$-th class. A coefficient $\gamma_{k,p}$ is defined as:

$\gamma_{k,p} \overset{\mathrm{def}}{=} \frac{1}{N_p} \left\| \alpha_p^k \right\|_1,$    (12)

and every atom $\phi_k$ is associated with the class to which it maximally contributes, $p_k^\star = \arg\max_p \{\gamma_{k,p}\}$. Figure 1 depicts the structure of the matrices involved in Eq. 8, highlighting the vector used to calculate the coefficients $\gamma_{k,p}$.
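A short NumPy sketch of this clustering step (our own illustration of Eq. 12 and the assignment $p_k^\star$), assuming the class labels are given as an integer array:

```python
import numpy as np

def assign_atoms_to_classes(A, labels):
    """A is the K x M coefficient matrix, labels a length-M array of class indices.
    Returns gamma (K x P) as in Eq. 12 and the atom-to-class assignment p_star."""
    classes = np.unique(labels)
    K = A.shape[0]
    gamma = np.zeros((K, len(classes)))
    for j, p in enumerate(classes):
        Ap = A[:, labels == p]                               # coefficients approximating X_p
        gamma[:, j] = np.abs(Ap).sum(axis=1) / Ap.shape[1]   # (1/N_p) * l1 norm of each row
    p_star = classes[np.argmax(gamma, axis=1)]               # class of maximal contribution
    return gamma, p_star
```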

Grouping together atoms that have been assigned to the same class leads to a set of sub-dictionaries whose size and rank depend on the number of atoms for each class and on their linear dependence. As a general heuristic, if features corresponding to different classes do not occupy the same sub-space (according to the active elements in $A$), a full-rank dictionary $\Phi$ with $K \geq NP$ ensures that the $p_k^\star$ identify $P$ non-empty and disjoint sub-dictionaries $\{\Phi_p\}_{p=1}^P$.

Once the atoms have been clustered, the Gram matrix $G$ is computed and iteratively projected as in the method described in Section 2.2, with the difference that Eq. 9 is modified in order to only constrain the mutual coherence between atoms assigned to different classes:

$\mathrm{Limit}(g_{i,j}, \mu_0, p^\star) = \begin{cases} g_{i,j} & \text{if } |g_{i,j}| \leq \mu_0 \text{ or } p_i^\star = p_j^\star \\ \mathrm{sgn}(g_{i,j})\,\mu_0 & \text{if } |g_{i,j}| > \mu_0 \text{ and } p_i^\star \neq p_j^\star \end{cases}$    (13)

A further modification of the standard IPR algorithm presented in [3] concerns the update of the Gram matrix, which is performed by computing its element-wise average with the projection $K = P_{\mathcal{K}_{\mu_0}}(G)$ (rather than by using the projection alone). This heuristic has led to improved empirical results by preventing $G$ from changing too abruptly.
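The two modifications just described, the class-aware Limit function of Eq. 13 and the element-wise averaging of the Gram matrix with its projection, can be sketched as follows (our illustration, not Algorithm 1 itself):

```python
import numpy as np

def limit_supervised(G, mu0, p_star):
    """Class-aware projection of the Gram matrix onto the structural set (Eq. 13):
    only pairs of atoms assigned to different classes are constrained."""
    Kmat = G.copy()
    np.fill_diagonal(Kmat, 1.0)
    diff_class = p_star[:, None] != p_star[None, :]   # True for atoms in different classes
    clip = (np.abs(Kmat) > mu0) & diff_class
    Kmat[clip] = np.sign(Kmat[clip]) * mu0
    return Kmat

def update_gram(G, mu0, p_star):
    """S-IPR heuristic: average G with its projection to avoid abrupt changes."""
    return 0.5 * (G + limit_supervised(G, mu0, p_star))
```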

The complete supervised S-IPR method is summarised in Algorithm 1. Note that the mutual coherence $\mu_{p^\star}(\Phi) = \max_{p_i^\star \neq p_j^\star} |\langle \phi_i, \phi_j \rangle|$ indicated in this algorithm measures the inner product between any two atoms assigned to different classes, since atoms assigned to the same class are allowed to be mutually coherent.

² Note that the term cluster implies that at this stage the algorithm needs to make an unsupervised decision, since there is no a-priori reason to assign a given atom to any particular class.

3.2 Classification via Incoherent Subspaces

The S-IPR algorithm allows us to learn a set of sub-dictionaries $\{\Phi_p\}$ that contain mutually incoherent atoms. These cannot be directly used to define discriminative subspaces because, depending on $N$ and on the rank of each sub-dictionary, atoms belonging to disjoint sub-dictionaries might span identical subspaces. Instead, we fix a rank $Q \leq \lfloor N/P \rfloor$ (i.e., the integer part of the ratio $N/P$) and choose a collection of $Q$ linearly independent atoms from each sub-dictionary $\Phi_p$, using the largest values of $\gamma_{k,p}$ to define a picking order. Thus, we obtain a set $\{\Phi_p\}_{p=1}^P$ of incoherent sub-spaces of rank $Q$ embedded in the space $\mathbb{R}^N$, and use them to derive a feature transform for classification.

Each feature vector $x_m$ that belongs to the class $c_m$ is projected onto the relative subspace, yielding a set of transformed features $\{y_m\}_{m=1}^M$:

$y_m = \Phi_{c_m} \Phi_{c_m}^{\dagger} x_m$    (14)

where $\Phi^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of the matrix $\Phi$, and needs to be used in place of the transposition operator because the columns of $\Phi$ are in general not orthogonal.


Figure 1 Sketch of the matrix factorisation introduced in Eq. 8. The features $X_p$ belonging to class $p$ are approximated using the coefficients $A_p$. The vector $\alpha_p^k$ contains the coefficients that determine the contribution of the atom $\phi_k$ to the approximation of features belonging to the $p$-th class.

When an unlabelled signal is presented to the classifier, the corresponding vector of features $x$ is projected onto all the learned sub-spaces. Then, the nearest sub-space is chosen using a Euclidean distance measure, and the corresponding projection $y$ is used as the transformed feature:

$p^\star = \arg\min_p \left\| x - \Phi_p \Phi_p^{\dagger} x \right\|_2$    (15)

$y = \Phi_{p^\star} \Phi_{p^\star}^{\dagger} x$    (16)

The subspace index $p^\star$ can be directly used as an estimator of the class of the signal, $c^\star$. Alternatively, a simple k-nearest neighbour classifier can be employed on the transformed features, and a class can be inferred as:

$c^\star = \mathrm{knn}(y, Y, c)$    (17)

where $Y$ represents the matrix of training features after the transform stage. This latter approach is especially suitable when working with a large number of classes in a space of relatively small dimension, as in this case multiple classes might be assigned to the same subspace.
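In code, the transform of Eqs. 15 and 16 amounts to one pseudo-inverse projection per class followed by a residual comparison. The sketch below is ours and assumes the learned sub-dictionaries are given as a list of $N \times Q$ matrices:

```python
import numpy as np

def nearest_subspace_transform(x, sub_dicts):
    """Project x onto each sub-space spanned by the columns of Phi_p and keep the
    projection with the smallest Euclidean residual (Eqs. 15-16)."""
    best_p, best_y, best_err = None, None, np.inf
    for p, Phi_p in enumerate(sub_dicts):
        y = Phi_p @ (np.linalg.pinv(Phi_p) @ x)   # Phi_p Phi_p^dagger x
        err = np.linalg.norm(x - y)
        if err < best_err:
            best_p, best_y, best_err = p, y, err
    return best_p, best_y
```

The returned index can be used directly as the class estimate, or the projections can be passed to a k-nearest neighbour classifier as in Eq. 17.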

4 Numerical Experiments

4.1 Feature Visualisation

To illustrate the S-IPR algorithm for feature transform, we first run visualisation experiments depicting how different feature transform methods act on training and test data³.

4.1.1 Synthetic Data

Figure 2 displays a total of 1500 synthetic features in $\mathbb{R}^2$ belonging to 3 different classes that we generated for this experiment. For each class, first we draw values distributed uniformly in the interval $[-1, 1]$ and assign them to the first component of the features (the x coordinate). Then, we add Gaussian noise with variance 0.1 to the second component (the y coordinate), and we rotate the resulting data by the angles $\theta_0 = 0$, $\theta_1 = \pi/4$ and $\theta_2 = \pi/2$ for the 3 classes respectively. This way, features belonging to different classes are clustered along different one-dimensional sub-spaces of $\mathbb{R}^2$.

³ The Matlab code used to generate the results in this Section is available from https://github.com/danieleb/2014-SJSPS

Figure 3 displays the result of the application of feature transforms to the data depicted in Fig. 2 using subspaces of dimension 1 (with the exception of LDA, which projects the data onto a space of dimension $P - 1 = 2$). To generate the plots, we divided the data into a training set (displayed using the ‘+’ marker) and a test set (displayed using the ‘o’ marker). Samples were drawn in random order from the dataset and assigned to either the training set or the test set, with the former containing 70% of the total data and the latter containing the remaining 30%. Then, we applied feature transforms on the training set, thereby learning the transform operators, and applied them to the test set.
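The synthetic dataset can be reproduced along these lines; this is a NumPy re-sketch of the construction described above (the authors' original Matlab code is linked in the footnote), with 500 samples per class assumed to make up the 1500 total:

```python
import numpy as np

def synthetic_classes(samples_per_class=500, noise_var=0.1, seed=0):
    """Generate 2-D features clustered along rotated one-dimensional sub-spaces."""
    rng = np.random.default_rng(seed)
    angles = [0.0, np.pi / 4, np.pi / 2]               # one rotation angle per class
    X, labels = [], []
    for c, theta in enumerate(angles):
        x = rng.uniform(-1.0, 1.0, samples_per_class)                 # first component
        y = rng.normal(0.0, np.sqrt(noise_var), samples_per_class)    # Gaussian noise
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])               # rotation matrix
        X.append(R @ np.vstack([x, y]))
        labels.append(np.full(samples_per_class, c))
    return np.hstack(X), np.concatenate(labels)
```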

Figure 2 Synthetic data generated along one-dimensional subspaces of $\mathbb{R}^2$.

Figure 3 Feature transform applied to the synthetic data in Fig. 2. Different colours correspond to different classes, ‘+’ and ‘o’ markers represent samples taken from the training and test set respectively.

Starting from the top-left plot, we can observe that PCA identified the direction $x = y$ as the one-dimensional sub-space that contains most of the variance of the training set. However, given the type of dataset and the dimensionality reduction caused by PCA, features from all classes are overlapping, making this transform a poor choice for classification. Similar observations can be drawn from analysing the result of S-PCA, although this transform identifies the direction $y = 0$ as the one that leads to statistical dependence between the value of the transformed features in the training set and the relative class. LDA does not introduce any dimensionality reduction in this case, as it projects the features onto a space of dimension $P - 1 = 2$, leaving the original features unaltered. However, in the LDA plot we can appreciate the separation between training set and test set that is difficult to notice in the other plots.

Finally, the plot at the bottom-right corner of Fig. 3 displays the results of the S-IPR algorithm. In setting the parameters of S-IPR, we chose a 2 times over-complete dictionary, a number of active atoms equal to half the dimension of the data, and minimal mutual coherence. In the case considered here, this means $K = 4$, $S = 1$ and $\mu = \sqrt{(K - N)/(N(K - 1))} \approx 0.33$. As discussed in Section 3.1, S-IPR does not project whole sets of features onto a unique sub-space, but rather learns one sub-space for each class, and projects features onto the nearest sub-space. The result depicted here shows that three directions were identified, containing data from mostly one class each. Since incoherent dictionary learning is designed to learn atoms with minimal mutual coherence, the angles between the directions of the sub-spaces learned by S-IPR are approximately equal. Prior information regarding the directions of the data would allow us to relax the parameter $\mu$, and track more closely the directions of the three data classes.

4.1.2 Iris Dataset

Figure 4 displays a subset of the ‘iris’ dataset, a popular database that has been used extensively to test and benchmark classification algorithms. The original dataset contains measurements of the sepal length, sepal width, petal length and petal width of three species of iris, namely ‘setosa’, ‘versicolor’ and ‘virginica’. In this visualisation experiment we selected the first 3 features to be able to depict the data using three-dimensional scatter plots. From observing the distribution of the data in the feature space, we see that ‘setosa’ is relatively separated from the other two classes, while the features relative to ‘virginica’ and ‘versicolor’ substantially overlap, with only a few exemplars of ‘virginica’ being distinguishable due to large sepal length and petal length.

Figure 4 First three features of the ‘iris’ dataset depicting measurements of sepal length, sepal width and petal length of three iris species.

Figure 5 Feature transform applied to the iris data in Fig. 4. Different colours correspond to different classes, ‘+’ and ‘o’ markers represent samples taken from the training and test set respectively.

The results of feature transforms are depicted in Fig. 5. This time, we learn 2-dimensional subspaces from the 3-dimensional data points and plot the transformed features, along with the learned planes. We observe that PCA identifies a direction along a diagonal axis that follows the distribution of features displayed in Fig. 4. S-PCA, on the other hand, projects the features onto a horizontal plane that slightly enhances the separation between ‘versicolor’ and ‘virginica’ samples. LDA results in a projection where features belonging to the same class are closely clustered together, but fails to separate the classes ‘versicolor’ and ‘virginica’. Finally, the output of S-IPR displays three distinct sub-spaces associated with the three classes. As in the other plots, the separation between ‘versicolor’ and ‘virginica’ is far from perfect, however features from the two classes are mostly projected onto the respective sub-spaces. Features belonging to the ‘setosa’ class are mostly clustered together as a result of their projection onto the black sub-space, however we can note a few test samples that have been associated by the algorithm to the blue sub-space.

4.2 Classification

In the previous section, we have illustrated how the S-IPR algorithm is able to learn incoherent sub-spaces that model the distribution of features belonging to different classes. Here we evaluate S-IPR and the other feature transform algorithms in the context of supervised classification. To perform the classification, features are transformed using the methods already used for comparison in Section 4.1, by learning a transform operator on the training set and applying it to the test set. We use a 5-fold stratified cross-validation to classify all the features in a dataset during the test stage. This method produces 5 independent classification problems with a ratio between the number of training and test samples equal to 8:2. Once the features have been transformed, a k-nearest neighbour classifier with k = 5 is used to estimate a class.

We employ the datasets detailed in Table 1, and for each of them we evaluate the misclassification ratio, defined as the fraction of misclassified samples as a proportion of the total number of samples in the test set, averaged over the 5 independent classification problems created by the stratified cross-validation protocol.

Table 1 Datasets used in the classification evaluation of feature transform algorithms.

Name          N    P    M
Iris          4    3    150
Balance       4    3    625
Parkinsons   23    2    197
Sonar        60    2    208
USPS        256    3   1405

All the datasets can be downloaded from http://archive.ics.uci.edu/ml/datasets.html. Note that we only use a subset of the USPS dataset containing the digits 1, 3 and 8.

Figure 6 Misclassification ratio as a function of the rank of the subspace employed during feature transforms for the datasets ‘iris’, ‘balance’, ‘Parkinsons’, ‘sonar’ and ‘USPS’.
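The evaluation protocol described above can be sketched with standard scikit-learn utilities; fit_transform_method and apply_transform below are hypothetical placeholders standing in for whichever feature transform is being evaluated:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def misclassification_ratio(X, labels, fit_transform_method, apply_transform, n_splits=5):
    """5-fold stratified cross-validation with a kNN (k=5) classifier on transformed
    features. X is M x N (samples in rows); the two callables are placeholders."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in skf.split(X, labels):
        model = fit_transform_method(X[train_idx], labels[train_idx])  # learn on training set
        Y_train = apply_transform(model, X[train_idx])
        Y_test = apply_transform(model, X[test_idx])
        knn = KNeighborsClassifier(n_neighbors=5).fit(Y_train, labels[train_idx])
        predictions = knn.predict(Y_test)
        errors.append(np.mean(predictions != labels[test_idx]))
    return float(np.mean(errors))                                      # average over the folds
```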

Figure 6 displays for each dataset the misclassification ratio as a function of the rank of the subspace learned by the algorithms. In the plots, ‘none’ indicates that no feature transform was applied (hence resulting in a sub-space rank equal to the dimension of the original features). In general we can see that S-IPR does not perform as well as the other techniques, and is only comparable at high ranks, which do not achieve an overall better classification ratio. Starting from the ‘iris’ dataset, LDA achieves the best performance, followed by one-dimensional subspaces learned using PCA. Both S-PCA and S-IPR work better when learning subspaces of high rank. Note that at rank N = 4 all the methods are equivalent because they are not performing dimensionality reduction. The results relative to the ‘balance’ dataset are similar, with again LDA achieving the best misclassification ratio. Although the results on the ‘Parkinsons’ and ‘sonar’ datasets present similar trends regarding S-IPR, here LDA does not prove to be as successful as PCA and S-PCA in separating features belonging to different classes. Finally, for the ‘USPS’ digits dataset, it appears that a low-rank PCA is sufficient to reach the best classification performance.

5 Conclusion

5.1 Summary

We have presented the S-IPR algorithm for learning incoherent subspaces from data belonging to different classes. The encouraging experimental results obtained on the visualisation of the synthetic dataset and of a subset of features taken from the ‘iris’ dataset motivated us to test S-IPR as a general method for feature transform to be used in classification problems. Unfortunately, we found that the performance of our proposed method on a group of datasets commonly used to benchmark classification algorithms is competitive with traditional and state-of-the-art feature transform methods only at high sub-space ranks.


The negative results presented in Section 4.2 do not imply that S-IPR is completely unsuitable as a tool for modelling data for classification, but rather open a few important areas of future research that should be investigated to better understand the strengths and limitations of the proposed method.

5.2 Future Work

The main assumption made when using incoherent dictionary learning for classification is that high-dimensional features are arranged onto lower-dimensional sub-spaces, and that features belonging to different classes can be modelled using different subspaces that are mutually incoherent. This assumption might be met by some datasets, but might not generally be satisfied by others. Understanding the general distribution of the features in a dataset might be a necessary first step to inform a subsequent choice of algorithm, so that S-IPR can be used in cases where its premise about the feature distribution is valid. This same argument holds for the whole class of linear models that comprises the dictionary learning model. Indeed, many feature transform techniques have equivalent kernelised versions to model non-linear data.

Other substantial improvements can be made to the algorithm itself. The present implementation of S-IPR contains a fixed parameter $\mu$ that promotes minimal mutual coherence between the sub-spaces used to approximate different data classes. Knowledge about the distribution of the features might lead to relaxing this parameter, learning sub-spaces that are closer to the true distribution of the features and in turn improving class separation. Moreover, different values of mutual coherence for different pairs of subspaces can be easily included in the optimisation, greatly enhancing the flexibility of S-IPR as a modelling tool.

References

1. Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.

2. Bair, E., Paul, D., Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119–137.

3. Barchiesi, D., & Plumbley, M.D. (2013). Learning incoherent dictionaries for sparse approximation using iterative projections and rotations. IEEE Transactions on Signal Processing, 61(8), 2055–2065.

4. Barshan, E., Ghodsi, A., Azimifar, Z., Jahromi, M.Z. (2011). Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7), 1357–1371.

5. Duda, R., & Hart, P.E. (1973). Pattern classification and scene analysis. New York: Wiley.

6. Elad, M. (2010). Sparse and redundant representations. New York: Springer.

7. Elhamifar, E., & Vidal, R. (2013). Sparse subspace clustering: algorithm, theory, and applications. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence.

8. Gretton, A., Bousquet, O., Smola, A., Scholkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic learning theory (pp. 63–77). New York: Springer.

9. Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327.

10. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, Sixth Series, 2, 559–572.

11. Rubinstein, R., Bruckstein, A., Elad, M. (2010). Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6), 1045–1057.

12. Schnass, K., & Vandergheynst, P. (2010). A union of incoherent spaces model for classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5490–5493).

13. Van Der Maaten, L., Postma, E., Van Den Herik, J. (2009). Dimensionality reduction: A comparative review. Tech. Rep., TiCC, Tilburg University.

14. Xing, E.P., Jordan, M.I., Russell, S., Ng, A. (2002). Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems (pp. 505–512).

Daniele Barchiesi ([email protected]) received an MSc by research and a PhD degree from the Centre for Digital Music at Queen Mary University of London in 2009 and 2013 respectively. His research interests include audio engineering, signal processing and machine learning, with a particular focus on the fields of dictionary learning for sparse approximation, acoustic scene classification and intelligent mixing. Since January 2014, Dr. Barchiesi has been a research associate in Data Science at University College London, where he is part of a research project with the Warwick Business School on ‘Big Data’ and computational social science.


Mark D. Plumbley ([email protected]) received the B.A. (honors) degree in electrical sciences and the Ph.D. degree in neural networks from the University of Cambridge, United Kingdom, in 1984 and 1991, respectively. From 1991 to 2001, he was a lecturer at King’s College London. He moved to Queen Mary University of London in 2002, where he is now Director of the Centre for Digital Music. His research focuses on the automatic analysis of music and other sounds, including automatic music transcription, beat tracking, and acoustic scene analysis, using methods such as source separation and sparse representations. He is a past chair of the ICA Steering Committee and is a member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing. He is a Senior Member of the IEEE.