Classification of Multichannel Signals With Cumulant-Based Kernels


Marco Signoretto, Emanuele Olivetti, Lieven De Lathauwer and Johan A. K. Suykens

Abstract

We consider the problem of training a classifier given a set of labelled multivariate time series (a.k.a. multichannel signals or vector processes). The approach is based on a novel kernel function that exploits the spectral information of tensors of fourth order cross-cumulants associated to each multichannel signal. This kernel is used in combination with a primal-dual technique to define a discriminative classifier which is entirely data-driven. Nonetheless, insightful connections with the dynamics of the generating systems can be drawn under specific modeling assumptions. The procedure is illustrated both on a synthetic example and on a brain decoding task where the direction, either left or right, towards which the subject modulates attention is predicted from magnetoencephalography (MEG) signals. Kernel functions that are widely used in machine learning do not exploit the structure inherited from the underlying dynamical nature of the data. A comparison with these kernels as well as with a state-of-the-art approach shows the advantage of the proposed technique.

Index Terms

Multiple signal classification, Higher order statistics, Statistical learning, Kernel methods, Brain

computer interfaces.

I. INTRODUCTION

This work is concerned with binary classification of multivariate time series. This task clearly deviates

from the standard pattern recognition framework where one deals with a set of measured attributes that

Marco Signoretto and Johan A. K. Suykens are with the Katholieke Universiteit Leuven, Department of Electrical Engineering, ESAT, Kasteelpark Arenberg 10, B3001 Leuven (Heverlee), Belgium, tel: +32-(0)16320362, fax: +32-(0)16321970 (e-mail: {marco.signoretto,johan.suykens}@esat.kuleuven.be).

Emanuele Olivetti is with the NeuroInformatics Laboratory (NILab), Bruno Kessler Foundation and University of Trento, Trento, Italy (e-mail: [email protected]).

Lieven De Lathauwer is with the Group Science, Engineering and Technology of the Katholieke Universiteit Leuven Campus Kortrijk, E. Sabbelaan 53, 8500 Kortrijk, Belgium, tel: +32-(0)56326062, fax: +32-(0)56246999 (e-mail: [email protected]).


do not change over time. A possible approach is based on the construction of metafeatures that exploit the additional structure inherited from the dependency over time [25], and then apply standard pattern recognition techniques on the derived set of attributes. An appealing class of tailored methodologies is based on probabilistic models, such as Hidden Markov Models (HMMs), which have found successful applications in areas like speech, handwriting and gesture recognition.

In this paper we investigate a kernel-based discriminative methodology for the classification of multivariate signals. The idea makes use of higher order statistics to build a suitable similarity measure starting from the measured signals. In this respect the kernel-based approach in [3] differs, since it is based on a two-step procedure that first performs the blind identification of the generating systems (one for each multivariate time series) and successively carries out the classification by endowing the space of such models with a distance measure. Similarly, the family of kernels found in [48] requires the knowledge of a model and hence demands a foregoing identification step. Notice that this step poses many challenges. In fact, most identification techniques involve multistage iterative procedures that can suffer from the problem of multiple local minima. The approach proposed in [3], for example, requires first estimating a set of candidate models. This amounts to 1) determining empirically the dimension of the state process, 2) finding the system matrices and finally 3) solving the Riccati equation to obtain a set of input-related matrices that matches the statistical properties of the output process. Successively, for each candidate model its inverse is used to deconvolve the observed output and recover the input process and the initial state. Remarkably, this deconvolution step is a nontrivial process given that the generating systems might have non-minimum phase (as observed for the case of human motion [3]). An independence criterion is then used to select a suitable model in the bag of candidates. After an additional inference step is taken, one gets to the parametric representation of the underlying dynamical system and can finally compute a system-based kernel to measure the similarity between the estimated models. On the contrary, the approach that we propose requires neither postulating any specific model class nor the explicit identification of systems. In fact, the idea can be seen as entirely data driven. Nonetheless, if a certain structure can be assumed, interesting properties arise. In Section IV-C, for instance, we draw the link between the proposed similarity measure and the latent state process in the case of multi-input multi-output (MIMO) linear and time-invariant state-space models driven by a non-Gaussian white process. This might provide useful insights on certain processes of interest, as for the case of the encephalography (EG) signals that we consider. Another important difference with respect to [3] is that we assume that the effect of the transient due to initial conditions is negligible in the observed signals. This is supported by the fact that in many practical applications the transient is discarded since


it is believed to carry information that is potentially misleading for the analysis. For instance, in the context of neuroscience it is known that the EG signals immediately after the presentation of certain stimuli are non-informative with respect to the mental processes of interest [46]. Likewise, we assume that no marginally stable modes (corresponding to poles on the unit circle) are present and hence the steady-state output does not exhibit persistent oscillations¹. This assumption is usually met since in many practical applications signals are preprocessed to avoid persistent rhythmical components. In biosignal processing, for instance, these are often the effect of artifacts such as breathing and heart beating.

First of all we discuss a framework for the discrimination of time-windows of multivariate signals that extends the classical formalism of statistical learning. Secondly, we introduce a kernel function that exploits the structure inherited from the dynamical nature of the data. We then show properties of such a similarity measure under specific assumptions on the generating mechanism. Finally, we test the proposed approach on a domain that has recently attracted wide interest in the scientific community.

In the next Section we recall the definition of cumulants of (vector) processes and present instrumental properties that relate time series with their generating dynamical system. In Section III we formalize the problem of interest. Section IV presents our approach. Section V is then concerned with the application that motivates this study, namely the classification of MEG signals. Both synthetic and real-life examples are considered. We conclude by drawing our final remarks in Section VI.

II. TENSORS AND CUMULANTS

We denote scalars by lower-case letters ($a, b, c, \ldots$), vectors by capitals ($A, B, C, \ldots$) and matrices by bold-face capitals ($\mathbf{A}, \mathbf{B}, \mathbf{C}, \ldots$). In particular we denote by $\mathbf{I}_N$ the $N \times N$ identity matrix. We also use lower-case letters $i, j$ in the meaning of indices and, with some abuse of notation, we use $I, J$ to denote the index upper bounds. Additionally, we write $\mathbb{N}_I$ to denote the set $\{1, \ldots, I\}$. We write $a_i$ to mean the $i$-th entry of a vector $A$. Similarly, we write $a_{ij}$ to mean the entry with row index $i$ and column index $j$ in a matrix $\mathbf{A}$, i.e., $a_{ij} = (\mathbf{A})_{ij}$.

A. Tensors

Before proceeding we recall that $N$-th order tensors, which we denote by calligraphic letters ($\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, \ldots), are higher order generalizations of vectors (first order tensors) and matrices (second order tensors). Scalars can be seen as tensors of order zero. We write $a_{i_1,\ldots,i_N}$ to denote $(\mathcal{A})_{i_1,\ldots,i_N}$. An $N$-th order tensor $\mathcal{A}$ has rank 1 if it consists of the outer product of $N$ non-zero vectors $U^{(1)} \in \mathbb{R}^{I_1}, U^{(2)} \in \mathbb{R}^{I_2}, \ldots, U^{(N)} \in \mathbb{R}^{I_N}$, that is, if

$$ a_{i_1 i_2 \ldots i_N} = u^{(1)}_{i_1} u^{(2)}_{i_2} \cdots u^{(N)}_{i_N} \qquad (1) $$

for all values of the indices. The linear span of such elements forms a vector space, denoted by $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$, which is endowed with the inner product

$$ \langle \mathcal{A}, \mathcal{B} \rangle := \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} a_{i_1 i_2 \cdots i_N}\, b_{i_1 i_2 \cdots i_N} \qquad (2) $$

and with the Hilbert-Frobenius norm [14]

$$ \|\mathcal{A}\|_F := \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}. \qquad (3) $$

It is often convenient to rearrange the elements of a tensor so that they form a matrix. This operation is referred to as matricization or unfolding.

Definition II.1 ($n$-mode matricization [28]). Assume an $N$-th order tensor $\mathcal{A} \in \mathbb{R}^{I_1} \otimes \cdots \otimes \mathbb{R}^{I_N}$. The $n$-th mode matrix unfolding, denoted as $\mathcal{A}_{\langle n \rangle}$, is the matrix

$$ \mathbb{R}^{I_n} \otimes \mathbb{R}^{J} \ni \mathcal{A}_{\langle n \rangle} : \; a^{(n)}_{i_n j} := a_{i_1 i_2 \ldots i_N} \qquad (4) $$

where $J := \prod_{p \in \mathbb{N}_N \setminus \{n\}} I_p$ and, for $R := [n+1, n+2, \ldots, N, 1, 2, 3, \ldots, n-1]$, we have²:

$$ j = 1 + \sum_{l \in \mathbb{N}_{N-1}} (i_{r_l} - 1) \prod_{p \in \mathbb{N}_{l-1}} I_{r_p} =: q(i_{r_1}, i_{r_2}, \ldots, i_{r_{N-1}}). \qquad (5) $$
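To make the index bookkeeping of Definition II.1 concrete, the following is a minimal NumPy sketch of the $n$-th mode unfolding (the function name mode_unfold and the use of Fortran-order reshaping to realize the column map (5) are our illustrative choices; indices are 0-based):

import numpy as np

def mode_unfold(A, n):
    # n-th mode matrix unfolding A_<n> of an N-th order tensor A (0-based n).
    # Rows are indexed by i_n; columns follow the cyclic ordering
    # R = [n+1, ..., N, 1, ..., n-1] of Definition II.1, with i_{r_1} varying
    # fastest, which is exactly the column index j of equation (5).
    N = A.ndim
    order = [(n + k) % N for k in range(N)]          # mode n first, then cyclic
    return np.reshape(np.moveaxis(A, order, range(N)),
                      (A.shape[n], -1), order='F')

A = np.arange(24.0).reshape(2, 3, 4)
print(mode_unfold(A, 1).shape)                       # (3, 8)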

We conclude this introductory part on tensors by defining the instrumental operation of the $n$-mode product.

Definition II.2 ($n$-mode product [14]). The $n$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$ by a matrix $\mathbf{U} \in \mathbb{R}^{J_n} \otimes \mathbb{R}^{I_n}$, denoted by $\mathcal{A} \times_n \mathbf{U}$, is the $(I_1 \times I_2 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N)$-tensor with entries

$$ (\mathcal{A} \times_n \mathbf{U})_{i_1 i_2 \cdots i_{n-1} j_n i_{n+1} \cdots i_N} := \sum_{i_n \in \mathbb{N}_{I_n}} a_{i_1 i_2 \cdots i_{n-1} i_n i_{n+1} \cdots i_N}\, u_{j_n i_n}. \qquad (6) $$

² Equation (5) simply defines a rule (in fact, a bijection $q$) that uniquely assigns to $(i_{r_1}, i_{r_2}, \ldots, i_{r_{N-1}})$ the corresponding column index $j$ of $\mathcal{A}_{\langle n \rangle}$.
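Likewise, the $n$-mode product of Definition II.2 can be sketched with np.tensordot (again a minimal illustration, with names of our choosing):

import numpy as np

def mode_product(A, U, n):
    # n-mode product A x_n U of equation (6); U has shape (J_n, I_n).
    # tensordot contracts mode n of A with the second axis of U and appends
    # the new axis (size J_n) at the end; moveaxis puts it back in position n.
    return np.moveaxis(np.tensordot(A, U, axes=([n], [1])), -1, n)

A = np.random.randn(2, 3, 4)
U = np.random.randn(5, 3)
print(mode_product(A, U, 1).shape)                   # (2, 5, 4)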


B. Cumulants of Scalar Processes

Cumulants are higher order statistics [34], [33] that have received interest especially for the identification of linear and non-linear systems [17], [20] as well as for blind source separation [8]. For a real-valued discrete-time and zero-mean stationary scalar process $x(t)$ [22], the $J$-th order cumulant function $c_x : \mathbb{Z}^{J-1} \to \mathbb{R}$ is defined in terms of the so-called cumulant generating function [35]. In particular, the second order cumulant is

$$ c_x(l_1) := \mathrm{E}[x(t)x(t+l_1)]. \qquad (7) $$

The third and fourth order cumulants are respectively given by:

$$ c_x(l_1, l_2) := \mathrm{E}[x(t)x(t+l_1)x(t+l_2)] \qquad (8) $$

and

$$ c_x(l_1, l_2, l_3) := \mathrm{E}[x(t)x(t+l_1)x(t+l_2)x(t+l_3)] - \mathrm{E}[x(t)x(t+l_1)]\,\mathrm{E}[x(t+l_2)x(t+l_3)] - \mathrm{E}[x(t)x(t+l_2)]\,\mathrm{E}[x(t+l_3)x(t+l_1)] - \mathrm{E}[x(t)x(t+l_3)]\,\mathrm{E}[x(t+l_1)x(t+l_2)]. \qquad (9) $$
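As a quick illustration of (7)-(9), the sketch below estimates these cumulants from a zero-mean sample by replacing expectations with time averages (the estimator discussed more formally in Section II-D); the helper is ours and assumes non-negative lags for brevity:

import numpy as np

def scalar_cumulants(x, l1, l2, l3):
    # Empirical c_x(l1), c_x(l1, l2), c_x(l1, l2, l3) of a zero-mean sample x.
    t = np.arange(len(x) - max(l1, l2, l3, 0))

    def m(la, lb):                                   # time average of x(t+la) x(t+lb)
        return np.mean(x[t + la] * x[t + lb])

    c2 = m(0, l1)                                                        # (7)
    c3 = np.mean(x[t] * x[t + l1] * x[t + l2])                           # (8)
    c4 = (np.mean(x[t] * x[t + l1] * x[t + l2] * x[t + l3])              # (9)
          - m(0, l1) * m(l2, l3) - m(0, l2) * m(l3, l1) - m(0, l3) * m(l1, l2))
    return c2, c3, c4

x = np.random.laplace(size=5000)                     # non-Gaussian, zero-mean sample
print(scalar_cumulants(x, 0, 0, 0))                  # c4 at zero lags: excess kurtosis (nonzero here)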

C. Cumulants of Vector Processes

In order to introduce higher-order statistics for a real $D$-variate discrete-time stationary vector process³ $X(t)$ one needs the notion of cross-cumulants between the scalar entries of the vector. If $\mathrm{E}[X(t)] = 0$ then, similarly to (7)-(9), one has:

$$ c^X_{d_1 d_2}(l_1) := \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t+l_1)] \qquad (10) $$

which is the second order cross-cumulant between the $d_1$-th and $d_2$-th entries of the vector process $X$,

$$ c^X_{d_1 d_2 d_3}(l_1, l_2) := \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t+l_1)\, x_{d_3}(t+l_2)] \qquad (11) $$

³ Sometimes we simply indicate it as $X$.


which is the third order cross-cumulant between the $d_1$-th, $d_2$-th and $d_3$-th entries of the vector process $X$, and

$$ c^X_{d_1 d_2 d_3 d_4}(l_1, l_2, l_3) := \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t+l_1)\, x_{d_3}(t+l_2)\, x_{d_4}(t+l_3)] - \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t+l_1)]\,\mathrm{E}[x_{d_3}(t+l_2)\, x_{d_4}(t+l_3)] - \mathrm{E}[x_{d_1}(t)\, x_{d_3}(t+l_2)]\,\mathrm{E}[x_{d_4}(t+l_3)\, x_{d_2}(t+l_1)] - \mathrm{E}[x_{d_1}(t)\, x_{d_4}(t+l_3)]\,\mathrm{E}[x_{d_2}(t+l_1)\, x_{d_3}(t+l_2)] \qquad (12) $$

which is the fourth order cross-cumulant between the $d_1$-th, $d_2$-th, $d_3$-th and $d_4$-th entries of the vector process $X$.

D. Cumulant Tensors

We now bring in the tensor representations of cumulants that we consider in this paper. To begin with, for a given finite $J \in \mathbb{N}$ and any $i \in \mathbb{N}_{J-1}$, let $L_i \subset \mathbb{Z}$ be an arbitrary finite set with cardinality $S_i$ and denote by $l^i_s$ its $s$-th element. Consider now the collection of $\prod_{i \in \mathbb{N}_{J-1}} S_i$ integer tuples obtained as

$$ \mathcal{L} := L_1 \times L_2 \times \cdots \times L_{J-1}. \qquad (13) $$

Definition II.3 ($\mathcal{L}$-Cumulant Tensor). We call $\mathcal{L}$-cumulant tensor, denoted as $\mathcal{C}^X_{\mathcal{L}}$, the $(2J-1)$-th order tensor

$$ \mathcal{C}^X_{\mathcal{L}} \in \underbrace{\mathbb{R}^D \otimes \mathbb{R}^D \otimes \cdots \otimes \mathbb{R}^D}_{J \text{ times}} \otimes\, \mathbb{R}^{S_1} \otimes \mathbb{R}^{S_2} \otimes \cdots \otimes \mathbb{R}^{S_{J-1}} $$

defined entry-wise by

$$ \left(\mathcal{C}^X_{\mathcal{L}}\right)_{d_1 d_2 \cdots d_J s_1 s_2 \cdots s_{J-1}} := c^X_{d_1 d_2 \cdots d_J}\big(l^1_{s_1}, l^2_{s_2}, \ldots, l^{J-1}_{s_{J-1}}\big) \qquad (14) $$

for any tuple $[d_1, d_2, \ldots, d_J, s_1, s_2, \ldots, s_{J-1}] \in \mathbb{N}_D \times \mathbb{N}_D \times \cdots \times \mathbb{N}_D \times \mathbb{N}_{S_1} \times \mathbb{N}_{S_2} \times \cdots \times \mathbb{N}_{S_{J-1}}$.

Example II.4. As a concrete example suppose that $X$ is a zero-mean $4$-variate ($D = 4$) stationary process. Take $J = 3$ and let $L_1 = \{-3, 1\}$ and $L_2 = \{-3, 0, 2\}$. In this case $\mathcal{L} = \{(-3,-3), (-3,0), (-3,2), (1,-3), (1,0), (1,2)\}$, and each entry of the cumulant tensor $\mathcal{C}^X_{\mathcal{L}}$, indexed by a tuple $[d_1, d_2, d_3, s_1, s_2] \in \mathbb{N}_4 \times \mathbb{N}_4 \times \mathbb{N}_4 \times \mathbb{N}_2 \times \mathbb{N}_3$, is a third order cross-cumulant between the $d_1$-th, $d_2$-th and $d_3$-th entries of the vector process $X$.

A special case that arises from the previous definition corresponds to the situation where, for any $i \in \mathbb{N}_{J-1}$, the set $L_i$ is a singleton. In this case $\mathcal{L}$ is also a singleton, i.e., we have $\mathcal{L} = \{[l^1_1, l^2_1, \ldots, l^{J-1}_1]\}$ for a single fixed tuple of integers $[l^1_1, l^2_1, \ldots, l^{J-1}_1]$. Additionally, $J-1$ dimensions of $\mathcal{C}^X_{\mathcal{L}}$ are degenerate, that is, $\mathcal{C}^X_{\mathcal{L}}$ can be equivalently represented as a $J$-th order tensor.

Definition II.5 ($(l_1, l_2, \ldots, l_{J-1})$-Cumulant Tensor). We call $(l_1, l_2, \ldots, l_{J-1})$-cumulant tensor, denoted as $\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1})$, the $J$-th order tensor

$$ \mathcal{C}^X(l_1, l_2, \ldots, l_{J-1}) \in \underbrace{\mathbb{R}^D \otimes \mathbb{R}^D \otimes \cdots \otimes \mathbb{R}^D}_{J \text{ times}} $$

defined entry-wise by

$$ \left(\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1})\right)_{d_1 d_2 \cdots d_J} := c^X_{d_1 d_2 \cdots d_J}(l_1, l_2, \ldots, l_{J-1}). \qquad (15) $$

In the remainder of this paper we simply write cumulant tensor when the specific notion that we are referring to is clear from the context. Additionally we write, respectively, $\mathcal{C}^X_{\mathcal{L}}$ and $\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1})$ to mean the empirical (a.k.a. sample) versions of the cumulant tensors introduced in the previous definitions. These are simply obtained by replacing the expectation operator by the time average in the definitions of the cross-cumulants (10)-(12) [19], [43]. For instance, suppose that $X(1), X(2), \ldots, X(T)$ is a given sample from the process $X$. Then, if we let $T_0 := \max\{1, 1-l_1, 1-l_2\}$ and $T' := \min\{T, T-l_1, T-l_2\}$, the empirical version of (11) is obtained as:

$$ c^X_{d_1 d_2 d_3}(l_1, l_2) := \frac{1}{T' - T_0 + 1} \sum_{t = T_0}^{T'} x_{d_1}(t)\, x_{d_2}(t+l_1)\, x_{d_3}(t+l_2). $$

We also assume (as done everywhere else, at least implicitly) that sample versions of cumulants converge towards their theoretical counterparts for $T \to \infty$. This is true, in particular, whenever $X$ is strictly stationary and ergodic and $\mathrm{E}[x_d^2(t)] < \infty$ for all $d \in \mathbb{N}_D$ [22, Section IV.2, Theorem 2]⁴. However, as is known, the property of ergodicity is not observationally verifiable from a single realization of a time series [22].

We now link a class of time series with their generating dynamical system. This will allow us to derive useful properties that support the use of cumulant-based representations for the task of discriminating between classes of multichannel signals, which we consider in the next Section.

⁴ Different is the situation of signals driven by cyclostationary sources, which we do not consider here. In this situation, arising for instance in the context of radiocommunications, it is known that the empirical estimators arising from time averages are biased. See, e.g., [18] and [8, Section 17.2.2.2].


E. Cumulants of Processes with State Space Representation

Suppose that, for the given stationary process $X$, the z-spectrum⁵

$$ S(z) := \sum_{l \in \mathbb{Z}} \mathcal{C}^X(l)\, z^{-l} $$

is a rational function and has maximal normal rank everywhere on the complex unit circle. In this case a fundamental result [26] shows that there are $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{D_Z} \otimes \mathbb{R}^{D_Z}$, $\mathbf{C} \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{D_Z}$ and white processes⁶ $W$ (input process) and $V$ (noise process) such that $X$ can be interpreted as the output process of the MIMO stable system $\Sigma = (\mathbf{A}, \mathbf{B}, \mathbf{C})$ in state-space form:

$$ Z(t+1) = \mathbf{A} Z(t) + \mathbf{B} W(t) \qquad (16) $$
$$ X(t) = \mathbf{C} Z(t) + V(t). \qquad (17) $$
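For concreteness, a small NumPy sketch of a stable system of the form (16)-(17) driven by a non-Gaussian (uniform) white input and Gaussian measurement noise; all names, dimensions and distributions here are illustrative choices, not prescriptions taken from the paper:

import numpy as np

def simulate_state_space(A, B, C, T, eta=1.0, burn_in=500, rng=None):
    # Simulate T steady-state samples of (16)-(17); A must have spectral radius < 1.
    rng = np.random.default_rng() if rng is None else rng
    Dz, Dx = A.shape[0], C.shape[0]
    Z = np.zeros(Dz)
    X = np.empty((Dx, T))
    for t in range(burn_in + T):                     # discard the transient (burn_in samples)
        W = rng.uniform(0.0, 1.0, Dz) - 0.5          # zero-mean, non-Gaussian white input
        Z = A @ Z + B @ W
        if t >= burn_in:
            V = rng.normal(0.0, np.sqrt(eta), Dx)    # Gaussian white output noise
            X[:, t - burn_in] = C @ Z + V
    return X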

Conversely, starting from the generating dynamical system (16)-(17), we have the following basic result.

Proposition II.1 (State-space Model: Cumulant of the Output Process). Assume a stable state space model (16)-(17) specified as above and suppose additionally that the processes $W$ and $V$ are mutually independent. Then for any tuple $[l_1, l_2, \ldots, l_{J-1}] \in \mathbb{Z}^{J-1}$, in steady state we have:

$$ \mathcal{C}^X(l_1, l_2, \ldots, l_{J-1}) = \mathcal{C}^Z(l_1, l_2, \ldots, l_{J-1}) \times_1 \mathbf{C} \times_2 \mathbf{C} \times_3 \cdots \times_J \mathbf{C} + \mathcal{C}^V(l_1, l_2, \ldots, l_{J-1}). \qquad (18) $$

Proof: In steady state the state vector results from a linear filtering of the input process $W$ and hence is statistically independent of the noise process $V$. The result now follows immediately from basic properties of cumulants, namely multilinearity and their additivity under the independence assumption (see e.g. [15, Sec. 2.2, properties 2 and 5]).
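For completeness, the entry-wise computation behind (18): writing $x_d(t) = \sum_a c_{d a} z_a(t) + v_d(t)$ (with $c_{da}$ the entries of $\mathbf{C}$) and using additivity of cumulants for the independent pair $(Z, V)$ together with multilinearity,

$$ c^X_{d_1 \cdots d_J}(l_1, \ldots, l_{J-1}) = \sum_{a_1, \ldots, a_J} c_{d_1 a_1} \cdots c_{d_J a_J}\, c^Z_{a_1 \cdots a_J}(l_1, \ldots, l_{J-1}) + c^V_{d_1 \cdots d_J}(l_1, \ldots, l_{J-1}), $$

which is exactly the entry $(d_1, \ldots, d_J)$ of $\mathcal{C}^Z(l_1, \ldots, l_{J-1}) \times_1 \mathbf{C} \cdots \times_J \mathbf{C} + \mathcal{C}^V(l_1, \ldots, l_{J-1})$, since each mode product $\times_j \mathbf{C}$ contracts one index of $\mathcal{C}^Z$ with a row of $\mathbf{C}$.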

Results in line with Proposition II.1 have appeared in related forms elsewhere. Cumulants of the output process of a stable single-input single-output linear time-invariant system were studied in [6] in terms of the system matrices. Cumulants of a broader class of linear MIMO systems (time-varying, with possibly non-stationary inputs) were the subject of [44]. Additionally, notice that the cumulant of the state process in (16) can also be given in terms of the system matrices $\mathbf{A}$ and $\mathbf{B}$ and the input process cumulant. Our aim, however, is simply to illustrate the advantage of cumulants for discriminating between observed multivariate processes.

⁵ Notice that $\mathcal{C}^X(l)$ corresponds to $\mathrm{E}[X(t) X(t+l)^\top]$.
⁶ We recall that a stationary stochastic process $X(t)$ is white if $\mathrm{E}[X(i) X^\top(j)] = \mathbf{Q}\, \delta_{ij}$, where $\delta$ is the Kronecker delta function and $\mathbf{Q}$ denotes some covariance matrix.


To see this, recall that the cumulant of a white Gaussian process vanishes as soon as $J > 2$ [34]. Moreover, if a real random variable has a symmetric distribution about the origin, then all its odd cumulants vanish [34]. In many cases this last condition is naturally verified for the entries of the state vector and one then considers $J \geq 4$. In fact, this choice of $J$ and equation (18) ensure that the output cumulant is solely determined by the underlying dynamical system and the input distribution and, in particular, is independent of the noise process. Generally speaking, it becomes harder to estimate cumulants from sample data as the order increases, i.e., longer data sets are required to obtain the same accuracy [4], [19], [43]. In the following we stick to $J = 4$, a choice also used in the context of system identification [31].

Finally, notice that, in principle, the linearity of the model (16)-(17) is somewhat restrictive: for general systems both the state update and the measurement map from the state vector $Z(t)$ to the output vector $X(t)$ might be governed by nonlinear equations. Nonetheless, it turns out that under certain assumptions nonlinear systems have the same input-output behavior as linear systems with a possibly higher dimensional state process [30].

III. CLASSIFICATION OF MULTICHANNEL SIGNALS

A. General Problem Statement

It is assumed that we are given a dataset of $N$ input-target⁷ pairs

$$ \mathcal{D}_N = \left\{ \left(X^{(n)}, y_n\right) \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \{-1, 1\} \,:\, n \in \mathbb{N}_N \right\} $$

drawn in an i.i.d. manner from an unknown fixed joint distribution $P_{XY}$. Specifically, for a sampling interval $\Delta$ and $t_0 > 0$, the $n$-th input pattern

$$ X^{(n)} = \left[ X^{(n)}(1), X^{(n)}(2), \ldots, X^{(n)}(t), \ldots, X^{(n)}(T) \right] \qquad (19) $$

represents the evolution along the time instants $\Delta + t_0, 2\Delta + t_0, \ldots, t\Delta + t_0, \ldots, T\Delta + t_0$ of the $D_X$-variate process $X^{(n)}$. Notice that the fixed marginal distribution $P_X$ (as well as the conditional distribution $P_{X|Y=y}$ on $\mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$) is induced by the underlying mechanism generating the stationary multivariate time series. With reference to (16), an interesting modeling assumption might correspond to the case where $(\mathbf{A}, \mathbf{B}, \mathbf{C}) \sim P_{ABC}$, where $P_{ABC}$ is some probability measure on the class of state-space models.

⁷ We reserve the word output to refer to the output process whereas we use target to address the target label. To avoid confusion we also stress that an input pattern $X$ is formed upon a time frame extracted from what corresponds (in the standard state-space nomenclature) to the output process $X$ (compare equation (19) with equation (16)).


Similarly, a conditional model might be $(\mathbf{A}(y), \mathbf{B}(y), \mathbf{C}(y)) \sim P_{ABC|Y=y}$, where $y$ is a given class label. We shall illustrate this idea in a non-technical manner by means of a concrete application in the following Subsection. Meanwhile, notice that the task of interest amounts to predicting the label $y$ corresponding to a test pattern $X$ by means of a classifier

$$ \bar{f} : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \{-1, 1\}. \qquad (20) $$

This is accomplished by finding a real-valued function $f : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ so that its quantized version $\bar{f} = \mathrm{sign}\, f$ is as close as possible to some minimizer $h$ of a predefined performance criterion, such as the expected risk based on the $0$-$1$ loss:

$$ h \in \arg\min \; \mathrm{E}_{XY}\big[ I(h(X) - y) \big] \quad \text{s.t.} \quad h : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \{-1, 1\} $$

where $I$ is the indicator function $I(x) = \begin{cases} 1, & x \neq 0 \\ 0, & \text{otherwise}. \end{cases}$

B. The Case of EG Signals

Among others, a state-space framework for multichannel electroencephalography (EEG) signals was proposed in [5], [7] with the purpose of modeling cortical connectivity⁸. In this case (16)-(17) is assumed to describe the physics behind the cortical dynamics. Specifically, the state process $Z$ represents the latent cortical signals. On the other hand, the output process $X$ represents the evolution in time of the electric potentials (the multichannel EEG signals) measured at the scalp. Equation (17) expresses them as a noisy linear combination of the cortical signals.

Suppose now that the $N$ training inputs refer to multichannel EG signals measured from $N$ different subjects on certain time windows. Each input observation is labelled according to the class to which the corresponding subject belongs. Then one might assume that, for each class, there is an underlying nominal system (induced by the class-specific average brain) and a subject-dependent displacement from it. For experiments on a single subject, on the other hand, one might assume that the subject-specific system remains unchanged, whereas what changes is the cortical activity driven by external stimuli belonging to different categories.

IV. CUMULANT-BASED KERNEL APPROACH FOR MULTICHANNEL SIGNALS

In Section II-E we have shown that a representation of multivariate processes by means of cumulants enjoys certain favorable properties. We now exploit this representation in order to establish a similarity measure between two arbitrary input patterns $X_1$ and $X_2$ formed by time windows of output vector processes.

⁸ We argue that a similar state-space representation is plausible also for describing MEG signals.


We do this by relying on a tensorial kernel that exploits the spectral information of cumulant tensors. We begin by briefly recalling from [38] the notion of factor kernels and that of the tensorial kernel. Successively, we specialize to the case of cumulant tensors.

A. Factor Kernels and Tensorial Kernel

Let $\mathcal{X}_{\langle p \rangle}$ and $\mathcal{Y}_{\langle p \rangle}$ denote the $p$-th matrix unfolding of two generic elements $\mathcal{X}$ and $\mathcal{Y}$ of the same vector space $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_P}$. The singular value decomposition (SVD) of $\mathcal{X}_{\langle p \rangle}$ can be factorized in a block-partitioned fashion as follows:

$$ \mathcal{X}_{\langle p \rangle} = \begin{pmatrix} \mathbf{U}^{(p)}_{\mathcal{X},1} & \mathbf{U}^{(p)}_{\mathcal{X},2} \end{pmatrix} \begin{pmatrix} \mathbf{S}^{(p)}_{\mathcal{X},1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{V}^{(p)\top}_{\mathcal{X},1} \\ \mathbf{V}^{(p)\top}_{\mathcal{X},2} \end{pmatrix} \qquad (21) $$

where the $\mathbf{0}$'s are matrices of zeros with appropriate sizes. A similar decomposition holds for $\mathcal{Y}_{\langle p \rangle}$. Let now $g(\mathbf{V}^{(p)}_{\mathcal{X},1}, \mathbf{V}^{(p)}_{\mathcal{Y},1})$ be the distance function given by:

$$ g\big(\mathbf{V}^{(p)}_{\mathcal{X},1}, \mathbf{V}^{(p)}_{\mathcal{Y},1}\big) := \left\| \mathbf{V}^{(p)}_{\mathcal{X},1} \mathbf{V}^{(p)\top}_{\mathcal{X},1} - \mathbf{V}^{(p)}_{\mathcal{Y},1} \mathbf{V}^{(p)\top}_{\mathcal{Y},1} \right\|_F^2. \qquad (22) $$

The latter is defined in terms of the subspaces spanned by the right singular vectors corresponding to non-vanishing singular values and is known as the projection Frobenius-norm [16], [38]. It can be restated in terms of principal angles that, in turn, are commonly used to define a metric over linear dynamical systems [12], [32]. We can now introduce the $p$-th factor kernel according to

$$ k_p(\mathcal{X}, \mathcal{Y}) := \exp\left( -\frac{1}{2\sigma^2}\, g\big(\mathbf{V}^{(p)}_{\mathcal{X},1}, \mathbf{V}^{(p)}_{\mathcal{Y},1}\big) \right) \qquad (23) $$

where $\sigma$ is a user-defined tuning parameter. Finally, we construct the positive definite tensorial kernel function $k$ as the pointwise product of the factor kernels [38]

$$ k(\mathcal{X}, \mathcal{Y}) := \prod_{p \in \mathbb{N}_P} k_p(\mathcal{X}, \mathcal{Y}). \qquad (24) $$

The latter is a measure of similarity based upon the spectral content of the different matrix unfoldings, see [38] for details. Additionally, we notice that the induced similarity measure can be studied from an algebraic perspective starting from a characterization of the subspaces associated to the matrix unfoldings [13].
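The following NumPy sketch makes (21)-(24) operational for a generic $P$-th order tensor: it extracts the right singular vectors of each unfolding, evaluates the projection Frobenius-norm distance (22) and multiplies the factor kernels (23) as in (24). It reuses the mode_unfold helper sketched in Section II-A; all function names are ours:

import numpy as np

def right_singular_basis(M, tol=1e-10):
    # V_1: right singular vectors of M associated with non-vanishing singular values.
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[s > tol * s.max()].T

def projection_distance(V1, V2):
    # Projection Frobenius-norm distance (22) between the spanned subspaces.
    return np.linalg.norm(V1 @ V1.T - V2 @ V2.T, 'fro') ** 2

def tensorial_kernel(X, Y, sigma):
    # Product of factor kernels (23) over all matrix unfoldings, as in (24).
    k = 1.0
    for p in range(X.ndim):
        VX = right_singular_basis(mode_unfold(X, p))
        VY = right_singular_basis(mode_unfold(Y, p))
        k *= np.exp(-projection_distance(VX, VY) / (2.0 * sigma ** 2))
    return k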


B. Cumulant-based Kernel Functions

We now consider a specific type of tensors, i.e., the $\mathcal{L}$-cumulant tensors that were introduced in Section II-D. Notice that, since we stick to the case of fourth order cumulants, these tensors have order 7. Additionally, one has $\mathcal{L} := L_1 \times L_2 \times L_3$ where, for any $i \in \{1, 2, 3\}$, $L_i$ is an arbitrary finite subset of $\mathbb{Z}$. Given the general definition of the tensorial kernel (24), it is now straightforward to introduce a cumulant-based kernel $k_{\mathcal{L}} : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ according to:

$$ k_{\mathcal{L}}(X_1, X_2) := k\big( \mathcal{C}^{X_1}_{\mathcal{L}}, \mathcal{C}^{X_2}_{\mathcal{L}} \big) \qquad (25) $$

where the empirical versions of the cumulants are computed based upon the observed time frames $X_1$ and $X_2$. Notice that the freedom in constructing $\mathcal{L}$ gives rise to a whole family of such kernel functions. For general $\mathcal{L}$, however, the evaluation of $k_{\mathcal{L}}$ at $X_1$ and $X_2$ requires computing the SVD of 14 matrices (7 for each argument). The sizes of these matrices depend upon $D_X$ and the cardinalities of the $L_i$, $i \in \mathbb{N}_3$. Since in many practical applications $D_X$ is of the order of tens or even hundreds, $k_{\mathcal{L}}$ might be prohibitive to evaluate. As discussed in Section II-D, however, the degenerate case where $\mathcal{L}$ is a singleton implies a significant reduction in the order of the tensor. In the present setting $\mathcal{C}^{X_1}_{\mathcal{L}}$ and $\mathcal{C}^{X_2}_{\mathcal{L}}$ can be equivalently represented as 4-th order tensors when $\mathcal{L}$ is a singleton. For this reason we consider the special case where $\mathcal{L} = \{[0, 0, 0]\}$ and hence contains only the zero-lags tuple. A similar restriction is generally considered in the context of blind source separation [18]. An additional reduction can be achieved by fixing a specific channel, indexed by $r \in \mathbb{N}_D$, that is used as a reference signal. To see this, let us define $\mathcal{C}^{r}_{4X}$ entry-wise by:

$$ \big(\mathcal{C}^{r}_{4X}\big)_{d_1 d_2 d_3} := \big(\mathcal{C}^X(0,0,0)\big)_{d_1 d_2 d_3 r} \quad \forall\, [d_1, d_2, d_3] \in \mathbb{N}_{D_X} \times \mathbb{N}_{D_X} \times \mathbb{N}_{D_X}. \qquad (26) $$

Notice that $\mathcal{C}^{r}_{4X}$ is a third order tensor whereas $\mathcal{C}^X(0,0,0)$ is a fourth order tensor. We can then define the kernel function $k_r : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ according to:

$$ k_r(X, Y) := k_1(\mathcal{C}^{r}_{4X}, \mathcal{C}^{r}_{4Y})\, k_2(\mathcal{C}^{r}_{4X}, \mathcal{C}^{r}_{4Y})\, k_3(\mathcal{C}^{r}_{4X}, \mathcal{C}^{r}_{4Y}). \qquad (27) $$

The procedure in Algorithm 1 shows the steps required to evaluate $k_r$ for the input patterns in $\mathcal{D}_N$. The advantage of the reference signal directly impacts the computation at line (i) and at line (ii) of Algorithm 1. When needed, a higher level of reduction can be achieved with multiple reference signals.


Algorithm 1: KERNELMATRIX($\mathcal{D}_N$, $\sigma$, $r$)

for each $n \in \mathbb{N}_N$ do   (comment: precompute cumulant tensors and SVDs)
    compute $\mathcal{C}^{r}_{4X^{(n)}}$ (see (26), (15) and (12))   (i)
    $\mathcal{A}^{(n)} \leftarrow \mathcal{C}^{r}_{4X^{(n)}}$
    for each $i \in \mathbb{N}_3$ do
        $\mathbf{V}^{(i)}_{\mathcal{A}^{(n)},1} \leftarrow \mathrm{SVD}\big(\mathcal{A}^{(n)}_{\langle i \rangle}\big)$   (ii)
for each $n, m \in \mathbb{N}_N$, $m > n$ do   (comment: compute entries of the kernel matrix $\mathbf{K}_r$)
    for each $i \in \mathbb{N}_3$ do
        $a_i \leftarrow g\big(\mathbf{V}^{(i)}_{\mathcal{A}^{(n)},1}, \mathbf{V}^{(i)}_{\mathcal{A}^{(m)},1}\big)$
    $(\mathbf{K}_r)_{nm} \leftarrow \exp\big(-\tfrac{1}{2\sigma^2}(a_1 + a_2 + a_3)\big)$
$\mathbf{K}_r \leftarrow \mathbf{K}_r + \mathbf{K}_r^{\top} + \mathbf{I}_N$
return $\mathbf{K}_r$
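Under our reading of Algorithm 1 (kernel entries obtained by combining the three factor distances as in (27)), the whole procedure can be sketched in NumPy as follows; the cumulant routine implements (26) via (12) at zero lags for zero-mean signals, it reuses the helpers sketched earlier (mode_unfold, right_singular_basis, projection_distance), and none of the function names belong to LS-SVMlab or any other toolbox:

import numpy as np

def ref_cumulant_tensor(X, r):
    # C^r_4X of (26): fourth order zero-lag cross-cumulants (12) with the fourth
    # index fixed to the reference channel r; X is (D, T) and zero-mean.
    D, T = X.shape
    M = X @ X.T / T                                        # M[a, b] = E[x_a x_b]
    m4 = np.einsum('at,bt,ct,t->abc', X, X, X, X[r]) / T   # E[x_a x_b x_c x_r]
    return (m4
            - np.einsum('ab,c->abc', M, M[:, r])           # E[x_a x_b] E[x_c x_r]
            - np.einsum('ac,b->abc', M, M[:, r])           # E[x_a x_c] E[x_r x_b]
            - np.einsum('a,bc->abc', M[:, r], M))          # E[x_a x_r] E[x_b x_c]

def kernel_matrix(patterns, sigma, r):
    # Kernel matrix K_r of Algorithm 1: steps (i)-(ii) precompute the subspaces,
    # the double loop combines the factor distances (22) as in (27).
    bases = []
    for X in patterns:
        Xc = X - X.mean(axis=1, keepdims=True)             # center each channel
        C = ref_cumulant_tensor(Xc, r)
        bases.append([right_singular_basis(mode_unfold(C, i)) for i in range(3)])
    N = len(patterns)
    K = np.eye(N)                                          # k_r(X, X) = 1 on the diagonal
    for n in range(N):
        for m in range(n + 1, N):
            a = sum(projection_distance(bases[n][i], bases[m][i]) for i in range(3))
            K[n, m] = K[m, n] = np.exp(-a / (2.0 * sigma ** 2))
    return K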

C. Connection with the System Dynamics

The properties of cumulants recalled in the previous Section make it possible to bridge the gap between the kernel constructed on the observed outputs and the dynamics of the generating systems. In particular, we have the following invariance properties.

Theorem IV.1 (Asymptotic invariances). Assume systems $\Sigma_1 = (\mathbf{A}_1, \mathbf{B}_1, \mathbf{C}_1)$ and $\Sigma_2 = (\mathbf{A}_2, \mathbf{B}_2, \mathbf{C}_2)$ driven by, respectively, mutually independent white ergodic processes $(W_1, V_1)$ and mutually independent white ergodic processes $(W_2, V_2)$. Additionally assume that $V_1$ and $V_2$ are Gaussian. Let $X_1 \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ and $X_2 \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ be given realizations from time-frames of the output processes taken from the steady-state behavior of the two systems. Then for $T \to \infty$ and any $i \in \mathbb{N}_3$:

1) if $\mathbf{C}_1 = \mathbf{C}_2$ the evaluation of the $i$-th factor can be restated as:

$$ k_i(\mathcal{C}^{r}_{4X_1}, \mathcal{C}^{r}_{4X_2}) = \exp\left( -\frac{1}{2\sigma^2}\, g\big( \mathbf{V}^{(i)}_{\mathcal{C}^{Z_1}(0,0,0),1},\, \mathbf{V}^{(i)}_{\mathcal{C}^{Z_2}(0,0,0),1} \big) \right) \qquad (28) $$

and hence depends only on the fourth order cross-cumulants of the state processes;

2) if $X'_1, X'_2 \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ are also output realizations obtained from systems $\Sigma_1$ and $\Sigma_2$ respectively,


then we have:

a) $k_i(\mathcal{C}^{r}_{4X'_1}, \mathcal{C}^{r}_{4X'_2}) = k_i(\mathcal{C}^{r}_{4X_1}, \mathcal{C}^{r}_{4X_2})$;

b) $k_i(\mathcal{C}^{r}_{4X'_1}, \mathcal{C}^{r}_{4X_1}) = 1$ and $k_i(\mathcal{C}^{r}_{4X'_2}, \mathcal{C}^{r}_{4X_2}) = 1$.

Proof: We indicate here by $X$ (resp. $X'$) the generic output process, either $X_1$ or $X_2$ (resp. $X'_1$ or $X'_2$). First of all notice that, since the output processes are ergodic, for $T \to \infty$ we have:

$$ \big(\mathcal{C}^{r}_{4X}\big)_{d_1 d_2 d_3} \to \big(\mathcal{C}^X(0,0,0)\big)_{d_1 d_2 d_3 r} \quad \forall\, [d_1, d_2, d_3] \in \mathbb{N}_{D_X} \times \mathbb{N}_{D_X} \times \mathbb{N}_{D_X}. $$

In particular, this implies that $\mathcal{C}^{r}_{4X} \equiv \mathcal{C}^{r}_{4X'}$, from which the equations in 2.a and 2.b immediately follow. In order to prove the first part notice now that, without loss of generality⁹, we can consider $\mathbf{C}$ to be orthogonal. By Proposition II.1 we have:

$$ \mathcal{C}^X(0,0,0) = \mathcal{C}^Z(0,0,0) \times_1 \mathbf{C} \times_2 \mathbf{C} \times_3 \mathbf{C} \times_4 \mathbf{C} + \mathcal{C}^V(0,0,0) $$

where $\mathcal{C}^V(0,0,0)$ vanishes by the Gaussianity assumption on the noise processes. Denote by $\mathbf{A} \otimes \mathbf{B}$ the Kronecker product between any two matrices $\mathbf{A}$ and $\mathbf{B}$. The $i$-th mode unfolding $\big(\mathcal{C}^{r}_{4X}\big)_{\langle i \rangle}$ now corresponds to

$$ \big(\mathcal{C}^X(0,0,0)\big)_{\langle i \rangle} = \mathbf{C}\, \big(\mathcal{C}^Z(0,0,0)\big)_{\langle i \rangle}\, (\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C})^\top $$

where we used a basic property of the matricization operation [14], [36]. Notice that $\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C}$ is an orthogonal matrix and hence, if $\mathbf{U}^{(i)}_{\mathcal{C}^Z(0,0,0),1} \mathbf{S}^{(i)}_{\mathcal{C}^Z(0,0,0),1} \mathbf{V}^{(i)\top}_{\mathcal{C}^Z(0,0,0),1}$ is the thin SVD [21] factorization of $\big(\mathcal{C}^Z(0,0,0)\big)_{\langle i \rangle}$, then for the right singular vectors $\mathbf{V}^{(i)}_{\mathcal{C}^{r}_{4X},1}$ we have

$$ \mathbf{V}^{(i)}_{\mathcal{C}^{r}_{4X},1} = (\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C})\, \mathbf{V}^{(i)}_{\mathcal{C}^Z(0,0,0),1}. \qquad (29) $$

Replacing (29) in the formula that gives $k_i(\mathcal{C}^{r}_{4X_1}, \mathcal{C}^{r}_{4X_2})$ and exploiting the orthogonality of $\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C}$ now proves (28).

The first part of the former result is best understood in light of the example of state-space models for EG signals given in Section III-B. For this case it means that, although the factor kernels are defined on the electric potentials, under the given assumptions they depend uniquely on the latent cognitive processes (actually, on their cumulants). Hence each factor kernel $k_i(\mathcal{C}^{r}_{4X_1}, \mathcal{C}^{r}_{4X_2})$ (and, as a consequence, the kernel function (27)) can be envisioned in this case as a similarity measure between the latent cognitive states. This largely depends on the fact that, being defined on cumulant tensors, each factor kernel is invariant with respect to Gaussian output noise.

⁹ In fact, as is well known, there is an entire class of state space representations that are equivalent up to similarity transformations.


The second part, on the other hand, again follows from the fact that the kernel is defined upon a statistic (in fact, the cumulant function) and is thus invariant with respect to the particular sample signal which is observed. In particular, the equations in 2.b state that two time frames (for $T \to \infty$) are maximally similar if they are generated by the same systems, possibly driven by different realizations of the same input processes.

D. Model Estimation

Since $k_r$ in (27) is of positive definite type [38], the Moore-Aronszajn theorem [2] ensures that there exists only one Hilbert space of functions on $\mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ with $k_r$ as reproducing kernel. A possible approach to build a classifier (20) then consists of learning a non-parametric function $f : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ by exploiting the so-called representer theorem [29], [37], [10]. A related approach, more geometrical in nature, relies on the primal-dual techniques that underlie Support Vector Machines (SVMs) and related estimators [41], [39], [40]. In this case one needs to define a positive definite function on $\mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ that serves to evaluate $\langle \phi(X), \phi(Y) \rangle_{\mathcal{F}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ denotes the inner product of some space $\mathcal{F}$. This recipe, known as the kernel trick, avoids the explicit definition of the mapping $\phi : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathcal{F}$ that is used (implicitly) to define an affine separating hyperplane (a.k.a. primal model representation):

$$ f_{(W,b)}(X) := \langle W, \phi(X) \rangle_{\mathcal{F}} + b. \qquad (30) $$

Based upon the latter one can conveniently design a primal optimization problem to find an optimal pair $(W^\star, b^\star)$. In practice, by relying on Lagrangian duality arguments one can then re-parametrize the problem in terms of a finite number of dual variables $\{\alpha_n\}_{n \in \mathbb{N}_N}$ and solve in $(\alpha, b) \in \mathbb{R}^{N+1}$. Vapnik's original SVM formulation [9] translates into convex quadratic programs. By contrast, in least-squares SVM (LS-SVM) [41], a modification of the SVM primal problem leads to a considerably simpler estimation problem. In the present setting, in particular, it is possible to show that the estimation can be performed by finding the solution $(\alpha^\star, b^\star)$ of the following system of linear equations [42]:

$$ \begin{pmatrix} 0 & Y^\top \\ Y & \boldsymbol{\Omega} + \frac{1}{\gamma} \mathbf{I}_N \end{pmatrix} \begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ 1_N \end{pmatrix} \qquad (31) $$

where $\mathbb{R}^N \ni 1_N = (1, 1, \ldots, 1)$, $\boldsymbol{\Omega} \in \mathbb{R}^N \otimes \mathbb{R}^N$ is defined entry-wise by

$$ (\boldsymbol{\Omega})_{ij} = y_i y_j \langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle_{\mathcal{F}} = y_i y_j (\mathbf{K}_r)_{ij} $$

and $\mathbf{K}_r$ is computed as in Algorithm 1.


Finally, to evaluate $f_{(W^\star, b^\star)}$ at a given test point $X$, the dual model representation is exploited:

$$ f_{(W^\star, b^\star)}(X) = \sum_{n \in \mathbb{N}_N} y_n \alpha^\star_n\, k_r(X^{(n)}, X) + b^\star \qquad (32) $$

where the evaluation of $k_r(X^{(n)}, \cdot)$ at the test point $X$ requires the factorization of the matrix unfoldings of $\mathcal{C}^{r}_{4X}$ as in point (ii) of Algorithm 1.
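Putting (31)-(32) together, a bare-bones NumPy sketch of the LS-SVM training and prediction steps (for illustration only; the experiments in Section V rely on the LS-SVMlab toolbox, and the function names here are ours):

import numpy as np

def lssvm_train(K, y, gamma):
    # Solve the linear system (31) for (b, alpha) given the kernel matrix K,
    # labels y in {-1, +1} and regularization parameter gamma.
    N = len(y)
    Omega = np.outer(y, y) * K                       # (Omega)_ij = y_i y_j K_ij
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[0], sol[1:]                           # b*, alpha*

def lssvm_predict(k_test, y, alpha, b):
    # Dual model (32); k_test[n] = k_r(X^(n), X) for the test pattern X.
    return np.sign(np.sum(y * alpha * k_test) + b)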

As a final remark, notice that in Subsection IV-C we explicitly assumed generating systems with a linear state-space representation. This is a well studied class of models that is amenable to analytical insights. Nonetheless, the kernel function in (27) only requires the computation of cumulants of the output signals. Consequently, our overall approach to build a classifier does not require specific model assumptions and, in particular, does not exploit state space representations. In this sense the proposed procedure can be regarded as a discriminative approach. By contrast, the proposals in [48] and [3] that we discussed in some detail in Section I rely on (estimated) models and hence can be regarded as generative in nature.

V. EXPERIMENTAL RESULTS

The purpose of this Section is to compare the results obtained with the kernel function (27) (sub-C) with those obtained with a number of alternative kernels. These include kernels that are also defined on cumulants, namely:

$$ k_{\mathrm{RBF-C}}(X, Y) := \exp\left( -\frac{1}{2\sigma^2} \left\| \mathcal{C}^{r}_{4X} - \mathcal{C}^{r}_{4Y} \right\|_F^2 \right) \qquad (33) $$

$$ k_{\mathrm{lin-C}}(X, Y) := \langle \mathcal{C}^{r}_{4X}, \mathcal{C}^{r}_{4Y} \rangle \qquad (34) $$

as well as kernels that are not based on higher-order statistics and simply act on the concatenation of the vectors corresponding to each channel:

$$ k_{\mathrm{RBF}}(X, Y) := \exp\left( -\frac{1}{2\sigma^2} \| X - Y \|_F^2 \right) \qquad (35) $$

$$ k_{\mathrm{lin}}(X, Y) := \langle X, Y \rangle. \qquad (36) $$

Notice that the latter kernel functions do not exploit the dynamical structure of the data. Nevertheless, they are widely used in the context of machine learning to deal with multivariate time series. First we validate the proposed approach on a synthetic example.


A. Simulation Study

1) Generative Models: For $n \in \mathbb{N}_N$ we let $y_n$ be drawn from a Bernoulli distribution with success probability $1/2$. We generated the corresponding training output processes $X^{(n)}$ according to

$$ Z^{(n)}(t+1) = \mathbf{A}^{(n)} Z^{(n)}(t) + \mathbf{B} W^{(n)}(t) \qquad (37) $$
$$ X^{(n)}(t) = \mathbf{C} Z^{(n)}(t) + V^{(n)}(t) \qquad (38) $$

where $\mathbf{A}^{(n)}, \mathbf{B} \in \mathbb{R}^3 \otimes \mathbb{R}^3$ and $\mathbf{C} \in \mathbb{R}^{20} \otimes \mathbb{R}^3$ were defined as follows:

$$ \big(\bar{\mathbf{A}}^{(n)}\big)_{ij} := \begin{cases} \sin(ij), & \text{if } y_n = 1 \\ \cos(ij), & \text{if } y_n = 0, \end{cases} \qquad (\mathbf{B})_{ij} := \sin(ij), \qquad \mathbf{A}^{(n)} := \bar{\mathbf{A}}^{(n)} \operatorname{diag}(Q^{(n)}) \big(\bar{\mathbf{A}}^{(n)}\big)^{-1}, \qquad (\mathbf{C})_{ij} := \cos(ij) $$

where we let $Q^{(n)} \sim U([-1, 1]^3)$ and $W^{(n)}(t) \sim U([0, 1]^3)$ for any $n$ and $t$. The noise process was generated according to $V^{(n)}(t) \sim \mathcal{N}(0_{20}, \eta \mathbf{I}_{20})$, where $0_{20} \in \mathbb{R}^{20}$ is a vector of zeros. In generating the output measurements we let $t \in \mathbb{N}_{2000}$ and formed $X^{(n)} \in \mathbb{R}^{20} \otimes \mathbb{R}^{600}$ based upon the last 600 time instants. Finally, we take as input pattern the centered version¹⁰ $\bar{X}^{(n)}$ of $X^{(n)}$, namely we let the columns of $\bar{X}^{(n)}$ be $\bar{X}^{(n)}_i = X^{(n)}_i - \frac{1}{600} X^{(n)} 1_{600}$ for $i \in \mathbb{N}_{600}$. This same scheme was used to draw the $N$ training pairs as well as 200 input-output pairs used across different experiments for the purpose of testing the different procedures.
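Under our reading of the matrix definitions above (with $\bar{\mathbf{A}}^{(n)}$ denoting the sin/cos matrix used to build $\mathbf{A}^{(n)}$), the data generation can be sketched as follows; it reuses the simulate_state_space helper of Section II-E and is illustrative only:

import numpy as np

def draw_pair(eta, rng):
    # One labelled pattern (X, y) following (37)-(38) and the matrix choices above.
    y = rng.integers(0, 2)                                 # Bernoulli(1/2) label
    i, j = np.meshgrid(np.arange(1, 4), np.arange(1, 4), indexing='ij')
    Abar = np.sin(i * j) if y == 1 else np.cos(i * j)
    Q = rng.uniform(-1.0, 1.0, 3)
    A = Abar @ np.diag(Q) @ np.linalg.inv(Abar)            # eigenvalues Q, hence stable
    B = np.sin(i * j)
    ii, jj = np.meshgrid(np.arange(1, 21), np.arange(1, 4), indexing='ij')
    C = np.cos(ii * jj)                                    # 20 x 3 output map
    X = simulate_state_space(A, B, C, T=600, eta=eta, burn_in=1400, rng=rng)
    return X - X.mean(axis=1, keepdims=True), 2 * y - 1    # centered pattern, label in {-1, +1}

rng = np.random.default_rng(0)
train = [draw_pair(eta=1.0, rng=rng) for _ in range(24)]   # e.g. N = 24 training pairs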

2) AUCs Corresponding to Different Kernel Functions: For the cumulant-based kernel we take $r = 1$, that is, we use the first channel as the reference signal. Since all the considered kernels are positive definite, they can be used within any SVM-based algorithm. We opted for the LS-SVMlab toolbox (www.esat.kuleuven.be/sista/lssvmlab, [11]), which provides a convenient framework to perform automatic model selection (namely the choice of the tuning parameter $\gamma$ and, where it applies, the kernel parameter $\sigma$). The tuning procedure makes use of coupled simulated annealing [50] to search for tuning parameters and uses the 10-fold cross-validated misclassification error as an optimization criterion. The performance was then measured on an independent test set in terms of accuracy in predicting the target variables. More precisely, we report the area under the ROC curve (AUC) obtained on a set of 200 testing pairs (the same set across different experiments). In Table I we report the mean (with standard deviation in parentheses) of the AUCs obtained by 100 models estimated based upon random draws of $\mathcal{D}_N$ for different values of $N$ and 2 values of $\eta$ (corresponding to 2 different noise levels). Additionally, we report in Figure 1 the box-plots of the AUCs for sub-C.

¹⁰ It is worth remarking at this point that the cumulants of Section II-C were defined for zero-mean vector processes.


TABLE I: Accuracy on test data for 100 models estimated based upon random draws of $\mathcal{D}_N$.

AUCs, η = 1

kernel function | N = 24     | N = 48     | N = 72     | N = 96
sub-C           | 0.86(0.06) | 0.90(0.02) | 0.91(0.01) | 0.91(0.01)
lin-C           | 0.55(0.05) | 0.55(0.04) | 0.55(0.02) | 0.55(0.01)
RBF-C           | 0.68(0.06) | 0.70(0.03) | 0.70(0.01) | 0.71(0.01)
lin             | 0.53(0.04) | 0.54(0.03) | 0.56(0.03) | 0.57(0.01)
RBF             | 0.57(0.09) | 0.55(0.08) | 0.60(0.09) | 0.61(0.09)

AUCs, η = 10

kernel function | N = 24     | N = 48     | N = 72     | N = 96
sub-C           | 0.61(0.05) | 0.64(0.06) | 0.68(0.03) | 0.69(0.01)
lin-C           | 0.49(0.03) | 0.49(0.03) | 0.49(0.02) | 0.49(0.01)
RBF-C           | 0.51(0.03) | 0.52(0.04) | 0.50(0.02) | 0.50(0.02)
lin             | 0.51(0.03) | 0.51(0.03) | 0.52(0.02) | 0.52(0.01)
RBF             | 0.51(0.02) | 0.52(0.03) | 0.50(0.01) | 0.50(0.01)

[Figure 1: two box-plot panels of AUC versus N ∈ {24, 48, 72, 96}; panel (a): η = 10, sub-C; panel (b): η = 1, sub-C.]

Fig. 1: Synthetic example: boxplots of AUCs on test data obtained for an increasing number of training patterns. Plot 1(a) refers to the high noise condition (η = 10), plot 1(b) to the low noise condition (η = 1).

B. Biomag 2010 Data

As a real-life application we considered a brain decoding task where the goal is to discriminate the different mental states of a subject performing a spatial attention task. The experiment was designed in order to investigate whether a subject could modulate his or her attention towards a specific spatial direction (e.g., left or right) so that the related brain activity could be used as the control signal in a brain-computer interface (BCI) system [46], [27]. Brain data were recorded by means of a magnetoencephalography (MEG) system and distributed by the Biomag 2010 data analysis competition¹¹.

¹¹ The dataset was retrieved at ftp://ftp.fcdonders.nl/pub/courses/biomag2010/competition1-dataset.


The experimental setting of the recordings consisted of multiple repetitions of a basic block, called a trial, of the same stimulation protocol. In each trial a fixation cross was presented on a screen to the subject, and after that a visual cue (an arrow) was presented, indicating which direction, either left or right, the subject had to covertly attend to during the next 2500 ms [46]. Data were collected through 274 sensors at 1200 Hz for 255 trials per subject and refer to 4 subjects¹². Two standard pre-processing steps were applied to all time series: first, downsampling from 1200 to 300 Hz and, secondly, band-pass filtering in order to remove frequency components outside the range 5-70 Hz. The brain activity of interest related to attention tasks is known to fall within that range [46]. In this work we present results for subject 1 in the original study. We restricted our analysis to a subset of 45 out of the 274 sensors/time series of the initial dataset, corresponding to the parietal areas of the brain, based on previous findings [46], [27]. Moreover, as a consequence of this selection, the computational cost of computing the cumulant-based kernel was greatly reduced. As in [46], we restricted our analysis to the last 2000 ms of each trial to remove artifacts occurring immediately after the stimulus offset. The final dataset consisted of 255 trials, 127 corresponding to the left stimulus and 128 to the right stimulus. Each trial consisted of 45 time series of 600 time instants.
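These two preprocessing steps can be reproduced, for instance, with SciPy; the sketch below (our own code, with illustrative names and filter order) downsamples a trial from 1200 Hz to 300 Hz and band-pass filters each channel between 5 and 70 Hz:

import numpy as np
from scipy.signal import butter, decimate, filtfilt

def preprocess_trial(X, fs_in=1200.0, fs_out=300.0, band=(5.0, 70.0)):
    # X is (channels x samples); downsample with anti-aliasing, then zero-phase band-pass.
    q = int(fs_in / fs_out)                          # decimation factor (4 here)
    Xd = decimate(X, q, axis=1)
    b, a = butter(4, [band[0] / (fs_out / 2), band[1] / (fs_out / 2)], btype='band')
    return filtfilt(b, a, Xd, axis=1)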

Additionally, we constructed a second dataset in order to reproduce the state-of-the-art analysis proposed in [46] and compare results against the proposed method. In this second dataset the trials were described by the values of the power spectral density (PSD) in the 5-70 Hz range of the time series instead of the time series themselves. This dataset consisted of 255 trials, each described by 45 PSDs, where each PSD was a vector of 129 values.

1) Linearity and Gaussianity Tests: As discussed in the previous Sections, the proposed cumulant-based kernel approach does not rely on specific model assumptions. Even so, the case of state-space models driven by non-Gaussian white processes is amenable to insightful interpretations. Hinich [24] developed algorithms to test for Gaussianity and linearity of scalar processes. The main idea is based on the two-dimensional Fourier transform of the third order cumulant (8) (called the bispectrum). A simple chi-square test can be derived to check for Gaussianity [24]. Additionally, for the case where there is an indication of non-Gaussianity, one can test for linearity [24]. Here we consider the implementation of the tests given in the Higher Order Spectral Analysis Toolbox of MATLAB [45]. Figure 2 reports the pattern obtained from testing the scalar processes corresponding to each one of the 45 channels and each trial of the 255 available. The presence of a marker (a dot) in correspondence to an entry denotes that the Gaussianity hypothesis for that scalar process can be rejected with confidence.

¹² The Biomag dataset corresponds to subjects 1, 2, 3 and 7 respectively in [46].


[Figure 2: channel index (1-45) versus trial index (0-250).]

Fig. 2: Plot obtained from testing each scalar process for Gaussianity for the Biomag data. The presence of a marker (a dot) in correspondence to an entry denotes that the Gaussianity hypothesis for that scalar process can be rejected with confidence.

The analysis shows that 96% of the 255 multivariate processes contain at least one component that is likely to be non-Gaussian and hence are plausibly non-Gaussian vector processes. Moreover, for 82% of the non-Gaussian scalar processes the linearity hypothesis cannot be rejected¹³.

2) AUCs Corresponding to Different Kernel Functions: Out of the 255 recordings available, 100 were selected at random for testing purposes. Within the remaining patterns, $N$ were extracted at random in repeated experiments to form the training set $\mathcal{D}_N$. We tested the same kernel functions (33)-(36) as for the synthetic experiment above, as well as the proposed kernel function (27) (sub-C), within the LS-SVMlab toolbox. Additionally, we tested the kernel function (36) on the PSD dataset in order to reproduce the analysis in [46]. We denote this kernel function as PSD-lin in the following. Among them only sub-C, lin and PSD-lin showed behaviors that are better than random guessing. We argue that the estimation of the remaining kernel classifiers suffered from model selection issues arising from the high dimensionality of the data. We report the results in Table II and the corresponding box-plots in Figure 3.

TABLE II: AUCs measured on test data for 20 models estimated based upon random draws of $\mathcal{D}_N$.

kernel function | N = 10     | N = 30     | N = 50     | N = 70     | N = 90     | N = 110    | N = 130    | N = 150
sub-C           | 0.66(0.11) | 0.71(0.11) | 0.81(0.03) | 0.83(0.02) | 0.84(0.02) | 0.84(0.01) | 0.84(0.01) | 0.85(0.01)
lin             | 0.52(0.06) | 0.50(0.06) | 0.52(0.06) | 0.53(0.06) | 0.55(0.06) | 0.56(0.04) | 0.59(0.05) | 0.60(0.03)
PSD-lin         | 0.55(0.05) | 0.67(0.06) | 0.74(0.07) | 0.75(0.04) | 0.80(0.03) | 0.81(0.03) | 0.82(0.02) | 0.84(0.01)

VI. CONCLUSION

The scientific literature has recently shown that tensor representations of structured data are particularly effective for improving the generalization of learning machines. This is true in particular when the number of training observations is limited.

¹³ Notice, however, that this fact does not imply that the processes are linear with high confidence.


[Figure 3: two box-plot panels of AUC versus N ∈ {10, 30, 50, 70, 90, 110, 130, 150}; panel (a): PSD-lin; panel (b): sub-C.]

Fig. 3: Real-life example: boxplots of AUCs for the best performing kernel functions.

On the other hand, kernels lead to flexible models that have been proven successful in many different contexts. Studying the interplay of the two domains has recently stimulated the interest of the machine learning community. In this spirit, this paper investigated the use of a kernel function that exploits the spectral information of cumulant tensors. These higher order statistics are estimated directly from the observed measurements and require neither specific modeling assumptions nor a foregoing identification step. This translates into a fully data-driven approach to build a classifier for the discrimination of multivariate signals, which in our experiments significantly outperformed simpler kernel approaches as well as a state-of-the-art technique. Additionally, as we have shown, an insightful connection with the dynamics of the generating systems can be drawn under specific modeling assumptions. This is of particular interest for the case of the EG signals that we have considered as an application.

ACKNOWLEDGMENT

Research supported by Research Council KUL: CIF1 and STRT1/08/023, GOA/11/05 Ambiorics, GOA/10/09 MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC) and PFV/10/002 (OPTEC), IOF-SCORES4CHEM, several PhD/postdoc and fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects G0226.06 (cooperative systems and optimization), G0321.06 (Tensors), G.0427.10N, G.0302.07 (SVM/Kernel), G.0557.08 (Glycemia2), G.0588.09 (Brain-machine), research communities (WOG: ICCoS, ANMMM, MLDM); IWT: PhD Grants, Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare; Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2007-2011); IBBT; EU: ERNSI, FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940); Contract Research: AMINAL.

REFERENCES

[1] NIPS Workshop on Tensors, Kernels, and Machine Learning (TKML). Whistler, BC, Canada, 2010.

[2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

[3] A. Bissacco, A. Chiuso, and S. Soatto. Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1958–1972, 2007.

[4] C. Bourin and P. Bondon. Efficiency of high-order moment estimates. IEEE Trans. on Signal Processing, 46(1):255–258, 1998.

[5] S. L. Bressler, C. G. Richter, Y. Chen, and M. Ding. Cortical functional network organization from autoregressive modeling of local field potential oscillations. Statistics in Medicine, 26(21):3875–3885, 2007.

[6] D. R. Brillinger and M. Rosenblatt. Computation and interpretation of k-th order spectra, pages 189–232. Advanced Seminar on Spectral Analysis of Time Series, University of Wisconsin–Madison, 1966. John Wiley, New York, 1967.

[7] B. L. P. Cheung, B. A. Riedner, G. Tononi, and B. Van Veen. Estimation of cortical connectivity from EEG using state-space models. IEEE Trans. on Biomedical Engineering, 57(9):2122–2134, 2010.

[8] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic Press, 2010.

[9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

[10] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–50, 2002.

[11] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, and J. A. K. Suykens. LS-SVMlab toolbox user's guide version 1.7. Internal Report 10-146, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2010.

[12] K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems & Control Letters, 46(4):265–270, 2002.

[13] L. De Lathauwer. Characterizing higher-order tensors by means of subspaces. Internal Report 11-32, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2011.

[14] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.

[15] L. De Lathauwer, B. De Moor, and J. Vandewalle. An introduction to independent component analysis. Journal of Chemometrics, 14(3):123–149, 2000.

[16] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(303), 1998.

[17] G. Favier and A. Y. Kibangou. Tensor-based methods for system identification. International Journal of Sciences and Techniques of Automatic Control Computer Engineering (IJ-STA), 3(1):870–889, 2009.

[18] A. Ferreol and P. Chevalier. On the behavior of current second and higher order blind source separation methods for cyclostationary sources. IEEE Trans. on Signal Processing, 48(6):1712–1725, 2002.

[19] J. A. R. Fonollosa. Sample cumulants of stationary processes: asymptotic results. IEEE Trans. on Signal Processing, 43(4):967–977, 2002.

[20] G. B. Giannakis, Y. Inouye, and J. M. Mendel. Cumulant based identification of multichannel moving-average models. IEEE Trans. on Automatic Control, 34(7):783–787, 2002.

[21] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, third edition, 1996.

[22] E. J. Hannan. Multiple time series. John Wiley & Sons, Inc., New York, NY, USA, 1971.

[23] X. He, D. Cai, and P. Niyogi. Tensor subspace analysis. In Advances in Neural Information Processing Systems (NIPS), 2006, pages 499–506.

[24] M. J. Hinich. Testing for Gaussianity and linearity of a stationary time series. Journal of Time Series Analysis, 3(3):169–176, 1982.

[25] M. W. Kadous. Temporal classification: extending the classification paradigm to multivariate time series. PhD thesis, School of Computer Science and Engineering, The University of New South Wales, 2002.

[26] T. Kailath, A. H. Sayed, and B. Hassibi. Linear estimation. Prentice Hall, Upper Saddle River, NJ, USA, 2000.

[27] S. P. Kelly, E. Lalor, R. B. Reilly, and J. J. Foxe. Independent brain computer interface control using visual spatial attention-dependent modulations of parieto-occipital alpha. In Conference Proceedings of the 2nd International IEEE EMBS Conference on Neural Engineering, pages 667–670, 2005.

[28] H. A. L. Kiers. Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics, 14(3):105–122, 2000.

[29] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.

[30] J. Levine. Finite dimensional filters for a class of nonlinear systems and immersion in a linear system. SIAM Journal on Control and Optimization, 25:1430, 1987.

[31] J. Liang and Z. Ding. Blind MIMO system identification based on cumulant subspace decomposition. IEEE Trans. on Signal Processing, 51(6):1457–1468, 2003.

[32] R. J. Martin. A metric for ARMA processes. IEEE Trans. on Signal Processing, 48(4):1164–1170, 2002.

[33] C. L. Nikias and J. M. Mendel. Signal processing with higher-order spectra. IEEE Signal Processing Magazine, 10(3):10–37, 2002.

[34] C. L. Nikias and A. P. Petropulu. Higher-order spectra analysis: a nonlinear signal processing framework. Prentice Hall, Upper Saddle River, NJ, USA, 1993.

[35] M. B. Priestley. Spectral analysis and time series. Academic Press, 1981.

[36] P. A. Regalia and S. K. Mitra. Kronecker products, unitary matrices and signal processing applications. SIAM Review, 31(4):586–613, 1989.

[37] B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. Proceedings of the Annual Conference on Computational Learning Theory (COLT), pages 416–426, 2001.

[38] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens. A kernel-based framework to tensorial data analysis. Internal Report 10-251, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2010.

[39] I. Steinwart and A. Christmann. Support vector machines. Springer Verlag, 2008.

[40] J. A. K. Suykens, C. Alzate, and K. Pelckmans. Primal and dual model representations in kernel-based learning. Statistics Surveys, 4:148–183 (electronic), DOI: 10.1214/09-SS052, 2010.

[41] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least squares support vector machines. World Scientific, 2002.

[42] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

[43] A. Swami, G. Giannakis, and S. Shamsunder. Multichannel ARMA processes. IEEE Trans. on Signal Processing, 42(4):898–913, 2002.

[44] A. Swami and J. M. Mendel. Time and lag recursive computation of cumulants from a state-space model. IEEE Trans. on Automatic Control, 35(1):4–17, 1990.

[45] A. Swami, J. M. Mendel, and C. L. Nikias. Higher-order spectral analysis toolbox for use with MATLAB. Natick, MA: The MathWorks Inc., 1995.

[46] M. Van Gerven and O. Jensen. Attention modulations of posterior alpha as a control signal for two-dimensional brain-computer interfaces. Journal of Neuroscience Methods, 179(1):78–84, April 2009.

[47] M. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In 7th European Conference on Computer Vision (ECCV), Computer Vision – ECCV 2002, Lecture Notes in Computer Science, volume 2350, pages 447–460, 2002.

[48] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95–119, 2007.

[49] H. O. A. Wold. A study in the analysis of stationary time series. Almqvist & Wiksell, 1954.

[50] S. Xavier de Souza, J. A. K. Suykens, J. Vandewalle, and D. Bolle. Coupled simulated annealing. IEEE Trans. on Systems, Man and Cybernetics, Part B, 40(2):320–335, 2010.
