Classification of Multichannel Signals With Cumulant-Based Kernels
Marco Signoretto, Emanuele Olivetti, Lieven De Lathauwer and Johan A. K. Suykens
Abstract

We consider the problem of training a classifier given a set of labelled multivariate time series (a.k.a. multichannel signals or vector processes). The approach is based on a novel kernel function that exploits the spectral information of tensors of fourth-order cross-cumulants associated to each multichannel signal. This kernel is used in combination with a primal-dual technique to define a discriminative classifier which is entirely data-driven. Nonetheless, insightful connections with the dynamics of the generating systems can be drawn under specific modeling assumptions. The procedure is illustrated on a synthetic example as well as on a brain decoding task where the direction, either left or right, towards which the subject modulates attention is predicted from magnetoencephalography (MEG) signals. Kernel functions that are widely used in machine learning do not exploit the structure inherited from the underlying dynamical nature of the data. A comparison with these kernels, as well as with a state-of-the-art approach, shows the advantage of the proposed technique.

Index Terms

Multiple signal classification, Higher order statistics, Statistical learning, Kernel methods, Brain-computer interfaces.
I. INTRODUCTION
This work is concerned with binary classification of multivariate time series. This task clearly deviates from the standard pattern recognition framework where one deals with a set of measured attributes that do not change over time. A possible approach is based on the construction of metafeatures that exploit the additional structure inherited from the dependency over time [25], after which standard pattern recognition techniques are applied on the derived set of attributes. An appealing class of tailored methodologies is based on probabilistic models, such as Hidden Markov Models (HMMs), that have found successful applications in areas like speech, handwriting and gesture recognition.

Marco Signoretto and Johan A. K. Suykens are with the Katholieke Universiteit Leuven, Department of Electrical Engineering, ESAT, Kasteelpark Arenberg 10, B3001 Leuven (Heverlee), Belgium, tel: +32-(0)16320362, fax: +32-(0)16321970 (e-mail: marco.signoretto, [email protected]).
Emanuele Olivetti is with the NeuroInformatics Laboratory (NILab), Bruno Kessler Foundation and University of Trento, Trento, Italy (e-mail: [email protected]).
Lieven De Lathauwer is with the Group Science, Engineering and Technology of the Katholieke Universiteit Leuven, Campus Kortrijk, E. Sabbelaan 53, 8500 Kortrijk, Belgium, tel: +32-(0)56326062, fax: +32-(0)56246999 (e-mail: [email protected]).

February 21, 2011 DRAFT
In this paper we investigate a kernel-based discriminative methodology for the classification of multivariate signals. The idea makes use of higher-order statistics to build a suitable similarity measure starting from the measured signals. In this respect the kernel-based approach in [3] differs, since it is based on a two-step procedure that first performs the blind identification of the generating systems (one for each multivariate time series) and subsequently carries out the classification by endowing the space of such models with a distance measure. Similarly, the family of kernels found in [48] requires knowledge of a model and hence demands a preliminary identification step. Notice that this step poses many challenges. In fact, most identification techniques involve multistage iterative procedures that can suffer from the problem of multiple local minima. The approach proposed in [3], for example, requires first estimating a set of candidate models. This amounts to 1) determining empirically the dimension of the state process, 2) finding the system matrices and finally 3) solving the Riccati equation to obtain a set of input-related matrices that matches the statistical properties of the output process. Subsequently, for each candidate model, its inverse is used to deconvolve the observed output and recover the input process and the initial state. Remarkably, this deconvolution step is a nontrivial process given that the generating systems might have non-minimal phase (as observed in the case of human motion [3]). An independence criterion is then used to select a suitable model in the bag of candidates. After an additional inference step is taken, one gets to the parametric representation of the underlying dynamical system and can finally compute a system-based kernel to measure the similarity between the estimated models. On the contrary, the approach that we propose does not require postulating any specific model class, nor the explicit identification of systems. In fact, the idea can be seen as entirely data-driven. Nonetheless, if a certain structure can be assumed, interesting properties arise. In Section IV-C, for instance, we draw the link between the proposed similarity measure and the latent state process in the case of multi-input multi-output (MIMO) linear time-invariant state-space models driven by a non-Gaussian white process. This might provide useful insights on certain processes of interest, as in the case of the encephalography (EG) signals that we consider. Another important difference with respect to [3] is that we assume that the transient effect due to initial conditions is negligible in the observed signals. This is supported by the fact that in many practical applications the transient is discarded since
it is believed to carry information potentially misleading for the analysis. For instance, in the context of neuroscience, it is known that the EG signals immediately after the presentation of certain stimuli are not informative with respect to the mental processes of interest [46]. Likewise, we assume that no marginally stable modes (corresponding to poles on the unit circle) are present and hence the steady-state output does not present persistent oscillations¹. This assumption is usually met since in many practical applications signals are preprocessed to avoid persistent rhythmical components. In biosignal processing, for instance, these are often the effect of artifacts such as breathing and heart beating.
First of all, we discuss a framework for the discrimination of time-windows of multivariate signals that extends the classical formalism of statistical learning. Secondly, we introduce a kernel function that exploits the structure inherited from the dynamical nature of the data. We then show properties of such a similarity measure under specific assumptions on the generating mechanism. Finally, we test the proposed approach on a domain that recently attracted wide interest in the scientific community.

In the next Section we recall the definition of cumulants of (vector) processes and present instrumental properties that relate time series with their generating dynamical system. In Section III we formalize the problem of interest. Section IV presents our approach. Section V is then concerned with the application that motivates this study, namely classification of MEG signals. Both synthetic as well as real-life examples are considered. We conclude by drawing our final remarks in Section VI.
II. TENSORS AND CUMULANTS
We denote scalars by lower-case letters ($a, b, c, \ldots$), vectors by capitals ($A, B, C, \ldots$) and matrices by bold-face capitals ($\mathbf{A}, \mathbf{B}, \mathbf{C}, \ldots$). In particular, we denote by $\mathbf{I}_N$ the $N \times N$ identity matrix. We also use lower-case letters $i, j$ as indices and, with some abuse of notation, we will use $I, J$ to denote the index upper bounds. Additionally, we write $\mathbb{N}_I$ to denote the set $\{1, \ldots, I\}$. We write $a_i$ to mean the $i$-th entry of a vector $A$. Similarly, we write $a_{ij}$ to mean the entry with row index $i$ and column index $j$ in a matrix $\mathbf{A}$, i.e., $a_{ij} = (\mathbf{A})_{ij}$.
¹This corresponds to having a purely stochastic vector process: no deterministic component is present in the Wold's decomposition of the process [49].

A. Tensors

Before proceeding we recall that $N$-th order tensors, which we denote by calligraphic letters ($\mathcal{A}, \mathcal{B}, \mathcal{C}, \ldots$), are higher-order generalizations of vectors (first-order tensors) and matrices (second-order tensors). Scalars can be seen as tensors of order zero. We write $a_{i_1,\ldots,i_N}$ to denote $(\mathcal{A})_{i_1,\ldots,i_N}$. An $N$-th order tensor $\mathcal{A}$ has rank 1 if it consists of the outer product of $N$ non-zero vectors $U^{(1)} \in \mathbb{R}^{I_1}, U^{(2)} \in \mathbb{R}^{I_2}, \ldots, U^{(N)} \in \mathbb{R}^{I_N}$, that is, if

$a_{i_1 i_2 \ldots i_N} = u^{(1)}_{i_1} u^{(2)}_{i_2} \cdots u^{(N)}_{i_N}$ (1)

for all values of the indices. The linear span of such elements forms a vector space, denoted by $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$, which is endowed with the inner product

$\langle \mathcal{A}, \mathcal{B} \rangle := \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} a_{i_1 i_2 \cdots i_N}\, b_{i_1 i_2 \cdots i_N}$ (2)

and with the Hilbert-Frobenius norm [14]

$\|\mathcal{A}\|_F := \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$ . (3)

It is often convenient to rearrange the elements of a tensor so that they form a matrix. This operation is referred to as matricization or unfolding.
Definition II.1 ($n$-mode matricization [28]). Assume an $N$-th order tensor $\mathcal{A} \in \mathbb{R}^{I_1} \otimes \cdots \otimes \mathbb{R}^{I_N}$. The $n$-th mode matrix unfolding, denoted as $\mathcal{A}_{\langle n \rangle}$, is the matrix

$\mathbb{R}^{I_n} \otimes \mathbb{R}^{J} \ni \mathcal{A}_{\langle n \rangle} : a^{(n)}_{i_n j} := a_{i_1 i_2 \ldots i_N}$ (4)

where $J := \prod_{p \in \mathbb{N}_N \setminus \{n\}} I_p$ and, for $R := [n+1, n+2, \ldots, N, 1, 2, 3, \ldots, n-1]$, we have²:

$j = 1 + \sum_{l \in \mathbb{N}_{N-1}} (i_{r_l} - 1) \prod_{p \in \mathbb{N}_{l-1}} I_{r_p} =: q(i_{r_1}, i_{r_2}, \ldots, i_{r_{N-1}})$ . (5)
We conclude this introductory part on tensors by defining the instrumental operation of $n$-mode product.

Definition II.2 ($n$-mode product [14]). The $n$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_N}$ by a matrix $\mathbf{U} \in \mathbb{R}^{J_n} \otimes \mathbb{R}^{I_n}$, denoted by $\mathcal{A} \times_n \mathbf{U}$, is the $(I_1 \times I_2 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N)$-tensor with entries

$(\mathcal{A} \times_n \mathbf{U})_{i_1 i_2 \cdots i_{n-1} j_n i_{n+1} \cdots i_N} := \sum_{i_n \in \mathbb{N}_{I_n}} a_{i_1 i_2 \cdots i_{n-1} i_n i_{n+1} \cdots i_N}\, u_{j_n i_n}$ . (6)

²Equation (5) simply defines a rule (in fact, a bijection $q$) that uniquely assigns to $(i_{r_1}, i_{r_2}, \ldots, i_{r_{N-1}})$ the corresponding column index $j$ of $\mathcal{A}_{\langle n \rangle}$.
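Definition II.2 likewise admits a short NumPy sketch (ours; the helper name `mode_product` is hypothetical). The contraction in (6) replaces mode $n$, of size $I_n$, by a new mode of size $J_n$:

```python
import numpy as np

def mode_product(A, U, n):
    """n-mode product A x_n U of eq. (6): contracts mode n of A (size I_n)
    with a J_n x I_n matrix U, so that mode n is replaced by one of size J_n."""
    B = np.tensordot(A, U, axes=([n], [1]))  # contracted mode moves to the end
    return np.moveaxis(B, -1, n)             # put the new J_n mode back in place

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
U = rng.standard_normal((5, 3))
print(mode_product(A, U, 1).shape)  # (2, 5, 4)
```

Along mode 1, each fiber $A[i_1, :, i_3]$ is simply multiplied by $\mathbf{U}$, which is the multilinear-algebra reading of (6).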
B. Cumulants of Scalar Processes

Cumulants are higher-order statistics [34], [33] that have received interest especially for the identification of linear and non-linear systems [17], [20] as well as for blind source separation [8]. For a real-valued discrete-time and zero-mean stationary scalar process $x(t)$ [22], the $J$-th order cumulant function $c_x : \mathbb{Z}^{J-1} \to \mathbb{R}$ is defined in terms of the so-called cumulant generating function [35]. In particular, the second-order cumulant is

$c_x(l_1) := \mathrm{E}[x(t)\, x(t + l_1)]$ . (7)

The third- and fourth-order cumulants are respectively given by:

$c_x(l_1, l_2) := \mathrm{E}[x(t)\, x(t + l_1)\, x(t + l_2)]$ (8)

and

$c_x(l_1, l_2, l_3) := \mathrm{E}[x(t)\, x(t + l_1)\, x(t + l_2)\, x(t + l_3)] - \mathrm{E}[x(t)\, x(t + l_1)]\, \mathrm{E}[x(t + l_2)\, x(t + l_3)] - \mathrm{E}[x(t)\, x(t + l_2)]\, \mathrm{E}[x(t + l_3)\, x(t + l_1)] - \mathrm{E}[x(t)\, x(t + l_3)]\, \mathrm{E}[x(t + l_1)\, x(t + l_2)]$ . (9)
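For concreteness, the estimators obtained by replacing the expectations in (7) and (9) with time averages can be sketched as follows. This is our own illustration (function names are ours), and averaging every term over the common range of valid time indices is a simplifying choice:

```python
import numpy as np

def c2(x, l1):
    """Empirical second-order cumulant of (7) for a zero-mean scalar signal."""
    T = len(x)
    t = np.arange(max(0, -l1), min(T, T - l1))  # keep t and t+l1 inside [0, T)
    return np.mean(x[t] * x[t + l1])

def c4(x, l1, l2, l3):
    """Empirical fourth-order cumulant of (9), all terms averaged over the
    common range of valid time indices."""
    T = len(x)
    lags = [0, l1, l2, l3]
    t = np.arange(max(0, -min(lags)), min(T, T - max(lags)))
    m = lambda a, b: np.mean(x[t + a] * x[t + b])  # second-order moment terms
    m4 = np.mean(x[t] * x[t + l1] * x[t + l2] * x[t + l3])
    return m4 - m(0, l1) * m(l2, l3) - m(0, l2) * m(l3, l1) - m(0, l3) * m(l1, l2)

rng = np.random.default_rng(0)
print(c4(rng.standard_normal(100000), 0, 0, 0))  # near 0 for Gaussian data
```

For uniform noise on $[-1, 1]$ one has $c_x(0,0,0) = \mathrm{E}[x^4] - 3\,\mathrm{E}[x^2]^2 = 1/5 - 1/3 = -2/15$, which the estimator recovers for long records.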
C. Cumulants of Vector Processes

In order to introduce higher-order statistics for a real $D$-variate discrete-time stationary vector process³ $X(t)$ one needs the notion of cross-cumulants between scalar entries of the vector. If $\mathrm{E}[X(t)] = 0$ then, similarly to (7)-(9), one has:

$c^X_{d_1 d_2}(l_1) := \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t + l_1)]$ (10)

which is the second-order cross-cumulant between the $d_1$-th and $d_2$-th entries of the vector process $X$,

$c^X_{d_1 d_2 d_3}(l_1, l_2) := \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t + l_1)\, x_{d_3}(t + l_2)]$ (11)

which is the third-order cross-cumulant between the $d_1$-th, $d_2$-th and $d_3$-th entries of the vector process $X$, and

$c^X_{d_1 d_2 d_3 d_4}(l_1, l_2, l_3) := \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t + l_1)\, x_{d_3}(t + l_2)\, x_{d_4}(t + l_3)] - \mathrm{E}[x_{d_1}(t)\, x_{d_2}(t + l_1)]\, \mathrm{E}[x_{d_3}(t + l_2)\, x_{d_4}(t + l_3)] - \mathrm{E}[x_{d_1}(t)\, x_{d_3}(t + l_2)]\, \mathrm{E}[x_{d_4}(t + l_3)\, x_{d_2}(t + l_1)] - \mathrm{E}[x_{d_1}(t)\, x_{d_4}(t + l_3)]\, \mathrm{E}[x_{d_2}(t + l_1)\, x_{d_3}(t + l_2)]$ (12)

which is the fourth-order cross-cumulant between the $d_1$-th, $d_2$-th, $d_3$-th and $d_4$-th entries of the vector process $X$.

³Sometimes we simply indicate it as $X$.
D. Cumulant Tensors

We now bring in the tensor representations of cumulants that we consider in this paper. To begin with, for a given finite $J \in \mathbb{N}$ and any $i \in \mathbb{N}_{J-1}$, let $L_i \subset \mathbb{Z}$ be an arbitrary finite set with cardinality $S_i$ and denote by $l^i_s$ its $s$-th element. Consider now the collection of $\prod_{i \in \mathbb{N}_{J-1}} S_i$ integer tuples obtained as

$\mathscr{L} := L_1 \times L_2 \times \cdots \times L_{J-1}$ . (13)

Definition II.3 ($\mathscr{L}$-Cumulant Tensor). We call $\mathscr{L}$-cumulant tensor, denoted as $\mathcal{C}^X_{\mathscr{L}}$, the $(2J-1)$-th order tensor

$\mathcal{C}^X_{\mathscr{L}} \in \underbrace{\mathbb{R}^D \otimes \mathbb{R}^D \otimes \cdots \otimes \mathbb{R}^D}_{J \text{ times}} \otimes\, \mathbb{R}^{S_1} \otimes \mathbb{R}^{S_2} \otimes \cdots \otimes \mathbb{R}^{S_{J-1}}$

defined entry-wise by

$(\mathcal{C}^X_{\mathscr{L}})_{d_1 d_2 \cdots d_J s_1 s_2 \cdots s_{J-1}} := c^X_{d_1 d_2 \cdots d_J}(l^1_{s_1}, l^2_{s_2}, \ldots, l^{J-1}_{s_{J-1}})$ (14)

for any tuple $[d_1, d_2, \cdots, d_J, s_1, s_2, \cdots, s_{J-1}] \in \mathbb{N}_D \times \mathbb{N}_D \cdots \times \mathbb{N}_D \times \mathbb{N}_{S_1} \times \mathbb{N}_{S_2} \times \cdots \times \mathbb{N}_{S_{J-1}}$.

Example II.4. As a concrete example, suppose that $X$ is a zero-mean $4$-variate ($D = 4$) stationary process. Take $J = 3$ and let $L_1 = \{-3, 1\}$ and $L_2 = \{-3, 0, 2\}$. In this case $\mathscr{L} = \{(-3,-3), (-3,0), (-3,2), (1,-3), (1,0), (1,2)\}$, and each entry of the cumulant tensor $\mathcal{C}_{\mathscr{L}}$, indexed by a tuple $[d_1, d_2, d_3, s_1, s_2] \in \mathbb{N}_4 \times \mathbb{N}_4 \times \mathbb{N}_4 \times \mathbb{N}_2 \times \mathbb{N}_3$, is a third-order cross-cumulant between the $d_1$-th, $d_2$-th and $d_3$-th entries of the vector process $X$.
A special case that arises from the previous definition corresponds to the situation where, for any $i \in \mathbb{N}_{J-1}$, the set $L_i$ is a singleton. In this case $\mathscr{L}$ is also a singleton, i.e., we have $\mathscr{L} = \{[l^1_1, l^2_1, \ldots, l^{J-1}_1]\}$ for a single fixed tuple of integers $[l^1_1, l^2_1, \ldots, l^{J-1}_1]$. Additionally, $J-1$ dimensions of $\mathcal{C}^X_{\mathscr{L}}$ are degenerate, that is, $\mathcal{C}^X_{\mathscr{L}}$ can be equivalently represented as a $J$-th order tensor.

Definition II.5 ($(l_1, l_2, \ldots, l_{J-1})$-Cumulant Tensor). We call $(l_1, l_2, \ldots, l_{J-1})$-cumulant tensor, denoted as $\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1})$, the $J$-th order tensor

$\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1}) \in \underbrace{\mathbb{R}^D \otimes \mathbb{R}^D \otimes \cdots \otimes \mathbb{R}^D}_{J \text{ times}}$

defined entry-wise by

$(\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1}))_{d_1 d_2 \cdots d_J} := c^X_{d_1 d_2 \cdots d_J}(l_1, l_2, \ldots, l_{J-1})$ . (15)
In the remainder of this paper we simply write cumulant tensor when the specific notion that we are referring to is clear from the context. Additionally, we write, respectively, $\widehat{\mathcal{C}}^X_{\mathscr{L}}$ and $\widehat{\mathcal{C}}^X(l_1, l_2, \ldots, l_{J-1})$ to mean the empirical (a.k.a. sample) versions of the cumulant tensors introduced in the previous definitions. These are simply obtained by replacing the expectation operator by the time average in the definitions of cross-cumulants (10)-(12) [19], [43]. For instance, suppose that $X(1), X(2), \cdots, X(T)$ is a given sample from the process $X$. Then, if we let $T_0 := \max\{1, 1 - l_1, 1 - l_2\}$ and $T' := \min\{T, T - l_1, T - l_2\}$, the empirical version of (11) is obtained as:

$\hat{c}^X_{d_1 d_2 d_3}(l_1, l_2) := \frac{1}{T' - T_0 + 1} \sum_{t = T_0}^{T'} x_{d_1}(t)\, x_{d_2}(t + l_1)\, x_{d_3}(t + l_2)$ .
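This estimator translates directly into code. The sketch below is ours (the function name is hypothetical); `X` is a $D \times T$ array holding one channel per row, and indices are 0-based whereas the text above is 1-based:

```python
import numpy as np

def cross_cum3(X, d1, d2, d3, l1, l2):
    """Empirical third-order cross-cumulant of eq. (11): the time average of
    x_d1(t) x_d2(t+l1) x_d3(t+l2) over the valid range, for zero-mean channels."""
    T = X.shape[1]
    T0 = max(0, -l1, -l2)        # 0-based analogue of T0 = max{1, 1-l1, 1-l2}
    T1 = min(T, T - l1, T - l2)  # exclusive analogue of T' = min{T, T-l1, T-l2}
    t = np.arange(T0, T1)
    return np.mean(X[d1, t] * X[d2, t + l1] * X[d3, t + l2])

rng = np.random.default_rng(0)
X = rng.exponential(size=(3, 50000)) - 1.0     # zero-mean, skewed channels
print(cross_cum3(X, 0, 0, 0, 0, 0))            # approaches E[x^3] = 2 as T grows
```

Taking $d_1 = d_2 = d_3$ and both lags zero recovers the empirical third moment of a single channel, which is non-zero here because the driving noise is skewed.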
We also assume (as done everywhere else, at least implicitly) that sample versions of cumulants converge towards their theoretical counterparts for $T \to \infty$. This is true, in particular, whenever $X$ is strictly stationary and ergodic and $\mathrm{E}[X^2_d(t)] < \infty$ for all $d \in \mathbb{N}_D$ [22, Section IV.2, Theorem 2]⁴. However, as is known, the property of ergodicity is not observationally verifiable from a single realization of a time series [22].

We now link a class of time series with their generating dynamical system. This will allow us to derive useful properties that support the use of cumulant-based representations for the task of discriminating between classes of multichannel signals that we consider in the next Section.

⁴Different is the situation of signals driven by cyclostationary sources, which we do not consider here. In this situation, arising for instance in the context of radiocommunications, it is known that the empirical estimators arising from time averages are biased. See, e.g., [18] and [8, Section 17.2.2.2].
E. Cumulants of Processes with State Space Representation

Suppose that, for the given stationary process $X$, the z-spectrum⁵

$S(z) := \sum_{l \in \mathbb{Z}} \mathcal{C}^X(l)\, z^{-l}$

is a rational function and has maximal normal rank everywhere on the complex unit circle. In this case a fundamental result [26] shows that there are $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{D_Z} \otimes \mathbb{R}^{D_Z}$, $\mathbf{C} \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{D_Z}$ and white processes⁶ $W$ (input process) and $V$ (noise process) such that $X$ can be interpreted as the output process of the MIMO stable system $\Sigma = (\mathbf{A}, \mathbf{B}, \mathbf{C})$ in state-space form:

$Z(t + 1) = \mathbf{A} Z(t) + \mathbf{B} W(t)$ (16)
$X(t) = \mathbf{C} Z(t) + V(t)$ . (17)

Conversely, starting from the generating dynamical system (16)-(17), we have the following basic result.

Proposition II.1 (State-space Model: Cumulant of the Output Process). Assume a stable state space model (16)-(17) specified as above and suppose additionally that the processes $W$ and $V$ are mutually independent. Then for any tuple $[l_1, l_2, \ldots, l_{J-1}] \in \mathbb{Z}^{J-1}$, in steady state we have:

$\mathcal{C}^X(l_1, l_2, \ldots, l_{J-1}) = \mathcal{C}^Z(l_1, l_2, \ldots, l_{J-1}) \times_1 \mathbf{C} \times_2 \mathbf{C} \times_3 \cdots \times_J \mathbf{C} + \mathcal{C}^V(l_1, l_2, \ldots, l_{J-1})$ . (18)

Proof: In steady state the state vector results from a linear filtering of the input process $W$ and hence is statistically independent of the noise process $V$. The result now follows immediately from basic properties of cumulants, namely multilinearity and their additivity under the independence assumption (see e.g. [15, Sec. 2.2, properties 2 and 5]).
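The two properties invoked in the proof are easy to probe numerically. The sketch below (ours, with hypothetical function names) checks additivity at zero lags: for independent zero-mean signals $x$ and $v$, with $v$ Gaussian, the fourth-order cumulant of $x + v$ matches that of $x$ alone, since the Gaussian contribution vanishes:

```python
import numpy as np

def kappa4(x):
    """Zero-lag fourth-order cumulant E[x^4] - 3 E[x^2]^2 of a zero-mean signal."""
    return np.mean(x**4) - 3 * np.mean(x**2) ** 2

rng = np.random.default_rng(0)
T = 500000
x = rng.uniform(-1, 1, T)      # non-Gaussian "state" signal: kappa4 = -2/15
v = rng.standard_normal(T)     # independent Gaussian noise: kappa4 = 0
# additivity under independence plus vanishing Gaussian cumulant:
print(kappa4(x + v), kappa4(x))  # both estimates lie near -2/15
```

The same additivity, applied entry-wise together with the multilinearity of the $\times_n$ products, is what makes the noise term in (18) drop out for $J = 4$ and Gaussian $V$.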
Results in line with Proposition II.1 have appeared in related forms elsewhere. Cumulants of the output process of a stable single-input single-output linear time-invariant system were studied in [6] in terms of the system matrices. Cumulants of a broader class of linear MIMO systems (time-varying, with possibly non-stationary inputs) were the subject of [44]. Additionally, notice that the cumulant of the state process in (16) can also be given in terms of the system matrices $\mathbf{A}$ and $\mathbf{B}$ and the input process cumulant. Our aim, however, is simply to illustrate the advantage of cumulants for discriminating between observed

⁵Notice that $\mathcal{C}^X(l)$ corresponds to $\mathrm{E}[X(t)\, X(t + l)^\top]$.
⁶We recall that a stationary stochastic process $X(t)$ is white if $\mathrm{E}[X(i)\, X^\top(j)] = \mathbf{Q}\, \delta_{ij}$, where $\delta$ is the Kronecker delta function and $\mathbf{Q}$ denotes some covariance matrix.
multivariate processes. To see this, recall that the cumulant of a white Gaussian process vanishes as soon as $J > 2$ [34]. Moreover, if a real random variable has a symmetric distribution about the origin, then all its odd cumulants vanish [34]. In many cases this last condition is naturally verified for the entries of the state vector and one then considers $J \geq 4$. In fact, this choice of $J$ and equation (18) ensure that the output cumulant is solely determined by the underlying dynamical system and the input distribution and, in particular, is independent of the noise process. Generally speaking, it becomes harder to estimate cumulants from sample data as the order increases, i.e. longer data sets are required to obtain the same accuracy [4], [19], [43]. In the following we stick to $J = 4$, as also used in the context of system identification [31].

Finally, notice that, in principle, the linearity of the model (16)-(17) is somewhat restrictive: for general systems both the state update as well as the measurement map from the state vector $Z(t)$ to the output vector $X(t)$ might be governed by nonlinear equations. Nonetheless, it turns out that under certain assumptions nonlinear systems have the same input-output behavior as linear systems with a possibly higher dimensional state process [30].
III. CLASSIFICATION OF MULTICHANNEL SIGNALS

A. General Problem Statement

It is assumed that we are given a dataset of $N$ input-target⁷ pairs

$\mathcal{D}_N = \{(X^{(n)}, y_n) \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \{-1, 1\} : n \in \mathbb{N}_N\}$

drawn in an i.i.d. manner from an unknown fixed joint distribution $P_{XY}$. Specifically, for a sampling interval $\Delta$ and $t_0 > 0$, the $n$-th input pattern

$X^{(n)} = [X^{(n)}(1), X^{(n)}(2), \cdots, X^{(n)}(t), \cdots, X^{(n)}(T)]$ (19)

represents the evolution along the time instants $\Delta + t_0, 2\Delta + t_0, \cdots, t\Delta + t_0, \cdots, T\Delta + t_0$ of the $D_X$-variate process $X^{(n)}$. Notice that the fixed marginal distribution $P_X$ (as well as the conditional distribution $P_{X|Y=y}$ on $\mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$) is induced by the underlying mechanism generating the stationary multivariate time-series. With reference to (16), an interesting modeling assumption might correspond to the case where $(\mathbf{A}, \mathbf{B}, \mathbf{C}) \sim P_{ABC}$, where $P_{ABC}$ is some probability measure on the class of state-space

⁷We reserve the word output to refer to the output process, whereas we use target to address the target label. To avoid confusion, we also stress that an input pattern $X$ is formed upon a time frame extracted from what corresponds (in the standard state-space nomenclature) to the output process $X$ (compare equation (19) with equation (16)).
models. Similarly, a conditional model might be $(\mathbf{A}(y), \mathbf{B}(y), \mathbf{C}(y)) \sim P_{ABC|Y=y}$, where $y$ is a given class label. We shall illustrate this idea in a non-technical manner by means of a concrete application in the following Subsection. Meanwhile, notice that the task of interest amounts to predicting the label $y$ corresponding to a test pattern $X$ by means of a classifier

$\hat{f} : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \{-1, 1\}$ . (20)

This is accomplished by finding a real-valued function $f : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ so that its quantized version $\hat{f} = \mathrm{sign}\, f$ is as close as possible to some minimizer $h$ of a predefined performance criterion, such as the expected risk based on the $0$-$1$ loss:

$h \in \arg\min \mathrm{E}_{XY}[I(h(X) - y)] \quad \text{s.t.} \quad h : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \{-1, 1\}$

where $I$ is the indicator function: $I(x) = 1$ if $x \neq 0$ and $I(x) = 0$ otherwise.
B. The Case of EG Signals

Among others, a state-space framework for multichannel electroencephalography (EEG) signals was proposed in [5], [7] with the purpose of modeling cortical connectivity⁸. In this case (16)-(17) is assumed to describe the physics behind the cortical dynamics. Specifically, the state process $Z$ represents the latent cortical signals. On the other hand, the output process $X$ represents the evolution in time of the electric potentials (the multichannel EEG signals) measured at the scalp. Equation (17) expresses them as a noisy linear combination of the cortical signals.

Suppose now that the $N$ training inputs refer to multichannel EG signals measured from $N$ different subjects on certain time windows. Each input observation is labelled according to the class to which the corresponding subject belongs. Then one might assume that, for each class, there is an underlying nominal system (induced by the class-specific average brain) and a subject-dependent displacement from it. For experiments on a single subject, on the other hand, one might assume that the subject-specific system remains unchanged, whereas what changes is the cortical activity driven by external stimuli belonging to different categories.
IV. CUMULANT-BASED KERNEL APPROACH FOR MULTICHANNEL SIGNALS

In Section II-E we have shown that a representation of multivariate processes by means of cumulants enjoys certain favorable properties. We now exploit this representation in order to establish a similarity measure between two arbitrary input patterns $X_1$ and $X_2$ formed by time windows of output vector processes. We do this by relying on a tensorial kernel that exploits the spectral information of cumulant tensors. We begin by recalling briefly from [38] the notion of factor kernels and that of tensorial kernel. Subsequently, we specialize to the case of cumulant tensors.

⁸We argue that a similar state-space representation is plausible also for describing MEG signals.
A. Factor Kernels and Tensorial Kernel

Let $\mathcal{X}_{\langle p \rangle}$ and $\mathcal{Y}_{\langle p \rangle}$ denote the $p$-th matrix unfolding of two generic elements $\mathcal{X}$ and $\mathcal{Y}$ of the same vector space $\mathbb{R}^{I_1} \otimes \mathbb{R}^{I_2} \otimes \cdots \otimes \mathbb{R}^{I_P}$. The singular value decomposition (SVD) of $\mathcal{X}_{\langle p \rangle}$ can be factorized in a block-partitioned fashion as follows:

$\mathcal{X}_{\langle p \rangle} = \begin{pmatrix} \mathbf{U}^{(p)}_{\mathcal{X},1} & \mathbf{U}^{(p)}_{\mathcal{X},2} \end{pmatrix} \begin{pmatrix} \mathbf{S}^{(p)}_{\mathcal{X},1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{V}^{(p)\top}_{\mathcal{X},1} \\ \mathbf{V}^{(p)\top}_{\mathcal{X},2} \end{pmatrix}$ (21)

where the $\mathbf{0}$'s are matrices of zeros with appropriate sizes. A similar decomposition holds for $\mathcal{Y}_{\langle p \rangle}$. Let now $g(\mathbf{V}^{(p)}_{\mathcal{X},1}, \mathbf{V}^{(p)}_{\mathcal{Y},1})$ be the distance function given by:

$g(\mathbf{V}^{(p)}_{\mathcal{X},1}, \mathbf{V}^{(p)}_{\mathcal{Y},1}) := \left\| \mathbf{V}^{(p)}_{\mathcal{X},1} \mathbf{V}^{(p)\top}_{\mathcal{X},1} - \mathbf{V}^{(p)}_{\mathcal{Y},1} \mathbf{V}^{(p)\top}_{\mathcal{Y},1} \right\|^2_F$ . (22)

The latter is defined in terms of the subspaces spanned by the right singular vectors corresponding to non-vanishing singular values and is known as the projection Frobenius-norm [16], [38]. It can be restated in terms of principal angles that, in turn, are commonly used to define a metric over linear dynamical systems [12], [32]. We can now introduce the $p$-th factor kernel according to

$k_p(\mathcal{X}, \mathcal{Y}) := \exp\left( -\frac{1}{2\sigma^2}\, g(\mathbf{V}^{(p)}_{\mathcal{X},1}, \mathbf{V}^{(p)}_{\mathcal{Y},1}) \right)$ (23)

where $\sigma$ is a user-defined tuning parameter. Finally, we construct the positive definite tensorial kernel function $k$ as the pointwise product of the factor kernels [38]:

$k(\mathcal{X}, \mathcal{Y}) := \prod_{p \in \mathbb{N}_P} k_p(\mathcal{X}, \mathcal{Y})$ . (24)

The latter is a measure of similarity based upon the spectral content of the different matrix unfoldings; see [38] for details. Additionally, we notice that the induced similarity measure can be studied from an algebraic perspective starting from a characterization of the subspaces associated to the matrix unfoldings [13].
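Equations (21)-(24) can be sketched in a few lines of NumPy. This is our own illustration (function names and the rank tolerance are ours); note that the column ordering used by the unfolding is immaterial here, as long as the same convention is applied to both arguments:

```python
import numpy as np

def right_subspace(M, tol=1e-10):
    """V_1 of (21): right singular vectors of M for non-vanishing singular values."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[s > tol * max(1.0, s[0])].T   # columns span the row space of M

def g(V1, V2):
    """Projection Frobenius distance (22) between the two spanned subspaces."""
    return np.linalg.norm(V1 @ V1.T - V2 @ V2.T, 'fro') ** 2

def unfold(A, n):
    """n-th mode unfolding (rows indexed by mode n; fixed column convention)."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def tensorial_kernel(X, Y, sigma=1.0):
    """Pointwise product (24) of the factor kernels (23) over all modes."""
    k = 1.0
    for p in range(X.ndim):
        Vx = right_subspace(unfold(X, p))
        Vy = right_subspace(unfold(Y, p))
        k *= np.exp(-g(Vx, Vy) / (2 * sigma**2))
    return k

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 4))
print(tensorial_kernel(X, X))  # a tensor is maximally similar to itself: 1.0
```

For two generic tensors the factor kernels lie strictly between 0 and 1, so the product (24) does as well.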
B. Cumulant-based Kernel Functions

We now consider a specific type of tensors, i.e., the $\mathscr{L}$-cumulant tensors that were introduced in Section II-D. Notice that, since we stick to the case of fourth-order cumulants, these tensors have order $7$. Additionally, one has $\mathscr{L} := L_1 \times L_2 \times L_3$ where, for any $i \in \{1, 2, 3\}$, $L_i$ is an arbitrary finite subset of $\mathbb{Z}$. Given the general definition of tensorial kernel (24), it is now straightforward to introduce a cumulant-based kernel $k_{\mathscr{L}} : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ according to:

$k_{\mathscr{L}}(X_1, X_2) := k(\widehat{\mathcal{C}}^{X_1}_{\mathscr{L}}, \widehat{\mathcal{C}}^{X_2}_{\mathscr{L}})$ (25)

where the empirical versions of the cumulants are computed based upon the observed time frames $X_1$ and $X_2$. Notice that the freedom in constructing $\mathscr{L}$ gives rise to a whole family of such kernel functions. For general $\mathscr{L}$, however, the evaluation of $k_{\mathscr{L}}$ at $X_1$ and $X_2$ requires computing the SVD of $14$ matrices ($7$ for each argument). The sizes of these matrices depend upon $D_X$ and the cardinalities of $L_i$, $i \in \mathbb{N}_3$. Since in many practical applications $D_X$ is of the order of tens or even hundreds, $k_{\mathscr{L}}$ might be prohibitive to evaluate. As discussed in Section II-D, however, the degenerate case where $\mathscr{L}$ is a singleton implies a significant reduction in the order of the tensor. In the present setting, $\widehat{\mathcal{C}}^{X_1}_{\mathscr{L}}$ and $\widehat{\mathcal{C}}^{X_2}_{\mathscr{L}}$ can be equivalently represented as $4$-th order tensors when $\mathscr{L}$ is a singleton. For this reason we consider the special case where $\mathscr{L} = \{[0, 0, 0]\}$ and hence contains only the zero-lags tuple. A similar restriction is generally considered in the context of blind source separation [18]. An additional reduction can be achieved by fixing a specific channel, indexed by $r \in \mathbb{N}_{D_X}$, that is used as a reference signal. To see this, let us define $\mathcal{C}^r_{4X}$ entry-wise by:

$(\mathcal{C}^r_{4X})_{d_1 d_2 d_3} := (\widehat{\mathcal{C}}^X(0, 0, 0))_{d_1 d_2 d_3 r} \quad \forall\, [d_1, d_2, d_3] \in \mathbb{N}_{D_X} \times \mathbb{N}_{D_X} \times \mathbb{N}_{D_X}$ . (26)

Notice that $\mathcal{C}^r_{4X}$ is a third-order tensor whereas $\widehat{\mathcal{C}}^X(0, 0, 0)$ is a fourth-order tensor. We can then define the kernel function $k_r : \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \times \mathbb{R}^{D_X} \otimes \mathbb{R}^{T} \to \mathbb{R}$ according to:

$k_r(X, Y) := k_1(\mathcal{C}^r_{4X}, \mathcal{C}^r_{4Y})\, k_2(\mathcal{C}^r_{4X}, \mathcal{C}^r_{4Y})\, k_3(\mathcal{C}^r_{4X}, \mathcal{C}^r_{4Y})$ . (27)

The procedure in Algorithm 1 shows the steps required to evaluate $k_r$ for the input patterns in $\mathcal{D}_N$. The advantage of the reference signal directly impacts the computation at line (i) and at line (ii) of Algorithm 1. When needed, a higher level of reduction can be achieved with multiple reference signals.
Algorithm 1: KERNELMATRIX($\mathcal{D}_N, \sigma, r$)

for each $n \in \mathbb{N}_N$ do   (precompute cumulant tensors and SVDs)
    compute $\mathcal{C}^r_{4X^{(n)}}$ (see (26), (15) and (12))   (i)
    $\mathcal{A}^{(n)} \leftarrow \mathcal{C}^r_{4X^{(n)}}$
    for each $i \in \mathbb{N}_3$ do
        $\mathbf{V}^{(i)}_{\mathcal{A}^{(n)},1} \leftarrow \mathrm{SVD}(\mathcal{A}^{(n)}_{\langle i \rangle})$   (ii)
for each $n, m \in \mathbb{N}_N$ with $m > n$ do   (compute entries of the kernel matrix $\mathbf{K}_r$)
    for each $i \in \mathbb{N}_3$ do
        $a_i \leftarrow \exp\left( -\frac{1}{2\sigma^2}\, g\left( \mathbf{V}^{(i)}_{\mathcal{A}^{(n)},1}, \mathbf{V}^{(i)}_{\mathcal{A}^{(m)},1} \right) \right)$
    $(\mathbf{K}_r)_{nm} \leftarrow a_1 a_2 a_3$
$\mathbf{K}_r \leftarrow \mathbf{K}_r + \mathbf{K}_r^\top + \mathbf{I}_N$
return $\mathbf{K}_r$
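A compact NumPy sketch of Algorithm 1 follows. It is our own illustration, not the authors' implementation: function names are ours, channels are assumed zero-mean (we subtract the sample mean), and the zero-lag cumulant tensor of (26) is formed from (12):

```python
import numpy as np

def ref_cumulant_tensor(X, r):
    """C^r_4X of (26): empirical zero-lag fourth-order cross-cumulant tensor,
    fourth index fixed to the reference channel r (X: D x T, zero-mean rows)."""
    D, T = X.shape
    m4 = np.einsum('at,bt,ct,t->abc', X, X, X, X[r]) / T   # E[x_a x_b x_c x_r]
    C = X @ X.T / T                                        # second-order moments
    return (m4
            - np.einsum('ab,c->abc', C, C[:, r])           # E[ab] E[cr]
            - np.einsum('ac,b->abc', C, C[r, :])           # E[ac] E[rb]
            - np.einsum('a,bc->abc', C[:, r], C))          # E[ar] E[bc]

def right_subspace(M, tol=1e-10):
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[s > tol * max(1.0, s[0])].T

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def kernel_matrix(signals, sigma, r):
    """Algorithm 1: Gram matrix of k_r over a list of D x T multichannel signals."""
    # steps (i)-(ii): precompute cumulant tensors and the SVD of each unfolding
    V = []
    for X in signals:
        A = ref_cumulant_tensor(X - X.mean(axis=1, keepdims=True), r)
        V.append([right_subspace(unfold(A, i)) for i in range(3)])
    N = len(signals)
    K = np.zeros((N, N))
    for n in range(N):
        for m in range(n + 1, N):
            a = [np.exp(-np.linalg.norm(V[n][i] @ V[n][i].T
                                        - V[m][i] @ V[m][i].T, 'fro') ** 2
                        / (2 * sigma**2)) for i in range(3)]
            K[n, m] = a[0] * a[1] * a[2]
    return K + K.T + np.eye(N)   # fill the lower triangle and the unit diagonal

rng = np.random.default_rng(0)
signals = [rng.standard_normal((4, 400)) for _ in range(3)]
print(kernel_matrix(signals, sigma=1.0, r=0).shape)  # (3, 3)
```

As in the algorithm, the SVDs are computed once per pattern, so the cost of the double loop is dominated by the $O(N^2)$ projector distances.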
C. Connection with the System Dynamics

The properties of cumulants recalled in the previous Section enable us to bridge the gap between the kernel constructed on the observed outputs and the dynamics of the generating systems. In particular, we have the following invariance properties.

Theorem IV.1 (Asymptotical invariances). Assume systems $\Sigma_1 = (\mathbf{A}_1, \mathbf{B}_1, \mathbf{C}_1)$ and $\Sigma_2 = (\mathbf{A}_2, \mathbf{B}_2, \mathbf{C}_2)$ driven by, respectively, mutually independent white ergodic processes $(W_1, V_1)$ and mutually independent white ergodic processes $(W_2, V_2)$. Additionally, assume that $V_1$ and $V_2$ are Gaussian. Let $X_1 \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ and $X_2 \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ be given realizations from time-frames of the output processes taken from the steady-state behavior of the two systems. Then for $T \to \infty$ and any $i \in \mathbb{N}_3$:

1) if $\mathbf{C}_1 = \mathbf{C}_2$, the evaluation of the $i$-th factor can be restated as:

$k_i(\mathcal{C}^r_{4X_1}, \mathcal{C}^r_{4X_2}) = \exp\left( -\frac{1}{2\sigma^2}\, g\left( \mathbf{V}^{(i)}_{\mathcal{C}^{Z_1}(0,0,0),1}, \mathbf{V}^{(i)}_{\mathcal{C}^{Z_2}(0,0,0),1} \right) \right)$ (28)

and hence depends only on the fourth-order cross-cumulants of the state processes;

2) if $X'_1, X'_2 \in \mathbb{R}^{D_X} \otimes \mathbb{R}^{T}$ are also output realizations obtained from systems $\Sigma_1$ and $\Sigma_2$ respectively, then we have:

a) $k_i(\mathcal{C}^r_{4X'_1}, \mathcal{C}^r_{4X'_2}) = k_i(\mathcal{C}^r_{4X_1}, \mathcal{C}^r_{4X_2})$
b) $k_i(\mathcal{C}^r_{4X'_1}, \mathcal{C}^r_{4X_1}) = 1$ and $k_i(\mathcal{C}^r_{4X'_2}, \mathcal{C}^r_{4X_2}) = 1$.
Proof: We indicate here by $X$ (resp. $X'$) the generic output process, either $X_1$ or $X_2$ (resp. $X'_1$ or $X'_2$). First of all, notice that, since the output processes are ergodic, for $T \to \infty$ we have:

$(\mathcal{C}^r_{4X})_{d_1 d_2 d_3} \to (\mathcal{C}^X(0, 0, 0))_{d_1 d_2 d_3 r} \quad \forall\, [d_1, d_2, d_3] \in \mathbb{N}_{D_X} \times \mathbb{N}_{D_X} \times \mathbb{N}_{D_X}$ .

In particular, this implies that $\mathcal{C}^r_{4X} \equiv \mathcal{C}^r_{4X'}$, from which the equations in 2.a and 2.b immediately follow. In order to prove the first part, notice now that, without loss of generality⁹, we can consider $\mathbf{C}$ to be orthogonal. By Proposition II.1 we have:

$\mathcal{C}^X(0, 0, 0) = \mathcal{C}^Z(0, 0, 0) \times_1 \mathbf{C} \times_2 \mathbf{C} \times_3 \mathbf{C} \times_4 \mathbf{C} + \mathcal{C}^V(0, 0, 0)$

where $\mathcal{C}^V(0, 0, 0)$ vanishes by the Gaussianity assumption on the noise processes. Denote by $\mathbf{A} \otimes \mathbf{B}$ the Kronecker product between any two matrices $\mathbf{A}$ and $\mathbf{B}$. The $i$-th mode unfolding $(\mathcal{C}^r_{4X})_{\langle i \rangle}$ corresponds now to

$(\mathcal{C}^X(0, 0, 0))_{\langle i \rangle} = \mathbf{C}\, (\mathcal{C}^Z(0, 0, 0))_{\langle i \rangle}\, (\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C})^\top$

where we used a basic property of the matricization operation [14], [36]. Notice that $\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C}$ is an orthogonal matrix and hence, if $\mathbf{U}^{(i)}_{\mathcal{C}^Z(0,0,0),1}\, \mathbf{S}^{(i)}_{\mathcal{C}^Z(0,0,0),1}\, \mathbf{V}^{(i)\top}_{\mathcal{C}^Z(0,0,0),1}$ is the thin SVD [21] factorization of $(\mathcal{C}^Z(0, 0, 0))_{\langle i \rangle}$, then for the right singular vectors $\mathbf{V}^{(i)}_{\mathcal{C}^r_{4X},1}$ we have

$\mathbf{V}^{(i)}_{\mathcal{C}^r_{4X},1} = (\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C})\, \mathbf{V}^{(i)}_{\mathcal{C}^Z(0,0,0),1}$ . (29)

Replacing (29) in the formula that gives $k_i(\mathcal{C}^r_{4X_1}, \mathcal{C}^r_{4X_2})$ and exploiting the orthogonality of $\mathbf{C} \otimes \mathbf{C} \otimes \mathbf{C}$ now proves (28).
The first part of the former result is best understood in light of the example of state-space models for EG signals given in Section III-B. For this case it means that, although the factor kernels are defined on the electric potentials, under the given assumptions they depend uniquely on the latent cognitive processes (actually, on their cumulants). Hence each factor kernel $k_i(\mathcal{C}^r_{4X_1}, \mathcal{C}^r_{4X_2})$ (and, as a consequence, the kernel function (27)) can be envisioned in this case as a similarity measure between the latent cognitive states. This largely depends on the fact that, being defined on cumulant tensors, each factor kernel is invariant with respect to Gaussian output noise. The second part, on the other hand, again follows from the fact that the kernel is defined upon a statistic (in fact, the cumulant function) and is then invariant with respect to the particular sample signal which is observed. In particular, the equations in 2.b state that two time frames (for $T \to \infty$) are maximally similar if they are generated by the same systems, possibly driven by different realizations of the same input processes.

⁹In fact, as is well known, there is an entire class of state space representations that are equivalent up to similarity transformations.
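The key step of the proof, namely that a common orthogonal $\mathbf{C}$ leaves the projection distance between the right-singular subspaces of the unfoldings unchanged, can be checked numerically. The sketch below is ours: it works with exact multilinear algebra on synthetic "state cumulant" tensors, so no signal estimation is involved:

```python
import numpy as np

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def right_subspace(M, tol=1e-10):
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[s > tol * max(1.0, s[0])].T

def g(V1, V2):
    return np.linalg.norm(V1 @ V1.T - V2 @ V2.T, 'fro') ** 2

def multilinear(A, C):
    """A x_1 C x_2 C x_3 C for a third-order tensor A (as in the proof, J = 4
    with one mode already fixed to the reference channel)."""
    for n in range(3):
        A = np.moveaxis(np.tensordot(A, C, axes=([n], [1])), -1, n)
    return A

rng = np.random.default_rng(0)
D = 5
Z1 = rng.standard_normal((D, D, D))             # stand-ins for C^Z1(0,0,0)
Z2 = rng.standard_normal((D, D, D))             # and C^Z2(0,0,0)
C, _ = np.linalg.qr(rng.standard_normal((D, D)))  # a random orthogonal matrix

for i in range(3):
    d_state = g(right_subspace(unfold(Z1, i)), right_subspace(unfold(Z2, i)))
    d_out = g(right_subspace(unfold(multilinear(Z1, C), i)),
              right_subspace(unfold(multilinear(Z2, C), i)))
    print(np.isclose(d_state, d_out))  # True: the factor kernels coincide
```

The distances agree because the unfolding of the transformed tensor equals the original unfolding multiplied by orthogonal matrices on both sides, which rotates the projectors in (22) without changing their Frobenius distance.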
D. Model Estimation
Since k_r in (27) is of positive definite type [38], the Moore-Aronszajn theorem [2] ensures that there
exists a unique Hilbert space of functions on R^{D_X} ⊗ R^T with k_r as reproducing kernel. A possible
approach to build a classifier (20) then consists of learning a non-parametric function f : R^{D_X} ⊗ R^T → R
by exploiting the so-called representer theorem [29], [37], [10]. A related approach, more geometrical
in nature, relies on the primal-dual techniques that underlie Support Vector Machines (SVMs) and related
estimators [41], [39], [40]. In this case one needs to define a positive definite function on (R^{D_X} ⊗ R^T) × (R^{D_X} ⊗ R^T)
that serves to evaluate ⟨φ(X), φ(Y)⟩_F, where ⟨·, ·⟩_F denotes the inner product of some space F. This
recipe, known as the kernel trick, avoids the explicit definition of the mapping φ : R^{D_X} ⊗ R^T → F that
is used (implicitly) to define an affine separating hyperplane (a.k.a. the primal model representation):

f_(W,b)(X) := ⟨W, φ(X)⟩_F + b .    (30)

Based upon the latter one can conveniently design a primal optimization problem to find an optimal
pair (W⋆, b⋆). In practice, by relying on Lagrangian duality arguments one can then re-parametrize
the problem in terms of a finite number of dual variables {α_n}_{n∈N_N} and solve in (α, b) ∈ R^{N+1}.
Vapnik's original SVM formulation [9] translates into convex quadratic programs. By contrast, in least-squares
SVM (LS-SVM) [41], a modification of the SVM primal problem leads to a considerably simpler
estimation problem. In the present setting, in particular, the estimation can be performed by finding the
solution (α⋆, b⋆) of the following system of linear equations [42]:

    [ 0    Y⊤            ] [ b ]   [ 0   ]
    [ Y    Ω + (1/γ) I_N ] [ α ] = [ 1_N ]    (31)

where R^N ∋ 1_N = (1, 1, . . . , 1), Ω ∈ R^N ⊗ R^N is defined entry-wise by

    (Ω)_ij = y_i y_j ⟨φ(X^(i)), φ(X^(j))⟩_F = y_i y_j (K_r)_ij

and K_r is computed as in Algorithm 1.
Finally, to evaluate f_(W⋆,b⋆) at a given test point X, the dual model representation is exploited:

    f_(W⋆,b⋆)(X) = Σ_{n∈N_N} y_n α⋆_n k_r(X^(n), X) + b⋆    (32)

where the evaluation of k_r(X^(n), ·) at the test point X requires the factorization of the matrix unfoldings
of C^r_{4X} as in point (ii) of Algorithm 1.
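A minimal sketch of the estimation and prediction steps, assuming the kernel matrix (e.g., K_r from Algorithm 1) is already available. The function names and the toy linear-kernel usage are illustrative, not part of the paper or of LS-SVMlab.

```python
import numpy as np

def lssvm_train(K, y, gamma):
    """Solve the LS-SVM linear system (31):
        [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]
    with Omega_ij = y_i y_j K_ij.  Returns (alpha, b)."""
    N = len(y)
    Omega = np.outer(y, y) * K
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]

def lssvm_predict(k_test, y, alpha, b):
    """Dual model (32): f(X) = sum_n y_n alpha_n k(X^(n), X) + b,
    where k_test[n] = k(X^(n), X)."""
    return k_test @ (y * alpha) + b

# Toy usage with a linear kernel standing in for k_r:
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = lssvm_train(X @ X.T, y, gamma=10.0)
pred = np.sign(lssvm_predict(X @ X.T, y, alpha, b))  # row n: k(X^(n), test point)
```

Note that, unlike the SVM quadratic program, the whole fit reduces to one dense linear solve, which is what makes the LS-SVM variant attractive here.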
As a final remark notice that in Subsection IV-C we explicitly assumed generating systems with a linear
state-space representation. This is a well studied class of models that is amenable to analytical insights.
Nonetheless the kernel function in (27) only requires the computation of cumulants of the output signals.
Consequently our overall approach to build a classifier does not require specific model assumptions and
in particular does not exploit state-space representations. In this sense the proposed procedure can be
regarded as a discriminative approach. By contrast the proposals in [48] and [3] that we discussed in
some detail in Section I rely on (estimated) models and hence can be regarded as generative in nature.
V. EXPERIMENTAL RESULTS
The purpose of this Section is to compare the results obtained with the kernel function (27) (sub−C)
with those obtained with a number of alternative kernels. These include kernels that are also defined on
cumulants, namely:

    k_{RBF−C}(X, Y) := exp( −(1/(2σ^2)) ‖C^r_{4X} − C^r_{4Y}‖^2_F )    (33)

    k_{lin−C}(X, Y) := ⟨C^r_{4X}, C^r_{4Y}⟩    (34)

as well as kernels that are not based on higher-order statistics and simply act on the concatenation of the
vectors corresponding to each channel:

    k_{RBF}(X, Y) := exp( −(1/(2σ^2)) ‖X − Y‖^2_F )    (35)

    k_{lin}(X, Y) := ⟨X, Y⟩ .    (36)

Notice that the latter kernel functions do not exploit the dynamical structure of the data. Nevertheless
they are widely used in the context of machine learning to deal with multivariate time series. First we
validate the proposed approach on a synthetic example.
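Under the definitions above, the baseline kernels (33)-(36) can be sketched directly (the function names are ours, and the cumulant tensors are assumed to be supplied as precomputed arrays):

```python
import numpy as np

def k_lin(X, Y):
    """Linear kernel (36): Frobenius inner product of the signal matrices."""
    return float(np.sum(X * Y))

def k_rbf(X, Y, sigma):
    """RBF kernel (35) based on the Frobenius norm of X - Y."""
    return float(np.exp(-np.sum((X - Y) ** 2) / (2.0 * sigma ** 2)))

# The cumulant-based variants (33)-(34) have exactly the same form but act on
# the fourth-order cumulant tensors C4X, C4Y instead of the raw signals:
def k_lin_C(C4X, C4Y):
    return float(np.sum(C4X * C4Y))

def k_rbf_C(C4X, C4Y, sigma):
    return float(np.exp(-np.sum((C4X - C4Y) ** 2) / (2.0 * sigma ** 2)))
```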
A. Simulation Study

1) Generative Models: For n ∈ N_N we let y_n be drawn from a Bernoulli distribution with success
probability 1/2. We generated the corresponding training output processes X^(n) according to

    Z^(n)(t + 1) = A^(n) Z^(n)(t) + B W^(n)(t)    (37)
    X^(n)(t) = C Z^(n)(t) + V^(n)(t)    (38)

where A^(n), B ∈ R^3 ⊗ R^3 and C ∈ R^20 ⊗ R^3 were defined as follows:

    (Ā^(n))_ij := sin(ij) if y_n = 1, cos(ij) if y_n = 0;
    (B)_ij := sin(ij);   A^(n) := Ā^(n) diag(Q^(n)) (Ā^(n))^{−1};   (C)_ij := cos(ij)

where we let Q^(n) ∼ U([−1, 1]^3) and W^(n)(t) ∼ U([0, 1]^3) for any n and t. The noise process was
generated according to V^(n)(t) ∼ N(0_20, η I_20), where 0_20 ∈ R^20 is a vector of zeros. In generating the
output measurements we let t ∈ N_2000 and formed X̃^(n) ∈ R^20 ⊗ R^600 from the last 600 time instants.
Finally we took as input pattern X^(n) the centered version^10 of X̃^(n), namely we let the columns of X^(n)
be X^(n)_i = X̃^(n)_i − (1/600) X̃^(n) 1_600 for i ∈ N_600. This same scheme was used to draw the N training pairs
as well as the 200 input-output pairs used across different experiments for the purpose of testing the different
procedures.
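The generating scheme can be sketched as follows. We read the definition of A^(n) as a similarity transform of diag(Q^(n)) by the sin/cos matrix, which keeps its eigenvalues in (−1, 1) and hence the system stable; this is our interpretation of the (typographically ambiguous) formula, and η is taken as the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_trial(y, eta, T_sim=2000, T_keep=600, d_z=3, d_x=20):
    """Simulate one labelled trial of the synthetic model (37)-(38).
    The class label y only enters through the entries of Abar."""
    i = np.arange(1, d_z + 1)
    Abar = (np.sin if y == 1 else np.cos)(np.outer(i, i))
    q = rng.uniform(-1, 1, d_z)                 # eigenvalues in (-1, 1): stable
    A = Abar @ np.diag(q) @ np.linalg.inv(Abar)
    B = np.sin(np.outer(i, i))
    C = np.cos(np.outer(np.arange(1, d_x + 1), i))
    Z = np.zeros(d_z)
    X = np.zeros((d_x, T_sim))
    for t in range(T_sim):
        W = rng.uniform(0, 1, d_z)              # non-Gaussian white input
        V = rng.normal(0.0, np.sqrt(eta), d_x)  # Gaussian output noise
        X[:, t] = C @ Z + V                     # output equation (38)
        Z = A @ Z + B @ W                       # state update (37)
    X = X[:, -T_keep:]                          # keep the last 600 instants
    return X - X.mean(axis=1, keepdims=True)    # center each channel

X = draw_trial(y=1, eta=1.0)                    # one 20 x 600 training pattern
```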
2) AUC's Corresponding to Different Kernel Functions: For the cumulant-based kernel we take r = 1,
that is, we use the first channel as the reference signal. Since all the considered kernels are positive
definite they can be used within any SVM-based algorithm. We opted for the LS-SVMlab toolbox
(www.esat.kuleuven.be/sista/lssvmlab, [11]) which provides a convenient framework to perform automatic
model selection (namely the choice of the tuning parameter γ and, where it applies, the kernel parameter
σ). The tuning procedure makes use of coupled simulated annealing [50] to search for tuning parameters
and uses the 10-fold cross-validated misclassification error as optimization criterion. The performance
was then measured on an independent test set in terms of accuracy in predicting the target variables.
More precisely we report the area under the ROC curve (AUC) obtained on a set of 200 testing pairs
(the same set across different experiments). In Table I we report the mean (with standard deviation in
parentheses) of the AUC's obtained by 100 models estimated based upon random draws of D_N for different
values of N and 2 values of η (corresponding to 2 different noise levels). Additionally we report in
Figure 1 the box-plots of AUC's for sub−C.
10 It is worth remarking at this point that the cumulants of Section II-C were defined for zero-mean vector processes.
TABLE I: Accuracy on test data for 100 models estimated based upon random draws of D_N.

AUC's, η = 1
kernel function   N = 24       N = 48       N = 72       N = 96
sub−C             0.86(0.06)   0.90(0.02)   0.91(0.01)   0.91(0.01)
lin−C             0.55(0.05)   0.55(0.04)   0.55(0.02)   0.55(0.01)
RBF−C             0.68(0.06)   0.70(0.03)   0.70(0.01)   0.71(0.01)
lin               0.53(0.04)   0.54(0.03)   0.56(0.03)   0.57(0.01)
RBF               0.57(0.09)   0.55(0.08)   0.60(0.09)   0.61(0.09)

AUC's, η = 10
kernel function   N = 24       N = 48       N = 72       N = 96
sub−C             0.61(0.05)   0.64(0.06)   0.68(0.03)   0.69(0.01)
lin−C             0.49(0.03)   0.49(0.03)   0.49(0.02)   0.49(0.01)
RBF−C             0.51(0.03)   0.52(0.04)   0.50(0.02)   0.50(0.02)
lin               0.51(0.03)   0.51(0.03)   0.52(0.02)   0.52(0.01)
RBF               0.51(0.02)   0.52(0.03)   0.50(0.01)   0.50(0.01)
[Figure 1: two box-plot panels, (a) η = 10, sub−C and (b) η = 1, sub−C; x-axis N = 24, 48, 72, 96; y-axis AUC from 0.4 to 1.]

Fig. 1: Synthetic example: boxplots of AUC's on test data obtained for increasing number of training patterns. Plot 1(a) refers to the high noise level (η = 10), plot 1(b) to the low noise condition (η = 1).
B. Biomag 2010 Data

As a real life application we considered a brain decoding task where the goal is to discriminate the
different mental states of a subject performing a spatial attention task. The experiment was designed
in order to investigate whether a subject could modulate his or her attention towards a specific spatial
direction (e.g., left or right) so that the related brain activity could be used as the control signal in a
brain-computer interface (BCI) system [46], [27]. Brain data were recorded by means of a magnetoencephalography
(MEG) system and distributed by the Biomag 2010 data analysis competition11. The
experimental setting of the recordings consisted of multiple repetitions of a basic block, called trial, of
the same stimulation protocol. In each trial a fixation cross was presented on a screen to the subject
and after that a visual cue (an arrow) indicated which direction, either left or right, the subject had to
covertly attend to during the next 2500 ms [46]. Data were collected through 274 sensors at 1200 Hz
for 255 trials per subject and refer to 4 subjects12. Two standard pre-processing steps were applied to
all timeseries, specifically: first, downsampling from 1200 to 300 Hz and, secondly, band-pass filtering in
order to remove frequency components outside the range 5−70 Hz. The brain activity of interest related
to attention tasks is known to fall within that range [46]. In this work we present results for subject 1
in the original study. We restricted our analysis to a subset of 45 out of the 274 sensors/timeseries of
the initial dataset, corresponding to the parietal areas of the brain, based on previous findings [46], [27].
Moreover, as a consequence of this selection, the computational cost of computing the cumulant-based
kernel was greatly reduced. As in [46], we restricted our analysis to the last 2000 ms of each trial to
remove artifacts occurring immediately after the stimulus offset. The final dataset consisted of 255 trials,
127 corresponding to the left stimulus and 128 to the right stimulus. Each trial consisted of 45 timeseries
of 600 time instants.

11 The dataset was retrieved at ftp://ftp.fcdonders.nl/pub/courses/biomag2010/competition1-dataset.
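The two pre-processing steps can be sketched as follows. This is only an illustration: we use an idealized brick-wall band-pass in the frequency domain followed by plain decimation, since the actual filters of the competition pipeline are not specified here.

```python
import numpy as np

def preprocess(x, fs=1200, fs_new=300, band=(5.0, 70.0)):
    """Band-pass each channel to `band` Hz (idealized brick-wall filter in the
    frequency domain), then downsample from fs to fs_new by decimation.
    `x` has shape (channels, samples)."""
    n = x.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    Xf = np.fft.rfft(x, axis=-1)
    Xf[..., (freqs < band[0]) | (freqs > band[1])] = 0.0  # zero out-of-band bins
    filtered = np.fft.irfft(Xf, n, axis=-1)
    return filtered[..., :: fs // fs_new]                 # 1200 -> 300 Hz

# One second of test data: a 50 Hz tone (in band) and a 100 Hz tone (out of band).
t = np.arange(1200) / 1200.0
sig = np.vstack([np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 100 * t)])
out = preprocess(sig)
```

Because the pass band ends at 70 Hz, well below the new 150 Hz Nyquist frequency, the plain decimation step introduces no aliasing in this sketch.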
Additionally, we constructed a second dataset in order to reproduce the state-of-the-art analysis proposed
in [46] and compare results against the proposed method. In this second dataset the trials were
described by the values of the power spectral density (PSD) of the timeseries in the 5−70 Hz range
instead of by the timeseries themselves. This dataset consisted of 255 trials, each described by 45 PSDs,
where each PSD was a vector of 129 values.
1) Linearity and Gaussianity Tests: As discussed in the previous Sections, the proposed cumulant-based
kernel approach does not rely on specific model assumptions. Even so, the case of state-space models
driven by non-Gaussian white processes is amenable to insightful interpretations. Hinich [24] developed
algorithms to test for Gaussianity and linearity of scalar processes. The main idea is based on the two-dimensional
Fourier transform of the third-order cumulant (8) (called the bispectrum). A simple chi-square
test can be derived to check for Gaussianity [24]. Additionally, for the case where there is indication of
non-Gaussianity one can test for linearity [24]. Here we consider the implementation of the tests given
in the Higher Order Spectral Analysis Toolbox of MATLAB [45]. Figure 2 reports the pattern obtained
from testing the scalar processes corresponding to each one of the 45 channels and each trial of the 255
available. The presence of a marker (a dot) in correspondence to an entry denotes that the Gaussianity
hypothesis for that scalar process can be rejected with confidence. The analysis shows that 96% of the
255 multivariate processes contain at least one component that is likely to be non-Gaussian and hence
are plausibly non-Gaussian vector processes. Moreover, for 82% of the non-Gaussian scalar processes the
linearity hypothesis cannot be rejected13.

12 The Biomag dataset corresponds to subjects 1, 2, 3 and 7, respectively, in [46].

[Figure 2: marker plot, 45 channels (y-axis) by 255 trials (x-axis).]

Fig. 2: Plot obtained from testing each scalar process for Gaussianity for the Biomag data. The presence of a marker (a dot) in correspondence to an entry denotes that the Gaussianity hypothesis for that scalar process can be rejected with confidence.
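The Hinich test itself requires bispectrum estimation; as a much cruder substitute (ours, not the paper's), a moment-based statistic in the Jarque-Bera style already flags strong deviations from Gaussianity through third- and fourth-order moments. Note that its chi-square calibration assumes i.i.d. samples, so for serially correlated signals it is only indicative.

```python
import numpy as np

def jb_stat(x):
    """Jarque-Bera-style normality statistic: under Gaussianity (i.i.d. data),
    n * (S**2 / 6 + K**2 / 24) is asymptotically chi-square with 2 dof, where
    S is the sample skewness and K the sample excess kurtosis."""
    n = len(x)
    x = x - x.mean()
    s2 = np.mean(x ** 2)
    S = np.mean(x ** 3) / s2 ** 1.5
    K = np.mean(x ** 4) / s2 ** 2 - 3.0
    return n * (S ** 2 / 6.0 + K ** 2 / 24.0)

rng = np.random.default_rng(1)
print(jb_stat(rng.normal(size=5000)))     # near typical chi-square(2) values
print(jb_stat(rng.uniform(-1, 1, 5000)))  # far in the tail: non-Gaussian
```

For a bispectrum-based test with a proper calibration for time series, the toolbox routine cited in the text remains the reference.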
2) AUC's Corresponding to Different Kernel Functions: Out of the 255 recordings available, 100 were
selected at random for testing purposes. Within the remaining patterns, N were extracted at random in
repeated experiments to form the training set D_N. We tested the same kernel functions (33)-(36) as for the
synthetic experiment above, as well as the proposed kernel function (27) (sub-C), within the LS-SVMlab
toolbox. Additionally we tested the kernel function (36) on the PSD dataset in order to reproduce the
analysis in [46]. We denote this kernel function as PSD-lin in the following. Among them only sub-C, lin
and PSD-lin showed behaviors that are better than random guessing. We argue that the estimation of the
remaining kernel classifiers suffered from model selection issues arising from the high dimensionality of
the data. We report the results in Table II and the corresponding box-plots in Figure 3.
TABLE II: AUC's measured on test data for 20 models estimated based upon random draws of D_N.

kernel function   N = 10       N = 30       N = 50       N = 70       N = 90       N = 110      N = 130      N = 150
sub−C             0.66(0.11)   0.71(0.11)   0.81(0.03)   0.83(0.02)   0.84(0.02)   0.84(0.01)   0.84(0.01)   0.85(0.01)
lin               0.52(0.06)   0.50(0.06)   0.52(0.06)   0.53(0.06)   0.55(0.06)   0.56(0.04)   0.59(0.05)   0.60(0.03)
PSD-lin           0.55(0.05)   0.67(0.06)   0.74(0.07)   0.75(0.04)   0.80(0.03)   0.81(0.03)   0.82(0.02)   0.84(0.01)
13 Notice, however, that this fact does not imply that the processes are linear with high confidence.

[Figure 3: two box-plot panels, (a) PSD−lin and (b) sub−C; x-axis N = 10, 30, 50, 70, 90, 110, 130, 150; y-axis AUC from 0.45 to 0.9.]

Fig. 3: Real-life example: boxplots of AUC's for the best performing kernel functions.

VI. CONCLUSION

The scientific literature has recently shown that tensor representations of structured data are particularly
effective for improving the generalization of learning machines. This is true in particular when the
number of training observations is limited. On the other hand, kernels lead to flexible models that have
proven successful in many different contexts. Studying the interplay of the two domains has recently
stimulated the interest of the machine learning community. In this spirit, this paper investigated the
use of a kernel function that exploits the spectral information of cumulant tensors. These higher order
statistics are estimated directly from observed measurements and require neither specific modeling
assumptions nor a foregoing identification step. This translates into a fully data-driven approach to build
a classifier for the discrimination of multivariate signals that, in our experiments, significantly outperformed
simpler kernel approaches as well as a state-of-the-art technique. Additionally, as we have shown, an
insightful connection with the dynamics of the generating systems can be drawn under specific modeling
assumptions. This is of particular interest for the case of EG signals that we have considered as an
application.
ACKNOWLEDGMENT
Research supported by Research Council KUL: CIF1 and STRT1/08/023 (2), GOA/11/05 Ambiorics, GOA/10/09 MaNet, CoE EF/05/006 Optimization in Engineering (OPTEC)
and PFV/10/002 (OPTEC), IOF-SCORES4CHEM, several PhD/postdoc and fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects: G0226.06 (cooperative systems and
optimization), G0321.06 (Tensors), G.0427.10N, G.0302.07 (SVM/Kernel), G.0557.08 (Glycemia2), G.0588.09 (Brain-machine), research communities (WOG: ICCoS, ANMMM, MLDM);
IWT: PhD Grants, Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare; Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and
optimization, 2007-2011); IBBT; EU: ERNSI, FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940); Contract Research: AMINAL.
REFERENCES
[1] NIPS Workshop on Tensors, Kernels, and Machine Learning (TKML). Whistler, BC, Canada, 2010.

[2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
[3] A. Bissacco, A. Chiuso, and S. Soatto. Classification and recognition of dynamical models: The role of phase, independent components, kernels and optimal transport. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1958–1972, 2007.

[4] C. Bourin and P. Bondon. Efficiency of high-order moment estimates. IEEE Trans. on Signal Processing, 46(1):255–258, 1998.

[5] S. L. Bressler, C. G. Richter, Y. Chen, and M. Ding. Cortical functional network organization from autoregressive modeling of local field potential oscillations. Statistics in Medicine, 26(21):3875–3885, 2007.

[6] D. R. Brillinger and M. Rosenblatt. Computation and interpretation of k-th order spectra, pages 189–232. Advanced Seminar on Spectral Analysis of Time Series, University of Wisconsin–Madison, 1966. John Wiley, New York, 1967.

[7] B. L. P. Cheung, B. A. Riedner, G. Tononi, and B. Van Veen. Estimation of cortical connectivity from EEG using state-space models. IEEE Trans. on Biomedical Engineering, 57(9):2122–2134, 2010.

[8] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.

[9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

[10] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–50, 2002.

[11] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, and J. A. K. Suykens. LS-SVMlab toolbox user's guide version 1.7. Internal Report 10-146, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2010.

[12] K. De Cock and B. De Moor. Subspace angles between ARMA models. Systems & Control Letters, 46(4):265–270, 2002.

[13] L. De Lathauwer. Characterizing higher-order tensors by means of subspaces. Internal Report 11-32, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2011.

[14] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.

[15] L. De Lathauwer, B. De Moor, and J. Vandewalle. An introduction to independent component analysis. Journal of Chemometrics, 14(3):123–149, 2000.

[16] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[17] G. Favier and A. Y. Kibangou. Tensor-based methods for system identification. International Journal of Sciences and Techniques of Automatic Control & Computer Engineering (IJ-STA), 3(1):870–889, 2009.

[18] A. Ferreol and P. Chevalier. On the behavior of current second and higher order blind source separation methods for cyclostationary sources. IEEE Trans. on Signal Processing, 48(6):1712–1725, 2000.

[19] J. A. R. Fonollosa. Sample cumulants of stationary processes: asymptotic results. IEEE Trans. on Signal Processing, 43(4):967–977, 1995.

[20] G. B. Giannakis, Y. Inouye, and J. M. Mendel. Cumulant based identification of multichannel moving-average models. IEEE Trans. on Automatic Control, 34(7):783–787, 1989.

[21] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, third edition, 1996.

[22] E. J. Hannan. Multiple Time Series. John Wiley & Sons, Inc., New York, NY, USA, 1971.

[23] X. He, D. Cai, and P. Niyogi. Tensor subspace analysis. In Advances in Neural Information Processing Systems (NIPS), pages 499–506, 2006.
[24] M. J. Hinich. Testing for Gaussianity and linearity of a stationary time series. Journal of Time Series Analysis, 3(3):169–176, 1982.

[25] M. W. Kadous. Temporal classification: extending the classification paradigm to multivariate time series. PhD thesis, School of Computer Science and Engineering, The University of New South Wales, 2002.

[26] T. Kailath, A. H. Sayed, and B. Hassibi. Linear Estimation. Prentice Hall, Upper Saddle River, NJ, USA, 2000.

[27] S. P. Kelly, E. Lalor, R. B. Reilly, and J. J. Foxe. Independent brain computer interface control using visual spatial attention-dependent modulations of parieto-occipital alpha. In Conference Proceedings of the 2nd International IEEE EMBS Conference on Neural Engineering, pages 667–670, 2005.

[28] H. A. L. Kiers. Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics, 14(3):105–122, 2000.

[29] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.

[30] J. Levine. Finite dimensional filters for a class of nonlinear systems and immersion in a linear system. SIAM Journal on Control and Optimization, 25:1430, 1987.

[31] J. Liang and Z. Ding. Blind MIMO system identification based on cumulant subspace decomposition. IEEE Trans. on Signal Processing, 51(6):1457–1468, 2003.

[32] R. J. Martin. A metric for ARMA processes. IEEE Trans. on Signal Processing, 48(4):1164–1170, 2000.

[33] C. L. Nikias and J. M. Mendel. Signal processing with higher-order spectra. IEEE Signal Processing Magazine, 10(3):10–37, 1993.

[34] C. L. Nikias and A. P. Petropulu. Higher-Order Spectra Analysis: A Nonlinear Signal Processing Framework. Prentice Hall, Upper Saddle River, NJ, USA, 1993.

[35] M. B. Priestley. Spectral Analysis and Time Series. Academic Press, 1981.

[36] P. A. Regalia and S. K. Mitra. Kronecker products, unitary matrices and signal processing applications. SIAM Review, 31(4):586–613, 1989.

[37] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), pages 416–426, 2001.

[38] M. Signoretto, L. De Lathauwer, and J. A. K. Suykens. A kernel-based framework to tensorial data analysis. Internal Report 10-251, ESAT-SISTA, K.U. Leuven (Leuven, Belgium), 2010.

[39] I. Steinwart and A. Christmann. Support Vector Machines. Springer Verlag, 2008.

[40] J. A. K. Suykens, C. Alzate, and K. Pelckmans. Primal and dual model representations in kernel-based learning. Statistics Surveys, 4:148–183 (electronic), DOI: 10.1214/09-SS052, 2010.

[41] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.

[42] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

[43] A. Swami, G. Giannakis, and S. Shamsunder. Multichannel ARMA processes. IEEE Trans. on Signal Processing, 42(4):898–913, 1994.

[44] A. Swami and J. M. Mendel. Time and lag recursive computation of cumulants from a state-space model. IEEE Trans. on Automatic Control, 35(1):4–17, 1990.

[45] A. Swami, J. M. Mendel, and C. L. Nikias. Higher-Order Spectral Analysis Toolbox for use with MATLAB. Natick, MA: The MathWorks, Inc., 1995.
[46] M. Van Gerven and O. Jensen. Attention modulations of posterior alpha as a control signal for two-dimensional brain-computer interfaces. Journal of Neuroscience Methods, 179(1):78–84, April 2009.

[47] M. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Computer Vision – ECCV 2002, 7th European Conference on Computer Vision, Lecture Notes in Computer Science, volume 2350, pages 447–460, 2002.

[48] S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95–119, 2007.

[49] H. O. A. Wold. A Study in the Analysis of Stationary Time Series. Almqvist & Wiksell, 1954.

[50] S. Xavier de Souza, J. A. K. Suykens, J. Vandewalle, and D. Bolle. Coupled simulated annealing. IEEE Trans. on Systems, Man and Cybernetics–Part B, 40(2):320–335, 2010.