Prior Knowledge and Sparse Methods for Convolved Multiple Outputs Gaussian Processes


Transcript of Prior Knowledge and Sparse Methods for Convolved Multiple Outputs Gaussian Processes

Prior Knowledge and Sparse Methods for Convolved Multiple Outputs Gaussian Processes

Mauricio A. Álvarez

Joint work with Neil D. Lawrence, David Luengo and Michalis K. Titsias

School of Computer Science, University of Manchester


Contents

- Latent force models.
- Sparse approximations for latent force models.

Data driven paradigm

- Traditionally, the main focus in machine learning has been model generation through a data-driven paradigm.
- Combine a data set with a flexible class of models and, through regularization, make predictions on unseen data.
- Problems:
  - Data is scarce relative to the complexity of the system.
  - The model is forced to extrapolate.

Mechanistic models

- Models inspired by the underlying knowledge of a physical system are common in many areas.
- Description of a well characterized physical process that underpins the system, typically represented with a set of differential equations.
- Identifying and specifying all the interactions might not be feasible.
- A mechanistic model can enable accurate prediction in regions where there may be no available training data.

Hybrid systems

- We suggest a hybrid approach involving a mechanistic model of the system augmented through machine learning techniques.
- Dynamical systems (e.g. incorporating first order and second order differential equations).
- Partial differential equations for systems with multiple inputs.

Latent variable model: definition

- Our approach can be seen as a type of latent variable model,

  Y = UW + E,

  where Y ∈ R^{N×D}, U ∈ R^{N×Q}, W ∈ R^{Q×D} (Q < D) and E is matrix-variate white Gaussian noise with columns e_{:,d} ∼ N(0, Σ).

- In PCA and FA the common approach to deal with the unknowns is to integrate out U under a Gaussian prior and optimize with respect to W.
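As a purely illustrative numerical sketch of this linear model (the sizes and the noise level below are arbitrary choices, not values from the talk):

```python
import numpy as np

# Minimal sketch of the latent variable model Y = UW + E with isotropic noise.
N, D, Q = 100, 5, 2                      # illustrative sizes (Q < D)
U = np.random.randn(N, Q)                # latent variables, one row per data point
W = np.random.randn(Q, D)                # mixing weights
E = 0.1 * np.random.randn(N, D)          # white Gaussian noise, here Sigma = 0.01 * I
Y = U @ W + E                            # observed outputs
```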

Latent variable model: alternative view

- Data with a temporal nature and a Gaussian (Markov) prior for the rows of U leads to the Kalman filter/smoother.
- Consider a joint distribution for p(U | t), t = [t_1, ..., t_N]^T, with the form of a Gaussian process (GP),

  p(U | t) = \prod_{q=1}^{Q} N(u_{:,q} | 0, K_{u_{:,q},u_{:,q}}).

  The latent variables are random functions, {u_q(t)}_{q=1}^{Q}, with associated covariances K_{u_{:,q},u_{:,q}}.
- The GP for Y can be readily implemented. In [TSJ05] this is known as a semi-parametric latent factor model (SLFM).
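A small sketch of drawing the Q latent functions on a time grid; the squared exponential covariance below is only an assumed example of K_{u_{:,q},u_{:,q}}:

```python
import numpy as np

def sample_latent_gps(t, Q, lengthscale=1.0, jitter=1e-8):
    """Draw Q independent latent functions u_q(t) ~ GP(0, k) at the times t, using a
    squared exponential covariance as an illustrative choice of K_{u_{:,q},u_{:,q}}."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    K = np.exp(-0.5 * (t - t.T) ** 2 / lengthscale ** 2) + jitter * np.eye(t.size)
    return np.linalg.cholesky(K) @ np.random.randn(t.size, Q)   # column q is a draw of u_{:,q}

U = sample_latent_gps(np.linspace(0, 10, 200), Q=2)
```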

Latent force model: mechanistic interpretation (1)

- We include a further dynamical system with a mechanistic inspiration.
- Reinterpret the equation Y = UW + E as a force balance equation,

  YB = US + E,

  where S ∈ R^{Q×D} is a matrix of sensitivities, B ∈ R^{D×D} is a diagonal matrix of spring constants, W = SB^{-1} and e_{:,d} ∼ N(0, B^T Σ B).

Latent force model: mechanistic interpretation (2)

[Figure: each output y_d(t) is attached to a spring with constant B_d and driven by the latent forces u_1(t), ..., u_Q(t) through the sensitivities S_{d1}, S_{d2}, ..., S_{dQ}.]

YB = US + E

Latent force model: extension (1)

- The model can be extended by including dampers and masses.
- We can write

  YB + \dot{Y}C + \ddot{Y}M = US + E,

  where \dot{Y} is the first derivative of Y w.r.t. time, \ddot{Y} is the second derivative of Y w.r.t. time, C is a diagonal matrix of damping coefficients, M is a diagonal matrix of masses and E is matrix-variate white Gaussian noise.

Latent force model: extension (2)

[Figure: each output y_d(t) is now attached to a spring B_d, a damper C_d and a mass m_d, and driven by the latent forces u_1(t), ..., u_Q(t) through the sensitivities S_{d1}, S_{d2}, ..., S_{dQ}.]

YB + \dot{Y}C + \ddot{Y}M = US + E

Latent force model: properties

- This model allows us to include behaviours like inertia and resonance.
- We refer to these systems as latent force models (LFMs).
- One way of thinking of our model is to consider puppetry.

Second Order Dynamical System

Using the system of second order differential equations

m_d \frac{d^2 y_d(t)}{dt^2} + C_d \frac{d y_d(t)}{dt} + B_d y_d(t) = \sum_{q=1}^{Q} S_{dq} u_q(t),

where
  u_q(t): latent forces
  y_d(t): displacements over time
  C_d: damper constant for the d-th output
  B_d: spring constant for the d-th output
  m_d: mass constant for the d-th output
  S_{dq}: sensitivity of the d-th output to the q-th input.

Second Order Dynamical System: solution

Solving for y_d(t), we obtain

y_d(t) = \sum_{q=1}^{Q} L_{dq}[u_q](t),

where the linear operator is given by a convolution:

L_{dq}[u_q](t) = \frac{S_{dq}}{\omega_d} \int_0^t \exp(-\alpha_d (t - \tau)) \sin(\omega_d (t - \tau)) u_q(\tau) d\tau,

with \omega_d = \sqrt{4 B_d - C_d^2}/2 and \alpha_d = C_d/2.
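A sketch of evaluating this convolution numerically for a single output and a single latent force, assuming a uniform time grid, an underdamped system (4B_d > C_d^2) and m_d = 1:

```python
import numpy as np

def lfm_output(t_grid, u, S_dq, B_d, C_d):
    """Approximate L_dq[u_q](t) = (S_dq / omega_d) * int_0^t exp(-alpha_d (t - tau))
    * sin(omega_d (t - tau)) u_q(tau) dtau with the trapezoidal rule."""
    alpha_d = C_d / 2.0
    omega_d = np.sqrt(4.0 * B_d - C_d ** 2) / 2.0
    y = np.zeros_like(t_grid)
    for i, t in enumerate(t_grid):
        tau = t_grid[: i + 1]
        kern = np.exp(-alpha_d * (t - tau)) * np.sin(omega_d * (t - tau))
        y[i] = (S_dq / omega_d) * np.trapz(kern * u[: i + 1], tau)
    return y

# Response of one output to a unit step in the latent force (illustrative parameters).
t = np.linspace(0.0, 20.0, 400)
y = lfm_output(t, np.ones_like(t), S_dq=1.0, B_d=3.0, C_d=0.5)
```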

Second Order Dynamical System: covariance matrix

The behaviour of the system is summarized by the damping ratio

\zeta_d = \frac{1}{2} C_d / \sqrt{B_d}:

  \zeta_d > 1: overdamped system
  \zeta_d = 1: critically damped system
  \zeta_d < 1: underdamped system
  \zeta_d = 0: undamped system (no friction)

Example covariance matrix with \zeta_1 = 0.125 (underdamped), \zeta_2 = 2 (overdamped) and \zeta_3 = 1 (critically damped):

[Figure: joint covariance matrix over the latent force f(t) and the outputs y_1(t), y_2(t), y_3(t).]

Second Order Dynamical System: samples from GP

[Figure: joint samples from the ODE covariance; cyan: u(t), red: y_1(t) (underdamped), green: y_2(t) (overdamped), blue: y_3(t) (critically damped).]

Motion Capture Data (1)

- CMU motion capture data, motions 18, 19 and 20 from subject 49.
- Motions 18 and 19 for training and 20 for testing.

Motion Capture Data (2)

- The data was down-sampled by 32 (from 120 frames per second to 3.75).
- We focused on the subject's left arm.
- For testing, we condition only on the observations of the shoulder's orientation (motion 20) to make predictions for the rest of the arm's angles.

Motion Capture Results

Root mean squared (RMS) angle error for prediction of the left arm's configuration in the motion capture data. Prediction with the latent force model outperforms prediction with regression for all angles apart from the radius.

  Angle            | Latent Force Error | Regression Error
  Radius           | 4.11               | 4.02
  Wrist            | 6.55               | 6.65
  Hand X rotation  | 1.82               | 3.21
  Hand Z rotation  | 2.76               | 6.14
  Thumb X rotation | 1.77               | 3.10
  Thumb Z rotation | 2.73               | 6.09

Diffusion in the Swiss Jura

[Figure: maps of lead, cadmium and copper concentrations over a region of the Swiss Jura.]

Diffusion equation

- A simplified version of the diffusion equation is

  \frac{\partial y_d(x, t)}{\partial t} = \sum_{j=1}^{p} \kappa_d \frac{\partial^2 y_d(x, t)}{\partial x_j^2},

  where y_d(x, t) are the concentrations of each pollutant.

- The solution to the system is then given by

  y_d(x, t) = \sum_{q=1}^{Q} S_{dq} \int_{R^p} G_d(x, x', t) u_q(x') dx',

  where u_q(x) represents the concentration of pollutants at time zero and G_d(x, x', t) is the Green's function

  G_d(x, x', t) = \frac{1}{2^p \pi^{p/2} T_d^{p/2}} \exp\left( -\sum_{j=1}^{p} \frac{(x_j - x_j')^2}{4 T_d} \right),

  with T_d = \kappa_d t.
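A sketch of the Green's function and of a discretized version of the integral above for one-dimensional inputs; the grid, diffusion constant and initial concentration are illustrative assumptions:

```python
import numpy as np

def diffusion_green(x, x_prime, t, kappa_d):
    """Heat-kernel Green's function G_d(x, x', t) in p dimensions, with T_d = kappa_d * t."""
    x, x_prime = np.atleast_1d(x).astype(float), np.atleast_1d(x_prime).astype(float)
    p, T_d = x.size, kappa_d * t
    norm = 1.0 / (2.0 ** p * np.pi ** (p / 2.0) * T_d ** (p / 2.0))
    return norm * np.exp(-np.sum((x - x_prime) ** 2) / (4.0 * T_d))

def concentration(x, t, kappa_d, S_dq, u_q, z_grid):
    """Approximate y_d(x, t) = S_dq * int G_d(x, x', t) u_q(x') dx' on a uniform 1-D grid
    (a single latent function shown; sum over q for several)."""
    dz = z_grid[1] - z_grid[0]
    return S_dq * sum(diffusion_green(x, z, t, kappa_d) * u_q(z) for z in z_grid) * dz

z = np.linspace(-5.0, 5.0, 200)
y = concentration(x=0.5, t=1.0, kappa_d=0.2, S_dq=1.0, u_q=lambda z: np.exp(-z ** 2), z_grid=z)
```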

Prediction of Metal Concentrations

- Prediction of a primary variable by conditioning on the values of some secondary variables.

  Primary variable | Secondary variables
  Cd               | Ni, Zn
  Cu               | Pb, Ni, Zn
  Pb               | Cu, Ni, Zn
  Co               | Ni, Zn

- Comparison between the diffusion kernel, independent GPs and "ordinary co-kriging".

  Metals | IGPs             | GPDK             | OCK
  Cd     | 0.5823 ± 0.0133  | 0.4505 ± 0.0126  | 0.5
  Cu     | 15.9357 ± 0.0907 | 7.1677 ± 0.2266  | 7.8
  Pb     | 22.9141 ± 0.6076 | 10.1097 ± 0.2842 | 10.7
  Co     | 2.0735 ± 0.1070  | 1.7546 ± 0.0895  | 1.5

LFM in the context of convolution processes

- Consider a set of functions {f_d(x)}_{d=1}^{D}.
- Each function can be expressed as

  f_d(x) = \int_{\mathcal{X}} G_d(x - z) u(z) dz = G_d(x) * u(x).

- Influence of more than one latent function, {u_q(z)}_{q=1}^{Q}, and inclusion of an independent process w_d(x):

  y_d(x) = f_d(x) + w_d(x) = \sum_{q=1}^{Q} \int_{\mathcal{X}} G_{dq}(x - z) u_q(z) dz + w_d(x).

A pictorial representation

[Figure: the latent function u(x) is convolved with the smoothing kernels G_1(x) and G_2(x) to produce the output functions f_1(x) and f_2(x); adding the independent processes w_1(x) and w_2(x) gives the noisy outputs y_1(x) and y_2(x).]

u(x): latent function.
G(x): smoothing kernel.
f(x): output function.
w(x): independent process.
y(x): noisy output function.

Covariance of the output functions.

The covariance between y_d(x) and y_{d'}(x') is given as

cov[y_d(x), y_{d'}(x')] = cov[f_d(x), f_{d'}(x')] + cov[w_d(x), w_{d'}(x')] \delta_{d,d'},

where

cov[f_d(x), f_{d'}(x')] = \sum_{q=1}^{Q} \sum_{q'=1}^{Q} \int_{\mathcal{X}} G_{dq}(x - z) \int_{\mathcal{X}} G_{d'q'}(x' - z') cov[u_q(z), u_{q'}(z')] dz' dz.
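For one-dimensional inputs and a single latent function, this double integral can be checked directly by quadrature; a sketch, where the Gaussian smoothing kernels and the SE latent covariance are illustrative choices:

```python
import numpy as np
from scipy.integrate import dblquad

def output_cross_cov(x, x_prime, G_d, G_dp, k_u, lo=-10.0, hi=10.0):
    """cov[f_d(x), f_d'(x')] = int int G_d(x - z) G_d'(x' - z') cov[u(z), u(z')] dz' dz
    for 1-D inputs and one latent function (the limits truncate the real line)."""
    integrand = lambda z_prime, z: G_d(x - z) * G_dp(x_prime - z_prime) * k_u(z, z_prime)
    value, _ = dblquad(integrand, lo, hi, lambda z: lo, lambda z: hi)
    return value

G1 = lambda tau: np.exp(-0.5 * tau ** 2)              # smoothing kernel for output 1
G2 = lambda tau: np.exp(-0.5 * tau ** 2 / 0.25)       # smoothing kernel for output 2
k_u = lambda z, zp: np.exp(-0.5 * (z - zp) ** 2)      # SE covariance for the latent process
c = output_cross_cov(0.3, -0.1, G1, G2, k_u)
```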

Likelihood of the full Gaussian process.

- The likelihood of the model is given by

  p(y | X, φ) = N(y | 0, K_{f,f} + Σ),

  where y = [y_1^T, ..., y_D^T]^T is the stacked vector of outputs, K_{f,f} is the covariance matrix with blocks cov[f_d, f_{d'}], Σ is the matrix of noise variances, φ is the set of parameters of the covariance matrix and X = {x_1, ..., x_N} is the set of input vectors.

- Learning from the log-likelihood involves the inverse of K_{f,f} + Σ, whose cost grows as O(N^3 D^3).
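A minimal sketch of evaluating this log-likelihood with a Cholesky factorization, assuming the stacked (ND x ND) covariance K_{f,f} has already been built and the noise is isotropic:

```python
import numpy as np

def full_gp_log_marginal(y, K_ff, noise_var):
    """log N(y | 0, K_ff + Sigma) for the stacked outputs y (length N*D), Sigma = noise_var * I.
    The Cholesky factorization of the (ND x ND) matrix dominates the cost, i.e. O(N^3 D^3)."""
    n = y.size
    L = np.linalg.cholesky(K_ff + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)
```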

Predictive distribution of the full Gaussian process.

- Predictive distribution at X_*:

  p(y_* | y, X, X_*, φ) = N(μ_*, Λ_*),

  with

  μ_* = K_{f_*,f} (K_{f,f} + Σ)^{-1} y
  Λ_* = K_{f_*,f_*} - K_{f_*,f} (K_{f,f} + Σ)^{-1} K_{f,f_*} + Σ

- Prediction is O(ND) for the mean and O(N^2 D^2) for the variance, per test point and after pre-computations.
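A matching sketch of the predictive equations above, again under the assumption of precomputed covariance blocks and isotropic noise:

```python
import numpy as np

def full_gp_predict(y, K_ff, K_sf, K_ss, noise_var):
    """mu_* = K_{f*,f} (K_{f,f} + Sigma)^{-1} y,
    Lambda_* = K_{f*,f*} - K_{f*,f} (K_{f,f} + Sigma)^{-1} K_{f,f*} + Sigma_*."""
    Ky = K_ff + noise_var * np.eye(y.size)
    L = np.linalg.cholesky(Ky)
    V = np.linalg.solve(L, K_sf.T)                                 # L^{-1} K_{f,f*}
    mu = K_sf @ np.linalg.solve(L.T, np.linalg.solve(L, y))
    Lam = K_ss - V.T @ V + noise_var * np.eye(K_ss.shape[0])       # predictive covariance
    return mu, Lam
```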

Conditional prior distribution.

Sample from p(u):  f_d(x) = \int_{\mathcal{X}} G_d(x - z) u(z) dz

Discretize u at the points {z_k}:  f_d(x) ≈ \sum_{k} G_d(x - z_k) u(z_k)

Sample from p(u | u), with u the vector of samples u(z_k):  f_d(x) ≈ \int_{\mathcal{X}} G_d(x - z) E[u(z) | u] dz

The conditional independence assumption I.

- This form for f_d(x) leads to the following likelihood,

  p(f | u, Z) = N(f | K_{f,u} K_{u,u}^{-1} u, K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{u,f}),

  where
    u: discrete sample from the latent function
    Z: set of input vectors corresponding to u
    K_{u,u}: covariance matrix between the samples of the latent function
    K_{f,u} = K_{u,f}^T: cross-covariance matrix between latent and output functions

- Even though we conditioned on u, we still have dependencies between outputs due to the uncertainty in p(u | u).

The conditional independence assumption II.

Our key assumption is that the outputs will be independent even if we have only observed u rather than the whole function u(z).

[Figure: in the conditional covariance K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{u,f}, only the within-output blocks K_{f_d,f_d} - K_{f_d,u} K_{u,u}^{-1} K_{u,f_d} are kept; the cross-output blocks are set to zero.]

Better approximations can be obtained when E[u(z) | u] approximates u(z) well.

Comparison of marginal likelihoods

Integrating out u, the marginal likelihood is given as

p(y | Z, X, θ) = N(y | 0, K_{f,u} K_{u,u}^{-1} K_{u,f} + blockdiag[K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{u,f}] + Σ).

[Figure: block-wise comparison of this approximate covariance against the full K_{f,f}; the cross-output blocks are replaced by the low-rank terms K_{f_d,u} K_{u,u}^{-1} K_{u,f_{d'}}, while the diagonal blocks K_{f_d,f_d} remain exact.]

Discrete case: K_{f,f} ≈ G K_{u,u} G^T, with [G]_{i,k} = G_d(x_i - z_k).
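A sketch of assembling this approximate covariance, assuming K_{f,u}, K_{u,u} and the per-output diagonal blocks of K_{f,f} are already available:

```python
import numpy as np
from scipy.linalg import block_diag

def pitc_covariance(K_fu, K_uu, K_ff_blocks, noise_var):
    """K_fu K_uu^{-1} K_uf + blockdiag[K_ff - K_fu K_uu^{-1} K_uf] + Sigma, where
    K_ff_blocks lists the D within-output blocks of K_ff in the same ordering as K_fu."""
    Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)       # low-rank (Nystrom-style) term
    start, corrections = 0, []
    for B in K_ff_blocks:                             # exact correction inside each output block
        idx = slice(start, start + B.shape[0])
        corrections.append(B - Q_ff[idx, idx])
        start += B.shape[0]
    return Q_ff + block_diag(*corrections) + noise_var * np.eye(Q_ff.shape[0])
```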

Predictive distribution for the sparse approximation

Predictive distribution:

p(y_* | y, X, X_*, Z, θ) = N(μ_*, Λ_*), with

μ_* = K_{f_*,u} A^{-1} K_{u,f} (D + Σ)^{-1} y
Λ_* = D_* + K_{f_*,u} A^{-1} K_{u,f_*} + Σ
A = K_{u,u} + K_{u,f} (D + Σ)^{-1} K_{f,u}
D_* = blockdiag[K_{f_*,f_*} - K_{f_*,u} K_{u,u}^{-1} K_{u,f_*}]
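A sketch of the corresponding predictive computation; here D is the block-diagonal correction blockdiag[K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{u,f}] and D_* its test-point analogue, both assumed precomputed. In practice the block-diagonal structure of D + Σ makes its inverse cheap; the dense inverse below only keeps the sketch short:

```python
import numpy as np

def sparse_predict(y, K_fu, K_uu, K_su, D, D_star, noise_var):
    """mu_* = K_{f*,u} A^{-1} K_{u,f} (D + Sigma)^{-1} y,
    Lambda_* = D_* + K_{f*,u} A^{-1} K_{u,f*} + Sigma_*,
    with A = K_{u,u} + K_{u,f} (D + Sigma)^{-1} K_{f,u}."""
    DS_inv = np.linalg.inv(D + noise_var * np.eye(y.size))     # block-diagonal solve in practice
    A = K_uu + K_fu.T @ DS_inv @ K_fu
    mu = K_su @ np.linalg.solve(A, K_fu.T @ (DS_inv @ y))
    Lam = D_star + K_su @ np.linalg.solve(A, K_su.T) + noise_var * np.eye(K_su.shape[0])
    return mu, Lam
```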

Remarks

- For learning, the computational demand is in the calculation of the block-diagonal term, which grows as O(N^3 D) + O(NDM^2) (with Q = 1). Storage is O(N^2 D) + O(NDM).
- For inference, the computation of the mean grows as O(DM) and the computation of the variance as O(DM^2), after some pre-computations and for one test point.
- The functional form of the approximation is almost identical to that of the Partially Independent Training Conditional (PITC) approximation [QR05].

Additional conditional independencies

- The N^3 term in the computational complexity and the N^2 term in storage in PITC are still expensive for larger data sets.
- An additional assumption is independence over the data points.

[Figure: graphical model in which the output points f_1, f_2, ..., f_D are conditionally independent given u.]

Comparison of marginal likelihoods

The marginal likelihood is given as

p(y | Z, X, θ) = N(0, K_{f,u} K_{u,u}^{-1} K_{u,f} + diag[K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{u,f}] + Σ).

[Figure: block-wise comparison against the full K_{f,f}; the cross-output blocks are again replaced by K_{f_d,u} K_{u,u}^{-1} K_{u,f_{d'}}, and within each diagonal block only the diagonal of K_{f_d,f_d} - K_{f_d,u} K_{u,u}^{-1} K_{u,f_d} is retained, giving the approximate blocks Q_{f_d,f_d}.]
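Relative to the PITC sketch earlier, the only change is that the correction keeps just the diagonal; a minimal version under the same assumptions:

```python
import numpy as np

def fitc_covariance(K_fu, K_uu, K_ff_diag, noise_var):
    """K_fu K_uu^{-1} K_uf + diag[K_ff - K_fu K_uu^{-1} K_uf] + Sigma, where
    K_ff_diag holds only the diagonal entries of K_ff."""
    Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)
    return Q_ff + np.diag(K_ff_diag - np.diag(Q_ff)) + noise_var * np.eye(Q_ff.shape[0])
```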

Computational requirements

- The computational demand is now O(NDM^2). Storage is O(NDM).
- For inference, the computation of the mean grows as O(DM) and the computation of the variance as O(DM^2), after some pre-computations and for one test point.
- Similar to the Fully Independent Training Conditional (FITC) approximation [QR05, SG06].

Deterministic approximation

- We could also assume that, given the latent functions, the outputs are deterministic.
- The marginal likelihood is then given as

  p(y | Z, X, θ) = N(0, K_{f,u} K_{u,u}^{-1} K_{u,f} + Σ).

- The computational complexity is the same as for FITC.
- Deterministic Training Conditional (DTC) approximation.

Examples

- For all our experiments we considered squared exponential covariance functions for the latent process, of the form

  k_{u,u}(x, x') = \exp\left[ -\frac{1}{2} (x - x')^T L (x - x') \right],

  where L is a diagonal matrix which allows for different length-scales along each dimension.

- The smoothing kernel had the same form,

  G_d(τ) = \frac{S_d |L_d|^{1/2}}{(2π)^{p/2}} \exp\left[ -\frac{1}{2} τ^T L_d τ \right],

  where S_d ∈ R and L_d is a symmetric positive definite matrix.
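Both expressions translate directly into code; a sketch, with the diagonals of L and L_d passed as vectors (so L_d is restricted to a diagonal matrix here):

```python
import numpy as np

def k_uu(x, x_prime, L_diag):
    """Squared exponential latent covariance exp(-0.5 (x - x')^T L (x - x')), L diagonal."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-0.5 * np.sum(L_diag * diff ** 2))

def smoothing_kernel(tau, S_d, L_d_diag):
    """Gaussian smoothing kernel G_d(tau) = S_d |L_d|^{1/2} (2 pi)^{-p/2} exp(-0.5 tau^T L_d tau)."""
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    p = tau.size
    return (S_d * np.sqrt(np.prod(L_d_diag)) / (2.0 * np.pi) ** (p / 2.0)
            * np.exp(-0.5 * np.sum(L_d_diag * tau ** 2)))
```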

Examples: Artificial data 1D

Four outputs generated from the full GP (D = 4).

[Figure: predictions for y_4(x) using the full GP, the DTC approximation, the FITC approximation and the PITC approximation.]

Artificial example (cont.)

Standardized mean square error (SMSE); all numbers are to be multiplied by 10^{-2}.

  Method  | SMSE y_1(x) | SMSE y_2(x) | SMSE y_3(x) | SMSE y_4(x)
  Full GP | 1.06 ± 0.08 | 0.99 ± 0.06 | 1.10 ± 0.09 | 1.05 ± 0.09
  DTC     | 1.06 ± 0.08 | 0.99 ± 0.06 | 1.12 ± 0.09 | 1.05 ± 0.09
  FITC    | 1.06 ± 0.08 | 0.99 ± 0.06 | 1.10 ± 0.08 | 1.05 ± 0.08
  PITC    | 1.06 ± 0.08 | 0.99 ± 0.06 | 1.10 ± 0.09 | 1.05 ± 0.09

Mean standardized log loss (MSLL); more negative values indicate better models.

  Method  | MSLL y_1(x)  | MSLL y_2(x)  | MSLL y_3(x)  | MSLL y_4(x)
  Full GP | −2.27 ± 0.04 | −2.30 ± 0.03 | −2.25 ± 0.04 | −2.27 ± 0.05
  DTC     | −0.98 ± 0.18 | −0.98 ± 0.18 | −1.25 ± 0.16 | −1.25 ± 0.16
  FITC    | −2.26 ± 0.04 | −2.29 ± 0.03 | −2.16 ± 0.04 | −2.23 ± 0.05
  PITC    | −2.27 ± 0.04 | −2.30 ± 0.03 | −2.23 ± 0.04 | −2.26 ± 0.05

Training times per iteration of each model are 1.97 ± 0.02 s for the full GP, 0.20 ± 0.01 s for DTC, 0.41 ± 0.03 s for FITC and 0.59 ± 0.05 s for PITC.

Predicting school examination scores

- Multitask learning problem.
- The goal is to predict the exam score obtained by a particular student, described by a set of 20 features, belonging to a specific school (the task).
- The data consist of examination records from 139 secondary schools in the years 1985, 1986 and 1987.
- Features include the year of the exam, gender, VR band and ethnic group for each student, which are transformed into dummy variables.
- The dataset consists of 4004 samples. Ten repetitions, with 75% of the data for training and 25% for testing.
- Gaussian smoothing function.

Predicting school examination scores (cont.)

[Figure: percentage of explained variance for the sparse approximations (D5–D50, F5–F50, P5–P50), the intrinsic coregionalization model and independent GPs.]

D: DTC. F: FITC. P: PITC. ICM: Intrinsic coregionalization model [BCW08]. IND: Independent GPs [BCW08].

A dynamic model for transcription regulation

- Microarray studies have made the simultaneous measurement of mRNA from thousands of genes practical.
- Transcription is governed by the presence or absence of transcription factor proteins that act as switches to turn on and off the expression of the genes.
- The active concentration of these transcription factors is typically much more difficult to measure.

A dynamic model for transcription regulation (cont.)

- There are Q transcription factors {u_q(t)}_{q=1}^{Q}, each of them represented through a Gaussian process, u_q(t) ∼ GP(0, k_{u_q u_q}(t, t')).
- Our model is based on the following differential equation [ALL09],

  \frac{d f_d(t)}{dt} = \gamma_d + \sum_{q=1}^{Q} S_{dq} u_q(t) - B_d f_d(t),

  where \gamma_d is the basal transcription rate of gene d, S_{dq} is the sensitivity of gene d to the transcription factor u_q(t) and B_d is the decay rate of the mRNA.
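A sketch of integrating this equation numerically for one gene, given callables for the transcription factor activities; the bump-shaped u_1 below is purely illustrative:

```python
import numpy as np
from scipy.integrate import solve_ivp

def gene_expression(t_grid, u_funcs, gamma_d, S_d, B_d, f0=0.0):
    """Integrate df_d/dt = gamma_d + sum_q S_dq u_q(t) - B_d f_d(t) for one gene."""
    def rhs(t, f):
        return [gamma_d + sum(S_dq * u_q(t) for S_dq, u_q in zip(S_d, u_funcs)) - B_d * f[0]]
    sol = solve_ivp(rhs, (t_grid[0], t_grid[-1]), [f0], t_eval=t_grid)
    return sol.y[0]

t = np.linspace(0.0, 24.0, 200)
u1 = lambda s: np.exp(-0.5 * (s - 10.0) ** 2 / 4.0)   # illustrative transcription factor activity
f = gene_expression(t, [u1], gamma_d=0.1, S_d=[1.0], B_d=0.3)
```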

A dynamic model for transcription regulation (cont.)

- Benchmark yeast cell cycle dataset of [SSZ+98].
- The data is preprocessed as described in [SLR06], giving a final dataset of 1975 genes and 104 transcription factors, with 24 time points for each gene.
- We optimize the marginal likelihood through scaled conjugate gradient.

A dynamic model for transcription regulation (cont.)

[Figures: gene expression profile and protein concentration for ACE2; gene expression profile and protein concentration for SWI5.]

A dynamic model for transcription regulation (cont.)

[Figures: histograms over genes of the SNR of the sensitivities S_{d,q}, associated to ACE2 and to SWI5.]

- For ACE2, the highest SNR values are obtained for CTS1, SCW11, DSE1 and DSE2, while, for example, NCE4 appears to be repressed, with a low SNR value ([SSZ+98, SLR06]).
- SWI5 appears to activate the genes AMN1 and PLC2 ([CLCB01]).

Swiss Jura example revisited

[Figure: mean absolute error for cadmium for the sparse approximations (D50–D500, F50–F500, P50–P500), the full GP, cokriging and independent GPs.]

D: DTC. F: FITC. P: PITC. FGP: Full Gaussian Process. CK: Cokriging [Goo97]. IND: Independent GPs [BCW08].

Conclusions

- A hybrid approach for the use of simple mechanistic models with Gaussian processes.
- Convolution processes as a way to augment data-driven models with characteristics of physical systems.
- Gaussian processes as meaningful prior distributions.
- Sparse approximations for multiple-output convolved GPs exploiting conditional independencies.

Acknowledgments

- To Google, for a Google Research Award.
- To EPSRC, for Grant No. EP/F005687/1, "Gaussian Processes for Systems Identification with Applications in Systems Biology".

References I

[ALL09] Mauricio Álvarez, David Luengo, and Neil D. Lawrence. Latent force models. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 9–16, Clearwater Beach, Florida, 16–18 April 2009. JMLR W&CP 5.

[BCW08] Edwin V. Bonilla, Kian Ming Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In John C. Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, NIPS, volume 20, Cambridge, MA, 2008. MIT Press.

[CLCB01] Alejandro Colman-Lerner, Tina E. Chin, and Roger Brent. Yeast Cbk1 and Mob2 activate daughter-specific genetic programs to induce asymmetric cell fates. Cell, 107:739–750, 2001.

[Goo97] Pierre Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University Press, USA, 1997.

[QR05] Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

[SG06] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Yair Weiss, Bernhard Schölkopf, and John C. Platt, editors, NIPS, volume 18, Cambridge, MA, 2006. MIT Press.

[SLR06] Guido Sanguinetti, Neil D. Lawrence, and Magnus Rattray. Probabilistic inference of transcription factor concentrations and gene-specific regulatory activities. Bioinformatics, 22:2275–2281, 2006.

References II

[SSZ+98] Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273–3297, 1998.

[TSJ05] Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS 10, pages 333–340, Barbados, 6–8 January 2005. Society for Artificial Intelligence and Statistics.