A new look at state-space models for neural data


J Comput Neurosci (2010) 29:107–126, DOI 10.1007/s10827-009-0179-x

A new look at state-space models for neural data

Liam Paninski · Yashar Ahmadian · Daniel Gil Ferreira · Shinsuke Koyama · Kamiar Rahnama Rad · Michael Vidne · Joshua Vogelstein · Wei Wu

Received: 22 December 2008 / Revised: 6 July 2009 / Accepted: 16 July 2009 / Published online: 1 August 2009
© Springer Science + Business Media, LLC 2009

Abstract State space methods have proven indispensable in neural data analysis. However, common methods for performing inference in state-space models with non-Gaussian observations rely on certain approximations which are not always accurate. Here we review direct optimization methods that avoid these approximations, but that nonetheless retain the computational efficiency of the approximate methods. We discuss a variety of examples, applying these direct optimization techniques to problems in spike train smoothing, stimulus decoding, parameter estimation, and inference of synaptic properties. Along the way, we point out connections to some related standard statistical methods, including spline smoothing and isotonic regression. Finally, we note that the computational methods reviewed here do not in fact depend on the state-space

Action Editor: Israel Nelken

L. Paninski (B) · Y. Ahmadian · D. G. Ferreira · K. Rahnama Rad · M. Vidne
Department of Statistics and Center for Theoretical Neuroscience, Columbia University, New York, NY, USA
e-mail: [email protected]
URL: http://www.stat.columbia.edu/∼liam

S. Koyama
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA

J. Vogelstein
Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA

W. Wu
Department of Statistics, Florida State University, Tallahassee, FL, USA

setting at all; instead, the key property we are exploiting involves the bandedness of certain matrices. We close by discussing some applications of this more general point of view, including Markov chain Monte Carlo methods for neural decoding and efficient estimation of spatially-varying firing rates.

Keywords Neural coding · State-space models · Hidden Markov model · Tridiagonal matrix

1 Introduction; forward-backward methods for inference in state-space models

A wide variety of neuroscientific data analysis problems may be attacked fruitfully within the framework of hidden Markov (“state-space”) models. The basic idea is that the underlying system may be described as a stochastic dynamical process: a (possibly multidimensional) state variable q_t evolves through time according to some Markovian dynamics p(q_t|q_{t−1}, θ), as specified by a few model parameters θ. Now in many situations we do not observe the state variable q_t directly (this Markovian variable is “hidden”); instead, our observations y_t are a noisy, subsampled version of q_t, summarized by an observation distribution p(y_t|q_t).

Methods for performing optimal inference and estimation in these hidden Markov models are very well-developed in the statistics and engineering literature (Rabiner 1989; Durbin and Koopman 2001; Doucet et al. 2001). For example, to compute the conditional distribution p(q_t|Y_{1:T}) of the state variable q_t given all the observed data on the time interval (0, T], we need only apply two straightforward recursions: a forward recursion that computes the conditional distribution of q_t given only the observed data up to time t,

$$p(q_t \mid Y_{1:t}) \propto p(y_t \mid q_t) \int p(q_t \mid q_{t-1})\, p(q_{t-1} \mid Y_{1:t-1})\, dq_{t-1}, \quad \text{for } t = 1, 2, \ldots, T \qquad (1)$$

and then a backward recursion that computes the desired distribution p(q_t|Y_{1:T}),

$$p(q_t \mid Y_{1:T}) = p(q_t \mid Y_{1:t}) \int \frac{p(q_{t+1} \mid Y_{1:T})\, p(q_{t+1} \mid q_t)}{\int p(q_{t+1} \mid q_t)\, p(q_t \mid Y_{1:t})\, dq_t}\, dq_{t+1}, \quad \text{for } t = T-1, T-2, \ldots, 1, 0. \qquad (2)$$

Each of these recursions may be derived easily from the Markov structure of the state-space model. In the classical settings, where the state variable q is discrete (Rabiner 1989; Gat et al. 1997; Hawkes 2004; Jones et al. 2007; Kemere et al. 2008; Herbst et al. 2008; Escola and Paninski 2009), or the dynamics p(q_t|q_{t−1}) and observations p(y_t|q_t) are linear and Gaussian, these recursions may be computed exactly and efficiently: note that a full forward-backward sweep requires computation time which scales just linearly in the data length T, and is therefore quite tractable even for large T. In the linear-Gaussian case, this forward-backward recursion is known as the Kalman filter-smoother (Roweis and Ghahramani 1999; Durbin and Koopman 2001; Penny et al. 2005; Shumway and Stoffer 2006).
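In the discrete-state case, the recursions (1–2) can be implemented exactly in a few lines. The following Python/NumPy sketch is our own illustration (the function name, normalization scheme, and array layout are not from the paper):

```python
import numpy as np

def forward_backward(P, lik, pi0):
    """Exact forward-backward recursions for a discrete-state HMM.

    P[i, j]   = p(q_t = j | q_{t-1} = i)   (transition matrix)
    lik[t, j] = p(y_t | q_t = j)           (observation likelihoods)
    pi0[j]    = p(q_1 = j)                 (initial distribution)
    Returns alpha[t] = p(q_t | y_{1:t}) and gamma[t] = p(q_t | y_{1:T}).
    """
    T, K = lik.shape
    alpha = np.zeros((T, K))
    # forward recursion, Eq. (1): propagate through P, multiply by the
    # likelihood, and normalize
    a = pi0 * lik[0]
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ P) * lik[t]
        alpha[t] = a / a.sum()
    # backward recursion, Eq. (2): the integrals become finite sums
    gamma = np.zeros((T, K))
    gamma[-1] = alpha[-1]
    for t in range(T - 2, -1, -1):
        pred = alpha[t] @ P          # one-step prediction p(q_{t+1} | y_{1:t})
        gamma[t] = alpha[t] * (P @ (gamma[t + 1] / pred))
    return alpha, gamma
```

Note that the cost per time step is O(K²) for K states, so the full sweep is O(TK²), linear in T as stated above.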

Unfortunately, the integrals in Eqs. (1) and (2) are not analytically tractable in general; in particular, for neural applications we are interested in cases where the observations y_t are point processes (e.g., spike trains, or behavioral event times), and in this case the recursions must be solved approximately. One straightforward idea is to approximate the conditional distributions appearing in (1) and (2) as Gaussian; since we can compute Gaussian integrals analytically (as in the Kalman filter), this simple approximation provides a computationally tractable, natural extension of the Kalman filter to non-Gaussian observations. Many versions of this recursive Gaussian approximation idea (with varying degrees of accuracy versus computational expediency) have been introduced in the statistics and neuroscience literature (Fahrmeir and Kaufmann 1991; Fahrmeir and Tutz 1994; Bell 1994; Kitagawa and Gersch 1996; West and Harrison 1997; Julier and Uhlmann 1997; Brown et al. 1998; Smith and Brown 2003; Ypma and Heskes 2003; Eden et al. 2004; Yu et al. 2006).

These methods have proven extremely useful in a wide variety of neural applications. Recursive estimation methods are especially critical in online applications, where estimates must be updated in real time as new information is observed. For example, state-space techniques achieve state-of-the-art performance decoding multineuronal spike train data from motor cortex (Wu et al. 2006; Truccolo et al. 2005; Wu et al. 2009) and parietal cortex (Yu et al. 2006; Kemere et al. 2008), and these methods therefore hold great promise for the design of motor neuroprosthetic devices (Donoghue 2002). In this setting, the hidden variable q_t corresponds to the desired position of the subject’s hand, or a cursor on a computer screen, at time t; y_t is the vector of observed spikes at time t, binned at some predetermined temporal resolution; the conditional probability p(y_t|q_t) is given by an “encoding” model that describes how the position information q_t is represented in the spike trains y_t; and p(q_t|Y_{1:t+s}) is the desired fixed-lag decoding distribution, summarizing our knowledge about the current position q_t given all of the observed spike train data Y from time 1 up to t + s, where s is a short allowed time lag (on the order of 100 ms or so in motor prosthetic applications). In this setting, the conditional expectation E(q_t|Y_{1:t+s}) is typically used as the optimal (minimum mean-square) estimator for q_t, while the posterior covariance Cov(q_t|Y_{1:t+s}) quantifies our uncertainty about the position q_t, given the observed data; both of these quantities are computed most efficiently using the forward-backward recursions (1–2). These forward-backward methods can also easily incorporate target or endpoint goal information in these online decoding tasks (Srinivasan et al. 2006; Yu et al. 2007; Kulkarni and Paninski 2008; Wu et al. 2009).

State-space models have also been applied successfully to track nonstationary neuron tuning properties (Brown et al. 2001; Frank et al. 2002; Eden et al. 2004; Czanner et al. 2008; Rahnama et al. 2009). In this case, the hidden state variable q_t represents a parameter vector which determines the neuron’s stimulus-response function. Lewi et al. (2009) discusses an application of these recursive methods to perform optimal online experimental design, i.e., to choose the stimulus at time t which will give us as much information as possible about the observed neuron’s response properties, given all the observed stimulus-response data from time 1 to t.

A number of offline applications have appeared as well: state-space methods have been applied to perform optimal decoding of rat position given multiple hippocampal spike trains (Brown et al. 1998; Zhang et al. 1998; Eden et al. 2004), and to model behavioral learning experiments (Smith and Brown 2003; Smith et al. 2004, 2005; Suzuki and Brown 2005); in the latter case, q_t represents the subject’s certainty about the behavioral task, which is not directly observable and which changes systematically over the course of the experiment. In addition, we should note that the forward-backward idea is of fundamental importance in the setting of sequential Monte Carlo (“particle-filtering”) methods (Doucet et al. 2001; Brockwell et al. 2004; Kelly and Lee 2004; Godsill et al. 2004; Shoham et al. 2005; Ergun et al. 2007; Vogelstein et al. 2009; Huys and Paninski 2009), though we will not focus on these applications here.

However, the forward-backward approach is not always directly applicable. For example, in many cases the dynamics p(q_t|q_{t−1}) or observation density p(y_t|q_t) may be non-smooth (e.g., the state variable q may be constrained to be nonnegative, leading to a discontinuity in log p(q_t|q_{t−1}) at q_t = 0). In these cases the forward distribution p(q_t|Y_{1:t}) may be highly non-Gaussian, and the basic forward-backward Gaussian approximation methods described above may break down.¹ In this paper, we will review more general direct optimization methods for performing inference in state-space models. We discuss this approach in Section 2 below. This direct optimization approach also leads to more efficient methods for estimating the model parameters θ (Section 3). Finally, the state-space model turns out to be a special case of a richer, more general framework involving banded matrix computations, as we discuss at more length in Section 4.

2 A direct optimization approach for computing the maximum a posteriori path in state-space models

2.1 A direct optimization interpretation of the classical Kalman filter

We begin by taking another look at the classical Kalman filter-smoother (Durbin and Koopman 2001; Wu et al. 2006; Shumway and Stoffer 2006). The primary goal of the smoother is to compute the conditional expectation E(Q|Y) of the hidden state path Q given the observations Y. (Throughout this paper, we will use Q and Y to denote the full collection of the hidden state variables {q_t} and observations {y_t}, respectively.) Due to the linear-Gaussian structure of the Kalman

¹ It is worth noting that other more sophisticated methods such as expectation propagation (Minka 2001; Ypma and Heskes 2003; Yu et al. 2006, 2007; Koyama and Paninski 2009) may be better-equipped to handle these strongly non-Gaussian observation densities p(y_t|q_t) (and are, in turn, closely related to the optimization-based methods that are the focus of this paper); however, due to space constraints, we will not discuss these methods at length here.

model, (Q, Y) forms a jointly Gaussian random vector, and therefore p(Q|Y) is itself Gaussian. Since the mean and mode of a Gaussian distribution coincide, this implies that E(Q|Y) is equal to the maximum a posteriori (MAP) solution, the maximizer of the posterior p(Q|Y). If we write out the linear-Gaussian Kalman model more explicitly,

$$q_t = A q_{t-1} + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, C_q); \qquad q_1 \sim \mathcal{N}(\mu_1, C_1)$$

$$y_t = B q_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, C_y)$$

(where $\mathcal{N}(0, C)$ denotes the Gaussian density with mean 0 and covariance C), we can gain some insight into the analytical form of this maximizer:

$$\begin{aligned}
E(Q|Y) &= \arg\max_Q p(Q|Y) = \arg\max_Q \log p(Q, Y) \\
&= \arg\max_Q \left( \log p(q_1) + \sum_{t=2}^{T} \log p(q_t \mid q_{t-1}) + \sum_{t=1}^{T} \log p(y_t \mid q_t) \right) \\
&= \arg\max_Q \Bigg[ -\frac{1}{2} \Bigg( (q_1 - \mu_1)^T C_1^{-1} (q_1 - \mu_1) + \sum_{t=2}^{T} (q_t - A q_{t-1})^T C_q^{-1} (q_t - A q_{t-1}) \\
&\qquad\qquad\qquad + \sum_{t=1}^{T} (y_t - B q_t)^T C_y^{-1} (y_t - B q_t) \Bigg) \Bigg]. \qquad (3)
\end{aligned}$$

The right-hand-side here is a simple quadratic function of Q (as expected, since p(Q|Y) is Gaussian, i.e., log p(Q|Y) is quadratic), and therefore E(Q|Y) may be computed by solving an unconstrained quadratic program in Q; we thus obtain

$$\hat{Q} = \arg\max_Q \log p(Q|Y) = \arg\max_Q \left[ \frac{1}{2} Q^T H Q + \nabla^T Q \right] = -H^{-1}\nabla,$$

where we have abbreviated the Hessian and gradient of log p(Q|Y):

$$\nabla = \nabla_Q \log p(Q|Y)\big|_{Q=0}, \qquad H = \nabla\nabla_Q \log p(Q|Y)\big|_{Q=0}.$$

The next key point to note is that the Hessian matrix H is block-tridiagonal, since log p(Q|Y) is a sum of simple one-point potentials (log p(q_t) and log p(y_t|q_t)) and nearest-neighbor two-point potentials (log p(q_t, q_{t−1})). More explicitly, we may write

$$H = \begin{pmatrix}
D_1 & R_{1,2}^T & 0 & \cdots & 0 \\
R_{1,2} & D_2 & R_{2,3}^T & \ddots & \vdots \\
0 & R_{2,3} & D_3 & \ddots & 0 \\
\vdots & \ddots & \ddots & D_{N-1} & R_{N-1,N}^T \\
0 & \cdots & 0 & R_{N-1,N} & D_N
\end{pmatrix} \qquad (4)$$

where

$$D_i = \frac{\partial^2}{\partial q_i^2} \log p(y_i \mid q_i) + \frac{\partial^2}{\partial q_i^2} \log p(q_i \mid q_{i-1}) + \frac{\partial^2}{\partial q_i^2} \log p(q_{i+1} \mid q_i), \qquad (5)$$

and

$$R_{i,i+1} = \frac{\partial^2}{\partial q_i \partial q_{i+1}} \log p(q_{i+1} \mid q_i) \qquad (6)$$

for 1 < i < N. These quantities may be computed as simple functions of the Kalman model parameters; for example, R_{i,i+1} = C_q^{-1} A.

This block-tridiagonal form of H implies that the linear equation Q̂ = −H^{-1}∇ may be solved in O(T) time (e.g., by block-Gaussian elimination (Press et al. 1992); note that we never need to compute H^{-1} explicitly). Thus this matrix formulation of the Kalman smoother is equivalent both mathematically and in terms of computational complexity to the forward-backward method. In fact, the matrix formulation is often easier to implement; for example, if H is sparse and banded, the standard Matlab backslash command Q̂ = −H\∇ calls the O(T) algorithm automatically: Kalman smoothing in just one line of code.
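The same one-line idea carries over to other environments. The sketch below is our own illustration (not code from the paper): for a scalar linear-Gaussian model it assembles the tridiagonal precision matrix −H and the gradient ∇ from Eq. (3), and recovers the smoothed path with a single O(T) banded solve via SciPy.

```python
import numpy as np
from scipy.linalg import solveh_banded

def kalman_map_banded(y, a, b, cq, cy, mu1, c1):
    """MAP (= posterior mean) state path for the scalar linear-Gaussian model
        q_t = a q_{t-1} + eps_t,  eps_t ~ N(0, cq),   q_1 ~ N(mu1, c1),
        y_t = b q_t + eta_t,      eta_t ~ N(0, cy),
    obtained by solving the tridiagonal system (-H) Q = grad in O(T) time.
    """
    T = len(y)
    # diagonal of -H: one term per potential touching q_t, as in Eqs. (4-5)
    d = np.full(T, b**2 / cy)       # observation potentials
    d[0] += 1.0 / c1                # initial-state potential
    d[1:] += 1.0 / cq               # p(q_t | q_{t-1}) acting on q_t
    d[:-1] += a**2 / cq             # p(q_{t+1} | q_t) acting on q_t
    off = np.full(T - 1, -a / cq)   # superdiagonal of -H, cf. Eq. (6)
    grad = b * y / cy               # gradient of log p(Q|Y) at Q = 0
    grad[0] += mu1 / c1
    ab = np.zeros((2, T))           # banded storage expected by solveh_banded
    ab[0, 1:] = off
    ab[1, :] = d
    return solveh_banded(ab, grad)  # O(T) banded Cholesky solve
```

Since the model is linear-Gaussian, the returned MAP path coincides with E(Q|Y), the output of the forward-backward Kalman smoother.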

We should also note that a second key application of the Kalman filter is to compute the posterior state covariance Cov(q_t|Y) and also the nearest-neighbor second moments E(q_t q_{t+1}^T|Y); the posterior covariance is required for computing confidence intervals around the smoothed estimates E(q_t|Y), while the second moments E(q_t q_{t+1}^T|Y) are necessary to compute the sufficient statistics in the expectation-maximization (EM) algorithm for estimating the Kalman model parameters (see, e.g., Shumway and Stoffer 2006 for details). These quantities may easily be computed in O(T) time in the matrix formulation. For example, since the matrix H represents the inverse posterior covariance matrix of our Gaussian vector Q given Y, Cov(q_t|Y) is given by the (t, t)-th block of H^{-1}, and it is well-known that the diagonal and off-diagonal blocks of the inverse of a block-tridiagonal matrix can be computed in O(T) time; again, the full inverse H^{-1} (which requires O(T²) time in the block-tridiagonal case) is not required (Rybicki and Hummer 1991; Rybicki and Press 1995; Asif and Moura 2005).
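For intuition, here is one way the diagonal of the inverse of a (scalar-state) symmetric tridiagonal precision matrix can be obtained in O(T), using two Gaussian-elimination sweeps. This is a hedged sketch in the spirit of the cited O(T) algorithms, not the authors' code; the function name and interface are our own.

```python
import numpy as np

def tridiag_inverse_diagonal(d, off):
    """Diagonal of the inverse of a symmetric tridiagonal matrix M in O(T).

    d[t] is the diagonal of M and off[t] the entry at (t, t+1).  Running
    Gaussian elimination once forward and once backward gives pivot
    sequences f and g, and a Schur-complement argument yields
        (M^{-1})_{tt} = 1 / (f[t] + g[t] - d[t]).
    For the state-space models in the text, M = -H is the posterior
    precision, so this returns the posterior variances Var(q_t | Y).
    """
    T = len(d)
    f = np.empty(T)                  # forward elimination pivots
    f[0] = d[0]
    for t in range(1, T):
        f[t] = d[t] - off[t - 1] ** 2 / f[t - 1]
    g = np.empty(T)                  # backward elimination pivots
    g[-1] = d[-1]
    for t in range(T - 2, -1, -1):
        g[t] = d[t] - off[t] ** 2 / g[t + 1]
    return 1.0 / (f + g - d)
```

The identity follows because deleting row and column t disconnects the chain, so the Schur complement of entry (t, t) splits into a contribution from the left block (carried by f) and one from the right block (carried by g).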

2.2 Extending the direct optimization method to non-Gaussian models

From here it is straightforward to extend this approach to directly compute Q̂_MAP in non-Gaussian models of interest in neuroscience. In this paper we will focus on the case that log p(q_{t+1}|q_t) is a concave function of Q; in addition, we will assume that the initial density log p(q_0) is concave and also that the observation density log p(y_t|q_t) is concave in q_t. Then it is easy to see that the log-posterior

$$\log p(Q|Y) = \log p(q_0) + \sum_t \log p(y_t \mid q_t) + \sum_t \log p(q_{t+1} \mid q_t) + \text{const.}$$

is concave in Q, and therefore computing the MAP path Q̂ is a concave problem. Further, if log p(q_0), log p(y_t|q_t), and log p(q_{t+1}|q_t) are all smooth functions of Q, then we may apply standard approaches such as Newton’s algorithm to solve this concave optimization.

To apply Newton’s method here, we simply iteratively solve the linear equation²

$$\hat{Q}^{(i+1)} = \hat{Q}^{(i)} - H^{-1}\nabla,$$

where we have again abbreviated the Hessian and gradient of the objective function log p(Q|Y):

$$\nabla = \nabla_Q \log p(Q|Y)\big|_{Q=\hat{Q}^{(i)}}, \qquad H = \nabla\nabla_Q \log p(Q|Y)\big|_{Q=\hat{Q}^{(i)}}.$$

Clearly, the only difference between the general non-Gaussian case here and the special Kalman case

² In practice, the simple Newton iteration does not always increase the objective log p(Q|Y); we have found the standard remedy for this instability (perform a simple backtracking line-search along the Newton direction Q̂^{(i)} − δ^{(i)} H^{-1}∇ to determine a suitable stepsize δ^{(i)} ≤ 1) to be quite effective here.


described above is that the Hessian H and gradient ∇ must be recomputed at each iteration Q̂^{(i)}; in the Kalman case, again, log p(Q|Y) is a quadratic function, and therefore the Hessian H is constant, and one iteration of Newton’s method suffices to compute the optimizer Q̂.

In practice, this Newton algorithm converges within a few iterations for all of the applications discussed here. Thus we may compute the MAP path exactly using this direct method, in time comparable to that required to obtain the approximate MAP path computed by the recursive approximate smoothing algorithm discussed in Section 1. This close connection between the Kalman filter and the Newton-based computation of the MAP path in more general state-space models is well-known in the statistics and applied math literature (though apparently less so in the neuroscience literature). See Fahrmeir and Kaufmann (1991), Fahrmeir and Tutz (1994), Bell (1994), Davis and Rodriguez-Yam (2005), Jungbacker and Koopman (2007) for further discussion from a statistical point of view, and Koyama and Paninski (2009) for applications to the integrate-and-fire model for spiking data. In addition, Yu et al. (2007) previously applied a related direct optimization approach in the context of neural decoding (though note that the conjugate gradients approach utilized there requires O(T³) time if the banded structure of the Hessian is not exploited).
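To make the iteration concrete, the following sketch (our own illustration; all names and parameter choices are assumptions, not the authors' code) computes the MAP path for an AR(1) latent process observed through Poisson spike counts, y_t ~ Poisson(e^{q_t} dt), with each Newton step solved as an O(T) banded system and safeguarded by the backtracking line search of footnote 2.

```python
import numpy as np
from scipy.linalg import solveh_banded

def poisson_map_newton(y, dt, a, cq, mu1, c1, tol=1e-8, max_iter=50):
    """MAP path for q_t = a q_{t-1} + noise, with Poisson observations
    y_t ~ Poisson(exp(q_t) dt), via Newton's method with banded solves.
    """
    T = len(y)
    # prior precision (negative Hessian of the log-prior), tridiagonal
    d = np.zeros(T)
    d[0] += 1.0 / c1
    d[1:] += 1.0 / cq
    d[:-1] += a ** 2 / cq
    off = np.full(T - 1, -a / cq)
    cp = np.zeros(T)
    cp[0] = mu1 / c1

    def prior_prec_dot(q):          # tridiagonal matrix-vector product
        r = d * q
        r[:-1] += off * q[1:]
        r[1:] += off * q[:-1]
        return r

    def objective(q):               # log p(Q|Y) up to an additive constant
        return (-0.5 * q @ prior_prec_dot(q) + cp @ q
                + y @ q - dt * np.exp(q).sum())

    q = np.zeros(T)
    for _ in range(max_iter):
        rate = dt * np.exp(q)
        grad = -prior_prec_dot(q) + cp + y - rate
        if np.abs(grad).max() < tol:
            break
        ab = np.zeros((2, T))       # banded storage of -H = prior prec + diag(rate)
        ab[0, 1:] = off
        ab[1, :] = d + rate
        step = solveh_banded(ab, grad)
        # backtracking line search along the Newton direction (footnote 2)
        s, f0 = 1.0, objective(q)
        while objective(q + s * step) < f0 and s > 1e-10:
            s *= 0.5
        q = q + s * step
    return q
```

Only the diagonal of the Hessian changes between iterations here, since the Poisson observation term is a one-point potential; the tridiagonal structure, and hence the O(T) cost per iteration, is preserved throughout.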

2.3 Example: inferring common input effects in multineuronal spike train recordings

Recent developments in multi-electrode recording technology (Nicolelis et al. 2003; Litke et al. 2004) and fluorescence microscopy (Cossart et al. 2003; Ohki et al. 2005; Nikolenko et al. 2008) enable the simultaneous measurement of the spiking activity of many neurons. Analysis of such multineuronal data is one of the key challenges in computational neuroscience today (Brown et al. 2004), and a variety of models for these data have been introduced (Chornoboy et al. 1988; Utikal 1997; Martignon et al. 2000; Iyengar 2001; Schnitzer and Meister 2003; Paninski et al. 2004; Truccolo et al. 2005; Nykamp 2005; Schneidman et al. 2006; Shlens et al. 2006; Pillow et al. 2008; Shlens et al. 2009). Most of these models include stimulus-dependence terms and “direct coupling” terms representing the influence that the activity of an observed cell might have on the other recorded neurons. These coupling terms are often interpreted in terms of “functional connectivity” between the observed neurons; the major question now is how accurately this inferred functional connectivity actually reflects the true underlying anatomical connectivity in the circuit.

Fewer models, however, have attempted to include the effects of the population of neurons which are not directly observed during the experiment (Nykamp 2005, 2007; Kulkarni and Paninski 2007). Since we can directly record from only a small fraction of neurons in any physiological preparation, such unmeasured neurons might have a large collective impact on the dynamics and coding properties of the observed neural population, and may bias the inferred functional connectivity away from the true anatomical connectivity, complicating the interpretation of these multineuronal analyses. For example, while Pillow et al. (2008) found that neighboring parasol retinal ganglion cells (RGCs) in the macaque are functionally coupled (indeed, incorporating this functional connectivity in an optimal Bayesian decoder significantly amplifies the information we can extract about the visual stimulus from the observed spiking activity of large ensembles of RGCs), Khuc-Trong and Rieke (2008) recently demonstrated, via simultaneous pairwise intracellular recordings, that RGCs receive a significant amount of strongly correlated common input, with weak direct anatomical coupling between RGCs. Thus the strong functional connectivity observed in this circuit is in fact largely driven by common input, not direct anatomical connectivity.

Therefore it is natural to ask if it is possible to correctly infer the degree of common input versus direct coupling in partially-observed neuronal circuits, given only multineuronal spike train data (i.e., we do not want to rely on multiple simultaneous intracellular recordings, which are orders of magnitude more difficult to obtain than extracellular recordings). To this end, Kulkarni and Paninski (2007) introduced a state-space model in which the firing rates depend not only on the stimulus history and the spiking history of the observed neurons but also on common input effects (Fig. 1). In this model, the conditional firing intensity, λ_i(t), of the i-th observed neuron is:

$$\lambda_i(t) = \exp\left( k_i \cdot x(t) + h_i \cdot y_i(t) + \sum_{j \neq i} l_{ij} \cdot y_j(t) + \mu_i + q_i(t) \right), \qquad (7)$$

where x is the spatiotemporal visual stimulus, y_i is cell i’s own spike-train history, μ_i is the cell’s baseline log-firing rate, y_j are the spike-train histories of other cells

112 J Comput Neurosci (2010) 29:107–126

Fig. 1 Schematic illustration of the common-input model described by Eq. (7); adapted from Kulkarni and Paninski (2007)

at time t, k_i is the cell’s spatiotemporal stimulus filter, h_i is the post-spike temporal filter accounting for past spike dependencies within cell i, and l_{ij} are direct coupling temporal filters, which capture the dependence of cell i’s activity on the recent spiking of other cells j. The term q_i(t), the hidden common input at time t, is modeled as a Gauss-Markov autoregressive process, with some correlation between different cells i which we must infer from the data. In addition, we enforce a nonzero delay in the direct coupling terms, so that the effects of a spike in one neuron on other neurons are temporally strictly causal.

In statistical language, this common-input model is a multivariate version of a Cox process, also known as a doubly-stochastic point process (Cox 1955; Snyder and Miller 1991; Moeller et al. 1998); the state-space models applied in Smith and Brown (2003), Truccolo et al. (2005), Czanner et al. (2008) are mathematically very similar. See also Yu et al. (2006) for discussion of a related model in the context of motor planning and intention.

As an example of the direct optimization methods developed in the preceding subsection, we reanalyzed the data from Pillow et al. (2008) with this common input model (Vidne et al. 2009). We estimated the model parameters θ = (k_i, h_i, l_{ij}, μ_i) from the spiking data by maximum marginal likelihood, as described in Koyama and Paninski (2009) (see also Section 3 below, for a brief summary); the correlation time of Q was set to ∼5 ms, to be consistent with the results of Khuc-Trong and Rieke (2008). We found that this common-input model explained the observed cross-correlations quite well (data not shown), and the inferred direct-coupling weights were set to be relatively small (Fig. 2); in fact, the quality of the fits in our preliminary experiments is indistinguishable from those described in Pillow et al. (2008), where a model with strong direct-coupling terms and no common-input effects was used.

Given the estimated model parameters θ, we used the direct optimization method to estimate the subthreshold common input effects q(t), on a single-trial basis (Fig. 2). The observation likelihood p(y_t|q_t) here was given by the standard point-process likelihood (Snyder and Miller 1991):

$$\log p(Y|Q) = \sum_{i,t} \big( y_{it} \log \lambda_i(t) - \lambda_i(t)\, dt \big), \qquad (8)$$

where y_{it} denotes the number of spikes observed in time bin t from neuron i; dt denotes the temporal bin-width. We see in Fig. 2 that the inferred common input effect is strong relative to the direct coupling effects, in agreement with the intracellular recordings described in Khuc-Trong and Rieke (2008). We are currently working to quantify these common input effects q_i(t) inferred from the full observed RGC population, rather than just the pairwise analysis shown here, in order to investigate the relevance of this strong common input effect on the coding and correlation properties of the full network of parasol cells. See also Wu et al. (2009) for applications of similar common-input state-space models to improve decoding of population neural activity in motor cortex.
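Given binned spike counts and precomputed intensities, Eq. (8) is straightforward to evaluate; a minimal sketch (the function name and array layout, with cells indexing rows and time bins indexing columns, are our own conventions):

```python
import numpy as np

def point_process_loglik(Y, rates, dt):
    """Discrete-time point-process log likelihood of Eq. (8):
    the sum over cells i and bins t of  y_{it} log lambda_i(t) - lambda_i(t) dt.
    Y[i, t] holds spike counts; rates[i, t] holds the intensities lambda_i(t).
    """
    return float(np.sum(Y * np.log(rates) - rates * dt))
```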

2.4 Constrained optimization problems may be handled easily via the log-barrier method

So far we have assumed that the MAP path may be computed via an unconstrained smooth optimization. In many examples of interest we have to deal with constrained optimization problems instead. In particular, nonnegativity constraints arise frequently on physical grounds; as emphasized in the introduction, forward-backward methods based on Gaussian approximations for the forward distribution p(q_t|Y_{0:t}) typically do not accurately incorporate these constraints. To handle these constrained problems while exploiting the fast tridiagonal techniques discussed above, we can employ standard interior-point (aka “barrier”) methods (Boyd and Vandenberghe 2004; Koyama and Paninski 2009). The idea is to replace the constrained concave problem

$$\hat{Q}_{MAP} = \arg\max_{Q:\, q_t \geq 0} \log p(Q|Y)$$



Fig. 2 Single-trial inference of the relative contribution of common, stimulus, direct coupling, and self inputs in a pair of retinal ganglion ON cells (Vidne et al. 2009); data from (Pillow et al. 2008). Top panel: Inferred linear common input, Q̂: red trace shows a sample from the posterior distribution p(Q|Y), black trace shows the conditional expectation E(Q|Y), and shaded region indicates ±1 posterior standard deviation about E(Q|Y), computed from the diagonal of the inverse log-posterior Hessian H. 2nd panel: Direct coupling input from the other cell, l_j · y_j. (The first two panels are plotted on the same scale to facilitate comparison of the magnitudes of these effects.) Blue trace indicates cell 1; green indicates cell 2. 3rd panel: The stimulus input, k · x. 4th panel: Refractory input, h_i · y_i. Note that this term is strong but quite short-lived following each spike. All units are in log-firing rate, as in Eq. (7). Bottom: Observed paired spike trains Y on this single trial. Note the large magnitude of the estimated common input term q̂(t), relative to the direct coupling contribution l_j · y_j

with a sequence of unconstrained concave problems

$$\hat{Q}_\epsilon = \arg\max_Q \left[ \log p(Q|Y) + \epsilon \sum_t \log q_t \right];$$

clearly, Q̂_ε satisfies the nonnegativity constraint, since log u → −∞ as u → 0. (We have specialized to the nonnegative case for concreteness, but the idea may be generalized easily to any convex constraint set; see Boyd and Vandenberghe 2004 for details.) Furthermore, it is easy to show that if Q̂_MAP is unique, then Q̂_ε converges to Q̂_MAP as ε → 0.

Now the key point is that the Hessian of the objective function log p(Q|Y) + ε Σ_t log q_t retains the block-tridiagonal properties of the original objective log p(Q|Y), since the barrier term contributes a simple diagonal term to H. Therefore we may use the O(T) Newton iteration to obtain Q̂_ε, for any ε, and then sequentially decrease ε (in an outer loop) to obtain Q̂. Note that the approximation arg max_Q p(Q|Y) ≈ E(Q|Y) will typically not hold in this constrained case, since the mean of a truncated Gaussian distribution will typically not coincide with the mode (unless the mode is sufficiently far from the nonnegativity constraint).
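The outer/inner loop structure just described can be sketched as follows, here for a generic quadratic log-posterior with tridiagonal precision matrix (a stand-in for the models above; the interface and parameter defaults are our own assumptions, not the authors' code):

```python
import numpy as np
from scipy.linalg import solveh_banded

def nonneg_map_barrier(d, off, c, eps0=1.0, eps_min=1e-8):
    """Log-barrier (interior-point) method for a nonnegativity-constrained
    MAP path: maximizes -0.5 Q^T A Q + c^T Q subject to q_t >= 0, where A is
    a symmetric tridiagonal precision matrix (diagonal d, off-diagonal off).
    The barrier eps * sum(log q_t) adds only a diagonal term to the Hessian,
    so every inner Newton step remains a single O(T) banded solve.
    """
    T = len(d)

    def A_dot(x):                    # tridiagonal matrix-vector product
        r = d * x
        r[:-1] += off * x[1:]
        r[1:] += off * x[:-1]
        return r

    q = np.ones(T)                   # strictly feasible starting point
    eps = eps0
    while eps > eps_min:             # outer loop: tighten the barrier
        obj = lambda x: -0.5 * x @ A_dot(x) + c @ x + eps * np.log(x).sum()
        for _ in range(100):         # inner loop: damped Newton
            grad = -A_dot(q) + c + eps / q
            if np.abs(grad).max() < 1e-9:
                break
            ab = np.zeros((2, T))
            ab[0, 1:] = off
            ab[1, :] = d + eps / q ** 2   # -Hessian: still tridiagonal
            step = solveh_banded(ab, grad)
            s = 1.0
            # backtrack until the step stays strictly feasible and increases obj
            while np.any(q + s * step <= 0) or obj(q + s * step) < obj(q):
                s *= 0.5
                if s < 1e-12:
                    break
            q = q + s * step
        eps /= 10.0
    return q
```

Warm-starting each inner Newton solve at the previous Q̂_ε, as done implicitly here, is what makes the sequential decrease of ε cheap in practice.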

We give a few applications of this barrier approach below. See also Koyama and Paninski (2009) for a detailed discussion of an application to the integrate-and-fire model, and Vogelstein et al. (2008) for applications to the problem of efficiently deconvolving slow, noisy calcium fluorescence traces to obtain nonnegative estimates of spiking times. In addition, see Cunningham et al. (2008) for an application of the log-barrier method to infer firing rates given point process observations in a closely-related Gaussian process model (Rasmussen and Williams 2006); these authors considered a slightly more general class of covariance functions on the latent stochastic process q_t, but the computation time of the resulting method scales superlinearly³ with T.

2.4.1 Example: point process smoothing under Lipschitz or monotonicity constraints on the intensity function

A standard problem in neural data analysis is to smooth point process observations; that is, to estimate the underlying firing rate λ(t) given single or multiple observations of a spike train (Kass et al. 2003). One simple approach to this problem is to model the firing rate as λ(t) = f(qt), where f(.) is a convex, log-concave, monotonically increasing nonlinearity (Paninski 2004) and qt is an unobserved function of time we would like to estimate from data. Of course, if qt is an arbitrary function, we need to contend with overfitting effects; the "maximum likelihood" Q̂ here would simply set f(qt) to zero when no spikes are observed (by making −qt very large) and f(qt) to be very large when spikes are observed (by making qt very large).

A simple way to counteract this overfitting effect is to include a penalizing prior; for example, if we model qt as a linear-Gaussian autoregressive process

qt+dt = qt + εt,   εt ∼ N(0, σ²dt),

then computing Q̂MAP leads to a tridiagonal optimization, as discussed above. (The resulting model, again, is mathematically equivalent to those applied in Smith and Brown 2003, Truccolo et al. 2005, Kulkarni and Paninski 2007, Czanner et al. 2008, Vidne et al. 2009.) Here 1/σ² acts as a regularization parameter: if σ² is small, the inferred Q̂MAP will be very smooth (since large fluctuations are penalized by the Gaussian autoregressive prior), whereas if σ² is large, then the prior term will be weak and Q̂MAP will fit the observed data more closely.
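A minimal sketch of the resulting tridiagonal Newton iteration, for Poisson spike counts yt with λ(t) = exp(qt); the function name and the crude step-damping heuristic are our own choices, not taken from the references:

```python
import numpy as np
from scipy.linalg import solveh_banded

def smooth_spikes(y, dt=0.01, sig_q=1.0, n_newton=25):
    """MAP estimate of the rate exp(q_t) from spike counts y_t, under the
    random-walk prior q_{t+dt} = q_t + N(0, sig_q^2 dt); O(T) per iteration."""
    T = len(y)
    a = 1.0 / (sig_q ** 2 * dt)
    q = np.full(T, np.log(max(y.mean(), 1e-3) / dt))   # flat initialization
    for _ in range(n_newton):
        lam = np.exp(q) * dt                # expected counts per bin
        d = q[1:] - q[:-1]
        g = y - lam                         # gradient of Poisson loglikelihood
        g[1:] -= a * d                      # plus the prior's gradient
        g[:-1] += a * d
        diag = lam + 2.0 * a                # bands of -Hessian (tridiagonal)
        diag[0] -= a
        diag[-1] -= a
        off = np.full(T - 1, -a)
        step = solveh_banded(np.vstack([np.append(0.0, off), diag]), g)
        if np.max(np.abs(step)) > 2.0:      # crude damping to avoid overshoot
            step *= 2.0 / np.max(np.abs(step))
        q += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return np.exp(q)                        # estimated firing rate
```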

3 More precisely, Cunningham et al. (2008) introduce a clever iterative conjugate-gradient (CG) method to compute the MAP path in their model; this method requires O(T log T) time per CG step, with the number of CG steps increasing as a function of the number of observed spikes. (Note, in comparison, that the computation times of the state-space methods reviewed in the current work are insensitive to the number of observed spikes.)

A different method for regularizing Q was introduced by Coleman and Sarma (2007). The idea is to impose hard Lipschitz constraints on Q, instead of the soft quadratic penalties imposed in the Gaussian state-space setting: we assume

|qt − qs| < K|t − s|

for all (s, t), for some finite constant K. (If qt is a differentiable function of t, this is equivalent to the assumption that the maximum absolute value of the derivative of Q is bounded by K.) The space of all such Lipschitz Q is convex, and so optimizing the concave loglikelihood function under this convex constraint remains tractable. Coleman and Sarma (2007) presented a powerful method for solving this optimization problem (their solution involved a dual formulation of the problem and an application of specialized fast min-cut optimization methods). In this one-dimensional temporal smoothing case, we may solve this problem in a somewhat simpler way, without any loss of efficiency, using the tridiagonal log-barrier methods described above. We just need to rewrite the constrained problem

maxQ log p(Q|Y)   s.t.   |qt − qs| < K|t − s|  ∀ s, t

as the unconstrained problem

maxQ [ log p(Q|Y) + ∑t b( (qt − qt+dt) / dt ) ],

with dt some arbitrarily small constant and the hard barrier function b(.) defined as

b(u) = 0 if |u| < K,   b(u) = −∞ otherwise.

The resulting concave objective function is non-smooth, but may be optimized stably, again, via the log-barrier method, with efficient tridiagonal Newton updates. (In this case, the Hessian of the first term log p(Q|Y) with respect to Q is diagonal and the Hessian of the penalty term involving the barrier function is tridiagonal, since b(.) contributes a two-point potential here.) We recover the standard state-space approach if we replace the hard-threshold penalty function b(.) with a quadratic function; conversely, we may obtain sharper estimates of sudden changes in qt if we use a concave penalty b(.) which grows less steeply than a quadratic function (so as to not penalize large changes in qt as strongly), as discussed by Gao et al. (2002). Finally, it is interesting to note that we may also easily enforce monotonicity constraints on qt, by choosing the


penalty function b(u) to apply a barrier at u = 0; this is a form of isotonic regression (Silvapulle and Sen 2004), and is useful in cases where we believe that a cell's firing rate should be monotonically increasing or decreasing throughout the course of a behavioral trial, or as a function of the magnitude of an applied stimulus.
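In the Gaussian-likelihood special case, this monotone problem is exactly classical isotonic regression, which can also be solved in O(T) by the pool-adjacent-violators algorithm; a minimal unweighted sketch (the function name pava is ours):

```python
def pava(y):
    """Least-squares fit of a nondecreasing sequence to y, by pooling
    adjacent blocks of entries whenever they violate monotonicity."""
    means, counts = [], []
    for v in y:
        means.append(float(v))
        counts.append(1)
        # merge backwards while the last two block means are out of order
        while len(means) > 1 and means[-2] > means[-1]:
            m2, c2 = means.pop(), counts.pop()
            m1, c1 = means.pop(), counts.pop()
            c = c1 + c2
            means.append((m1 * c1 + m2 * c2) / c)  # pooled block mean
            counts.append(c)
    out = []
    for m, c in zip(means, counts):
        out.extend([m] * c)
    return out
```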

2.4.2 Example: inferring presynaptic inputs given postsynaptic voltage recordings

To further illustrate the flexibility of this method, let's look at a multidimensional example. Consider the problem of identifying the synaptic inputs a neuron is receiving: given voltage recordings at a postsynaptic site, is it possible to recover the time course of the presynaptic conductance inputs? This question has received a great deal of experimental and analytical attention (Borg-Graham et al. 1996; Peña and Konishi 2000; Wehr and Zador 2003; Priebe and Ferster 2005; Murphy and Rieke 2006; Huys et al. 2006; Wang et al. 2007; Xie et al. 2007; Paninski 2009), due to the importance of understanding the dynamic balance between excitation and inhibition underlying sensory information processing.

We may begin by writing down a simple state-space model for the evolution of the postsynaptic voltage and conductance:

Vt+dt = Vt + dt [ gL(VL − Vt) + gEt(VE − Vt) + gIt(VI − Vt) ] + εt
gEt+dt = gEt − (gEt/τE) dt + NEt
gIt+dt = gIt − (gIt/τI) dt + NIt.   (9)

Here gEt denotes the excitatory presynaptic conductance at time t, and gIt the inhibitory conductance; VL, VE, and VI denote the leak, excitatory, and inhibitory reversal potentials, respectively. Finally, εt denotes an unobserved i.i.d. current noise with a log-concave density, and NEt and NIt denote the presynaptic excitatory and inhibitory inputs (which must be nonnegative on physical grounds); we assume these inputs also have a log-concave density.

Assume Vt is observed noiselessly for simplicity. Then let our observed variable yt = Vt+dt − Vt and our state variable qt = (gEt, gIt)ᵀ. Now, since gIt and gEt are linear functions of NIt and NEt (for example, gIt is given by the convolution gIt = NIt ∗ exp(−t/τI)), the log-posterior may be written as

log p(Q|Y) = log p(Y|Q) + log p(NIt, NEt) + const.
           = log p(Y|Q) + ∑t=1…T log p(NEt) + ∑t=1…T log p(NIt) + const.,   NEt, NIt ≥ 0;

in the case of white Gaussian current noise εt with variance σ²dt, for example,4 we have

log p(Y|Q) = −(1/(2σ²dt)) ∑t=2…T [ Vt+dt − (Vt + dt(gL(VL − Vt) + gIt(VI − Vt) + gEt(VE − Vt))) ]² + const.;

this is a quadratic function of Q.

Now applying the O(T) log-barrier method is straightforward; the Hessian of the log-posterior log p(Q|Y) in this case is block-tridiagonal, with blocks of size two (since our state variable qt is two-dimensional). The observation term log p(Y|Q) contributes a block-diagonal term to the Hessian; in particular, each observation yt contributes a rank-1 matrix of size 2 × 2 to the t-th diagonal block of H. (The low-rank nature of this observation matrix reflects the fact that we are attempting to extract two variables (the excitatory and inhibitory conductances at each time step) given just a single voltage observation per time step.)

Some simulated results are shown in Fig. 3. We generated Poisson spike trains from both inhibitory and excitatory presynaptic neurons, then formed the postsynaptic current signal It by contaminating the summed synaptic and leak currents with white Gaussian noise as in Eq. (9), and then used the O(T) log-barrier method to simultaneously infer the presynaptic conductances

4 The approach here can be easily generalized to the case that the input noise has a nonzero correlation timescale. For example, if the noise can be modeled as an autoregressive process of order p instead of the white noise process described here, then we simply include the unobserved p-dimensional Markov noise process in our state variable (i.e., our Markov state variable qt will now have dimension p + 2 instead of 2), and then apply the O(T) log-barrier method to this augmented state space.



Fig. 3 Inferring presynaptic inputs given simulated postsynaptic voltage recordings. Top: true simulated conductance input (green indicates inhibitory conductance; blue excitatory). Middle: observed noisy current trace from which we will attempt to infer the input conductance. Bottom: conductance inferred by the nonnegative MAP technique. Note that the inferred conductance is shrunk in magnitude compared to the true conductance, due to the effects of the priors p(NEt) and p(NIt), both of which peak at zero here; shrinkage is more evident in the inferred inhibitory conductance, due to the smaller driving force (the holding potential in this experiment was −62 mV, which is quite close to the inhibitory reversal potential; as a result, the likelihood term is much weaker for the inhibitory conductance than for the excitatory term). Inference here required about one second on a laptop computer per second of data (i.e., real time), at a sampling rate of 1 kHz

from the observed current It. The current was recorded at 1 kHz (1 ms bins), and we reconstructed the presynaptic activity at the same time resolution. We see that the estimated Q̂ here does a good job extracting both excitatory and inhibitory synaptic activity given a single trace of observed somatic current; there is no need to average over multiple trials. It is worth emphasizing that we are inferring two presynaptic signals here given just one observed postsynaptic current signal, with limited "overfitting" artifacts; this is made possible by the sparse, nonnegatively constrained nature of the inferred presynaptic signals. For simplicity, we assumed that the membrane leak, noise variance, and synaptic time constants τE and τI were known here; we used exponential (sparsening) priors p(NEt) and p(NIt), but the results are relatively robust to the details of these priors (data not shown). See Huys et al. (2006), Huys and Paninski (2009), Paninski (2009) for further details and extensions, including methods for inferring the membrane parameters directly from the observed data.
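The generative model of Eq. (9) is easy to simulate directly. The sketch below uses Poisson counts for the nonnegative inputs NEt and NIt; all parameter values are illustrative guesses, not the settings used for Fig. 3:

```python
import numpy as np

def simulate_conductances(T=1000, dt=0.001, gL=50.0, VL=-70.0, VE=0.0,
                          VI=-80.0, tauE=0.005, tauI=0.01,
                          rateE=20.0, rateI=20.0, sig=0.5, seed=0):
    """Forward-simulate the voltage/conductance dynamics of Eq. (9)."""
    rng = np.random.default_rng(seed)
    V = np.empty(T)
    gE = np.zeros(T)
    gI = np.zeros(T)
    V[0] = VL
    NE = rng.poisson(rateE * dt, T).astype(float)   # nonnegative inputs N^E_t
    NI = rng.poisson(rateI * dt, T).astype(float)   # nonnegative inputs N^I_t
    for t in range(T - 1):
        drive = (gL * (VL - V[t]) + gE[t] * (VE - V[t])
                 + gI[t] * (VI - V[t]))
        V[t + 1] = (V[t] + dt * drive
                    + sig * np.sqrt(dt) * rng.standard_normal())
        gE[t + 1] = gE[t] - (gE[t] / tauE) * dt + NE[t]
        gI[t + 1] = gI[t] - (gI[t] / tauI) * dt + NI[t]
    return V, gE, gI
```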

3 Parameter estimation

In the previous sections we have discussed the inference of the hidden state path Q in the case that the system parameters are known. However, in most applications, we need to estimate the system parameters as well. As discussed in Kass et al. (2005), standard modern methods for parameter estimation are based on the likelihood p(Y|θ) of the observed data Y given the parameters θ. In the state-space setting, we need to compute the marginal likelihood by integrating out the hidden state path Q:5

p(Y|θ) = ∫ p(Q, Y|θ) dQ = ∫ p(Y|Q, θ) p(Q|θ) dQ.   (10)

5 In some cases, Q may be observed directly on some subset of training data. If this is the case (i.e., direct observations of qt are available together with the observed data Y), then the estimation problem simplifies drastically, since we can often fit the models p(yt|qt, θ) and p(qt|qt−1, θ) directly without making use of the more involved latent-variable methods discussed in this section.


This marginal likelihood is a unimodal function of the parameters θ in many important cases (Paninski 2005), making maximum likelihood estimation feasible in principle. However, this high-dimensional integral cannot be computed exactly in general, and so many approximate techniques have been developed, including Monte Carlo methods (Robert and Casella 2005; Davis and Rodriguez-Yam 2005; Jungbacker and Koopman 2007; Ahmadian et al. 2009a) and expectation propagation (Minka 2001; Ypma and Heskes 2003; Yu et al. 2006; Koyama and Paninski 2009).

In the neural applications reviewed in Section 1, the most common method for maximizing the loglikelihood is the approximate Expectation-Maximization (EM) algorithm introduced in Smith and Brown (2003), in which the required expectations are approximated using the recursive Gaussian forward-backward method. This EM algorithm can be readily modified to optimize an approximate log-posterior p(θ|Y), if a useful prior distribution p(θ) is available (Kulkarni and Paninski 2007). While the EM algorithm does not require us to compute the likelihood p(Y|θ) explicitly, we may read this likelihood off of the final forward density approximation p(qT, y1:T) by simply marginalizing out the final state variable qT. All of these computations are recursive, and may therefore be computed in O(T) time.

The direct global optimization approach discussed in the preceding section suggests a slightly different strategy. Instead of making T local Gaussian approximations recursively (once for each forward density p(qt, y1:t)), we make a single global Gaussian approximation for the full joint posterior:

log p(Q|Y, θ) ≈ −(T/2) log(2π) + (1/2) log |−Hθ| + (1/2)(Q − Q̂θ)ᵀ Hθ (Q − Q̂θ),   (11)

where the Hessian Hθ is defined as

Hθ = ∇∇Q [ log p(Q|θ) + log p(Y|Q, θ) ] evaluated at Q = Q̂θ;

note the implicit dependence on θ through

Q̂θ ≡ arg maxQ [ log p(Q|θ) + log p(Y|Q, θ) ].

Equation (11) corresponds to nothing more than a second-order Taylor expansion of log p(Q|Y, θ) about the optimizer Q̂θ.

Plugging this Gaussian approximation into Eq. (10), we obtain the standard "Laplace" approximation (Kass and Raftery 1995; Davis and Rodriguez-Yam 2005; Yu et al. 2007) for the marginal likelihood,

log p(Y|θ) ≈ log p(Y|Q̂θ, θ) + log p(Q̂θ|θ) − (1/2) log |−Hθ| + const.   (12)

Clearly the first two terms here can be computed in O(T), since we have already demonstrated that we can obtain Q̂θ in O(T) time, and evaluating log p(Y|Q, θ) + log p(Q|θ) at Q = Q̂θ is relatively easy. We may also compute log |−Hθ| stably and in O(T) time, via the Cholesky decomposition for banded matrices (Davis and Rodriguez-Yam 2005; Koyama and Paninski 2009). In fact, we can go further: Koyama and Paninski (2009) show how to compute the gradient of Eq. (12) with respect to θ in O(T) time, which makes direct optimization of this approximate likelihood feasible via conjugate gradient methods.6
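For a tridiagonal H, the log-determinant can be computed via a banded Cholesky factorization; a sketch operating on the bands of −H (which is positive definite when the log-posterior is strictly concave; the function name is ours):

```python
import numpy as np
from scipy.linalg import cholesky_banded

def logdet_banded(diag, off):
    """Compute log|-H| for a tridiagonal, negative-definite H, given the main
    diagonal and superdiagonal of -H, in O(T) via banded Cholesky -H = U^T U."""
    ab = np.vstack([np.append(0.0, off), diag])  # SciPy's upper banded storage
    U = cholesky_banded(ab)                      # factor, same banded layout
    return 2.0 * np.sum(np.log(U[1]))            # U[1] holds diag(U)
```

Both the factorization and the sum touch only O(T) entries, in contrast to the O(T³) cost of a dense determinant.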

It is natural to compare this direct optimization approach to the EM method; again, see Koyama and Paninski (2009) for details on the connections between these two approaches. It is well-known that EM can converge slowly, especially in cases in which the so-called "ratio of missing information" is large; see Dempster et al. (1977), Meng and Rubin (1991), Salakhutdinov et al. (2003), Olsson et al. (2007) for details. In practice, we have found that direct gradient ascent of expression (12) is significantly more efficient than the EM approach in the models discussed in this paper; for example, we used the direct approach to perform parameter estimation in the retinal example discussed above in Section 2.3. One important advantage of the direct ascent approach is that in some special cases, the optimization of (12) can be performed in a single step (as opposed to multiple EM steps). We illustrate this idea with a simple example below.

3.1 Example: detecting the location of a synapse given noisy, intermittent voltage observations

Imagine we make noisy observations from a dendritic tree (for example, via voltage-sensitive imaging methods (Djurisic et al. 2008)) which is receiving synaptic inputs from another neuron. We do not know the strength

6 It is worth mentioning the work of Cunningham et al. (2008) again here; these authors introduced conjugate gradient methods for optimizing the marginal likelihood in their model. However, their methods require computation time scaling superlinearly with the number of observed spikes (and therefore superlinearly with T, assuming that the number of observed spikes is roughly proportional to T).


or location of these inputs, but we do have complete access to the spike times of the presynaptic neuron (for example, we may be stimulating the presynaptic neuron electrically or via photo-uncaging of glutamate near the presynaptic cell (Araya et al. 2006; Nikolenko et al. 2008)). How can we determine if there is a synapse between the two cells, and if so, how strong the synapse is and where it is located on the dendritic tree?

To model this experiment, we assume that the neuron is in a stable, subthreshold regime, i.e., the spatiotemporal voltage dynamics are adequately approximated by the linear cable equation

Vt+dt = Vt + dt (A Vt + θUt) + εt,   εt ∼ N(0, CV dt).   (13)

Here the dynamics matrix A includes both leak terms and intercompartmental coupling effects: for example, in the special case of a linear dendrite segment with N compartments, with constant leak conductance g and intercompartmental conductance a, A is given by the tridiagonal matrix

A = −gI + aD²,

with D² denoting the second-difference operator. For simplicity, we assume that Ut is a known signal:

Ut = h(t) ∗ ∑i δ(t − ti);

h(t) is a known post-synaptic current (PSC) shape (e.g., an α-function (Koch 1999)), ∗ denotes convolution, and ∑i δ(t − ti) denotes the presynaptic spike train. The weight vector θ is the unknown parameter we want to infer: θi is the synaptic weight at the i-th dendritic compartment. Thus, to summarize, we have assumed that each synapse fires deterministically, with a known PSC shape (only the magnitude is unknown) at a known latency, with no synaptic depression or facilitation. (All of these assumptions may be relaxed significantly (Paninski and Ferreira 2008).)

Now we would like to estimate the synaptic weight vector θ, given U and noisy observations of the spatiotemporal voltage V. V is not observed directly here, and therefore plays the role of our hidden variable Q. For concreteness, we further assume that the observations Y are approximately linear and Gaussian:

yt = B Vt + ηt,   ηt ∼ N(0, Cy).

In this case the voltage V and observed data Y are jointly Gaussian given U and the parameter θ, and furthermore V depends on θ linearly, so estimating θ can be seen as a rather standard linear-Gaussian estimation problem. There are many ways to solve this problem: we could, for example, use EM to alternately estimate V given θ and Y, then estimate θ given our estimated V, alternating these maximizations until convergence. However, a more efficient approach (and one which generalizes nicely to the nonlinear case (Koyama and Paninski 2009)) is to optimize Eq. (12) directly. Note that this Laplace approximation is in fact exact in this case, since the posterior p(Y|θ) is Gaussian. Furthermore, the log-determinant term log |−Hθ| is constant in θ (since the Hessian is constant in this Gaussian model), and so we can drop this term from the optimization. Thus we are left with

arg maxθ log p(Y|θ)
  = arg maxθ { log p(Y|V̂θ, θ) + log p(V̂θ|θ) }
  = arg maxθ { log p(Y|V̂θ) + log p(V̂θ|θ) }
  = arg maxθ maxV { log p(Y|V) + log p(V|θ) };   (14)

i.e., optimization of the marginal likelihood p(Y|θ) reduces here to joint optimization of the function log p(Y|V) + log p(V|θ) in (V, θ). Since this function is jointly quadratic and negative semidefinite in (V, θ), we need only apply a single step of Newton's method. Now if we examine the Hessian H of this objective function, we see that it is of block form:

H = ∇∇(V,θ) [ log p(Y|V) + log p(V|θ) ] = ( HVV  HθVᵀ ; HθV  Hθθ ),

where the HVV block is itself block-tridiagonal, with T blocks of size n (where n is the number of compartments in our observed dendrite) and the Hθθ block is of size n × n. If we apply the Schur complement to this block H and then exploit the fast methods for solving linear equations involving HVV, it is easy to see that solving for θ̂ can be done in a single O(T) step; see Koyama and Paninski (2009) for details.
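The Schur-complement step can be sketched for a generic block system with a tridiagonal upper-left block; the function name and test values are our own, and we use a small dense lower-right block for simplicity:

```python
import numpy as np
from scipy.linalg import solveh_banded

def schur_solve(A_bands, B, C, gv, gt):
    """Solve [[A, B], [B.T, C]] [v; theta] = [gv; gt], where A is symmetric
    positive definite and tridiagonal (SciPy banded storage). Eliminating v
    via the Schur complement S = C - B.T A^{-1} B needs only banded solves
    with A, each costing O(T)."""
    AinvB = solveh_banded(A_bands, B)     # A^{-1} B, one banded solve/column
    Ainvgv = solveh_banded(A_bands, gv)   # A^{-1} gv
    S = C - B.T @ AinvB                   # small (n x n) Schur complement
    theta = np.linalg.solve(S, gt - B.T @ Ainvgv)
    v = Ainvgv - AinvB @ theta            # back-substitute for v
    return v, theta
```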

Figure 4 shows a simulated example of this inferred θ in which U(t) is chosen to be two-dimensional (corresponding to inputs from two presynaptic cells, one excitatory and one inhibitory); given only half a second of intermittently-sampled, noisy data, the posterior p(θ|Y) is quite accurately concentrated about the true underlying value of θ.



Fig. 4 Estimating the location and strength of synapses on a simulated dendritic branch. Left: simulation schematic. By observing a noisy, subsampled spatiotemporal voltage signal on the dendritic tree, we can infer the strength of a given presynaptic cell's inputs at each location on the postsynaptic cell's dendritic tree. Right: illustration of the method applied to simulated data (Paninski and Ferreira 2008). We simulated two presynaptic inputs: one excitatory (red) and one inhibitory (blue). The (known) presynaptic spike trains are illustrated in the top panel, convolved with exponential filters of τ = 3 ms (excitatory) and τ = 2 ms (inhibitory) to form Ut. Second panel: true (unobserved) voltage (mV), generated from the cable Eq. (13). Note that each presynaptic spike leads to a post-synaptic potential of rather small magnitude (at most ≈ 1 mV), relative to the voltage noise level. In this case the excitatory presynaptic neuron synapses twice on the neuron, with one synapse on compartment 12 and a smaller synapse on compartment 8, while the inhibitory neuron synapses on compartment 3. Third panel: observed (raster-scanned) voltage. The true voltage was not observed directly; instead, we only observed a noisy, spatially-rastered (linescanned), subsampled version of this signal. Note the very low effective SNR here. Bottom panel: true and inferred synaptic weights. The true weight of each synapse is indicated by an asterisk (*) and the errorbar shows the posterior mean E(θi|Y) and standard deviation √Var(θi|Y) of the synaptic weight given the observed data. Note that inference is quite accurate, despite the noisiness and the brief duration (500 ms) of the data

The form of the joint optimization in Eq. (14) sheds a good deal of light on the relatively slow behavior of the EM algorithm here. The E step here involves inferring V given Y and θ; in this Gaussian model, the mean and mode of the posterior p(V|θ, Y) are equal, and so we see that

V̂ = arg maxV log p(V|θ, Y) = arg maxV { log p(Y|V) + log p(V|θ) }.

Similarly, in the M-step, we compute arg maxθ log p(V̂|θ). So we see that EM here is simply coordinate ascent on the objective function in Eq. (14): the E step ascends in the V direction, and the M step ascends in the θ direction. (In fact, it is well-known that EM may be written as a coordinate ascent algorithm much more generally; see Neal and Hinton (1999) for details.) Coordinate ascent requires more steps than Newton's method in general, due to the "zigzag" nature of the solution path in the coordinate ascent algorithm (Press et al. 1992), and this is exactly our experience in practice.7

7 An additional technical advantage of the direct optimization approach is worth noting here: to compute the E step via the Kalman filter, we need to specify some initial condition for p(V(0)). When we have no good information about the initial V(0), we can use "diffuse" initial conditions, and set the initial covariance Cov(V(0)) to be large (even infinite) in some or all directions in the n-dimensional V(0)-space. A crude way of handling this is to simply set the initial covariance in these directions to be very large (instead of infinite), though this can lead to numerical instability. A more rigorous approach is to take limits of the update equations as the uncertainty becomes large, and keep separate track of the infinite and non-infinite terms appropriately; see Durbin and Koopman (2001) for details. At any rate, these technical difficulties are avoided in the direct optimization approach, which can handle infinite prior covariance easily (this just corresponds to a zero term in the Hessian of the log-posterior).


As emphasized above, the linear-Gaussian case we have treated here is special, because the Hessian H is constant, and therefore the log |H| term in Eq. (12) can be neglected. However, in some cases we can apply a similar method even when the observations are non-Gaussian; see Koyama and Paninski (2009), Vidne et al. (2009) for examples and further details.

4 Generalizing the state-space method: exploiting banded and sparse Hessian matrices

Above we have discussed a variety of applications of the direct O(T) optimization idea. It is natural to ask whether we can generalize this method usefully beyond the state-space setting. Let's look more closely at the assumptions we have been exploiting so far. We have restricted our attention to problems where the posterior p(Q|Y) is log-concave, with a Hessian H such that the solution of the linear equation HQ = ∇ can be computed much more quickly than the standard O(dim(Q)³) required by a generic linear equation. In this section we discuss a couple of examples that do not fit gracefully in the state-space framework, but where nonetheless we can solve HQ = ∇ quickly, and therefore very efficient inference methods are available.

4.1 Example: banded matrices and fast optimal stimulus decoding

The neural decoding problem is a fundamental question in computational neuroscience (Rieke et al. 1997): given the observed spike trains of a population of cells whose responses are related to the state of some behaviorally-relevant signal x(t), how can we estimate, or "decode," x(t)? Solving this problem experimentally is of basic importance both for our understanding of neural coding (Pillow et al. 2008) and for the design of neural prosthetic devices (Donoghue 2002). Accordingly, a rather large literature now exists on developing and applying decoding methods to spike train data, both in single cell- and population recordings; see Pillow et al. (2009), Ahmadian et al. (2009a) for a recent review.

We focus our attention here on a specific example. Let's assume that the stimulus x(t) is one-dimensional, with a jointly log-concave prior p(X), and that the Hessian of this log-prior is banded at every point X. Let's also assume that the observed neural population whose responses we are attempting to decode may be well-modeled by the generalized linear model framework applied in Pillow et al. (2008):

λi(t) = exp( ki ∗ x(t) + hi · yi + ∑j≠i lij · yj + μi ).

Here ∗ denotes the temporal convolution of the filter ki against the stimulus x(t). This model is equivalent to Eq. (7), but we have dropped the common-input term q(t) for simplicity here.

Now it is easy to see that the loglikelihood log p(Y|X) is concave, with a banded Hessian, with bandwidth equal to the length of the longest stimulus filter ki (Pillow et al. 2009). Therefore, Newton's method applied to the log-posterior log p(X|Y) requires just O(T) time, and optimal Bayesian decoding here runs in time comparable to standard linear decoding (Warland et al. 1997). Thus we see that this stimulus decoding problem is a rather straightforward extension of the methods we have discussed above: instead of dealing with block-tridiagonal Hessians, we are now simply exploiting the slightly more general case of a banded Hessian. See Fig. 5.
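The bandedness claim is easy to verify numerically: for an exponential nonlinearity, the Hessian of the loglikelihood with respect to the stimulus is −Kᵀ diag(λ dt) K, where K is the banded convolution matrix built from the filter. A small sketch (helper names are ours; a single cell and exponential nonlinearity assumed):

```python
import numpy as np

def glm_stim_hessian(x, k, dt=0.001):
    """Hessian of sum_t [y_t (k*x)_t - exp((k*x)_t) dt] with respect to the
    stimulus x: equals -K^T diag(lambda*dt) K for the causal convolution
    matrix K[t, s] = k[t - s]."""
    T, m = len(x), len(k)
    K = np.zeros((T, T))
    for t in range(T):
        for j in range(min(m, t + 1)):
            K[t, t - j] = k[j]
    lam = np.exp(K @ x)                    # conditional intensity
    return -K.T @ np.diag(lam * dt) @ K

def bandwidth(H, tol=1e-12):
    """Largest |i - j| with H[i, j] significantly nonzero."""
    idx = np.argwhere(np.abs(H) > tol)
    return int(np.max(np.abs(idx[:, 0] - idx[:, 1]))) if len(idx) else 0
```

For a filter of length m, the product Kᵀ diag(·) K has bandwidth m − 1, so Newton's method can use banded solvers throughout.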

The ability to quickly decode the input stimulus x(t) leads to some interesting applications. For example, we can perform perturbation analyses with the decoder: by jittering the observed spike times (or adding or removing spikes) and quantifying how sensitive the decoded stimulus is to these perturbations, we may gain insight into the importance of precise spike timing and correlations in this multineuronal spike train data (Ahmadian et al. 2009b). Such analyses would be formidably slow with a less computationally-efficient decoder.

See Ahmadian et al. (2009a) for further applications to fully-Bayesian Markov chain Monte Carlo (MCMC) decoding methods; bandedness can be exploited quite fruitfully for the design of fast preconditioned Langevin-type algorithms (Robert and Casella 2005). It is also worth noting that very similar banded matrix computations arise naturally in spline models (Wahba 1990; Green and Silverman 1994), which are at the heart of the powerful BARS method for neural smoothing (DiMatteo et al. 2001; Kass et al. 2003); see Ahmadian et al. (2009a) for a discussion of how to exploit bandedness in the BARS context.

4.2 Smoothness regularization and fast estimation of spatial tuning fields

The applications discussed above all involve state-space models which evolve through time. However, these ideas are also quite useful in the context of spatial models (Moeller and Waagepetersen 2004). Imagine we would like to estimate some two-dimensional rate function from point process observations. Rahnama et al. (2009) discuss a number of distinct cases of this problem, including the estimation of place fields in hippocampus (Brown et al. 1998) or of tuning functions in motor cortex (Paninski et al. 2004); for concreteness, we focus here on the setting considered in Czanner et al. (2008). These authors analyzed repeated observations of a spike train whose mean rate function changed gradually from trial to trial; the goal of the analysis here is to infer the firing rate λ(t, i), where t denotes the time within a trial and i denotes the trial number.

One convenient approach to this problem is to model λ(t, i) as

λ(t, i) = f[q(t, i)],

and then to discretize (t, i) into two-dimensional bins in which q(t, i) may be estimated by maximum likelihood in each bin. However, as emphasized in Kass et al. (2003), Smith and Brown (2003), Kass et al. (2005), Czanner et al. (2008), this crude histogram approach can lead to highly variable, unreliable estimates of the firing rate λ(t, i) if the histogram bins are taken to be too small; conversely, if the bins are too large, then we may overcoarsen our estimates and lose valuable information about rapid changes in the firing rate as a function of time t or trial number i.

A better approach is to use a fine binwidth to estimate λ, but to regularize our estimate so that our inferred λ̂ is not overly noisy. One simple approach is to compute a penalized maximum likelihood estimate

Q̂ = arg maxQ [ log p(Y|Q) − c1 ∑it [q(t, i) − q(t − dt, i)]² − c2 ∑it [q(t, i) − q(t, i − 1)]² ];   (15)

the observation likelihood p(Y|Q) is given by the standard point-process loglikelihood

log p(Y|Q) = ∑it ( yit log f[q(t, i)] − f[q(t, i)] dt ),

where dt denotes the temporal binwidth; yit here denotes the number of spikes observed in time bin t during trial i (c.f. Eq. (8)). The constants c1 and c2 serve as regularizer weights: if c1 is large, then we penalize strongly for fluctuations in q(t, i) along the t-axis, whereas conversely c2 sets the smoothness along the i-axis.

Fig. 5 MAP decoding of the spatio-temporal stimulus ki ∗ x(t) from the simultaneously recorded spike trains of three pairs of ON and OFF retinal ganglion cells, again from Pillow et al. (2008). The top six panels show the true input ki ∗ x(t) to each cell (jagged black line; the filters ki were estimated by maximum likelihood from a distinct training set here), and the decoded MAP estimate (smooth curve) ±1 posterior s.d. (gray area). The MAP estimate is quite accurate and is computed in O(T) time, where T is the stimulus duration. In this example, fully-Bayesian Markov chain Monte Carlo decoding produced nearly identical results; see Ahmadian et al. (2009a) for details


Fig. 6 An example of the fast spatial estimation method applied to data from Czanner et al. (2008), Fig. 3. Top panel: observed spike train data. Note that the firing rate qualitatively changes both as a function of time t and trial number i; 50 trials total are observed here. Bottom: λ(t, i) estimated using the fast regularized methods described in Rahnama et al. (2009). See Rahnama et al. (2009) for further comparisons, e.g. to linear kernel smoothing methods

[Figure panels: raster plot (trials vs. time, 10 ms bins) and estimated rate (Hz) over the same axes.]

This quadratic penalty has a natural Bayesian interpretation: Q̂ is the MAP estimate under a "smoothing" Gaussian prior of the form

$$\log p(Q) = -c_1 \sum_{it} \left[q(t,i)-q(t-dt,i)\right]^2 - c_2 \sum_{it} \left[q(t,i)-q(t,i-1)\right]^2 + \text{const.} \tag{16}$$

(Note that this specific Gaussian prior is improper, since the quadratic form in Eq. (16) does not have full rank—the sums in log p(Q) evaluate to zero for any constant Q—and the prior is therefore not integrable. This can be corrected by adding an additional penalty term, but in practice, given enough data, the posterior is always integrable and therefore this improper prior does not pose any serious difficulty.)
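The precision matrix of this smoothing prior can be written explicitly as a sum of Kronecker products of one-dimensional difference operators. The sketch below (illustrative Python, not the authors' code; it adopts the stacking convention that the columns of q(t, i) are appended into one long vector) also makes the rank deficiency concrete: the precision annihilates constant vectors.

```python
import numpy as np
import scipy.sparse as sp

def diff_op(n):
    """(n-1) x n first-difference operator: (D x)_k = x_{k+1} - x_k."""
    e = np.ones(n - 1)
    return sp.diags([-e, e], [0, 1], shape=(n - 1, n))

def prior_precision(T, I, c1, c2):
    """Precision Lam of the improper smoothing prior in Eq. (16), so that
    log p(Q) = -vec(Q)' Lam vec(Q) + const, where vec(Q) appends the
    columns (trials) of the T x I matrix q(t, i)."""
    Dt, Di = diff_op(T), diff_op(I)
    return (c1 * sp.kron(sp.eye(I), Dt.T @ Dt)      # penalty along the t-axis
            + c2 * sp.kron(Di.T @ Di, sp.eye(T))    # penalty along the i-axis
            ).tocsc()
```

Since both Kronecker terms kill constant vectors, Lam @ ones(T*I) is identically zero, which is exactly the non-integrability noted in the text.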

Now the key problem is to compute Q̂ efficiently. We proceed as before and simply apply Newton's method to optimize Eq. (15). If we represent the unknown Q as a long vector formed by appending the columns of the matrix q(t, i), then the Hessian with respect to Q still has a block-tridiagonal form, but now the size of the blocks scales with the number of trials observed, and so direct Gaussian elimination scales more slowly than O(N), where N is the dimensionality (i.e., the total number of pixels) of q(t, i). Nonetheless, efficient methods have been developed to handle this type of sparse, banded matrix (which arises, for example, in applied physics applications requiring discretized implementations of Laplace's or Poisson's equation); for example, in this case Matlab's built-in H\∇ code computes the solution to Hx = ∇ in O(N^{3/2}) time, which makes estimation of spatial tuning functions with N ∼ 10^4 easily feasible (on the order of seconds on a laptop). See Fig. 6 for an example of an estimated two-dimensional firing rate λ(t, i), and Rahnama et al. (2009) for further details.
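A minimal illustration of this Newton iteration (our sketch, not the authors' implementation; it assumes f = exp and substitutes scipy's sparse solver for Matlab's H\∇ backslash call):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def diff_op(n):
    """(n-1) x n first-difference operator."""
    e = np.ones(n - 1)
    return sp.diags([-e, e], [0, 1], shape=(n - 1, n))

def map_rate_surface(Y, dt, c1, c2, n_newton=20):
    """Newton's method for the penalized MLE of Eq. (15), assuming f = exp.
    Y is a (T, I) array of spike counts; Q is stacked column-wise, so the
    Hessian is sparse and banded and each Newton step costs one sparse
    linear solve -- the analogue of Matlab's H\\grad."""
    T, I = Y.shape
    Dt, Di = diff_op(T), diff_op(I)
    # Hessian of the quadratic penalty (factor 2 from differentiating squares)
    Pen = (2 * c1 * sp.kron(sp.eye(I), Dt.T @ Dt)
           + 2 * c2 * sp.kron(Di.T @ Di, sp.eye(T))).tocsc()
    q = np.zeros(T * I)                       # vec(Q), columns appended
    y = Y.flatten(order="F")
    for _ in range(n_newton):
        rate = np.exp(q) * dt                 # f[q(t,i)] dt per bin
        grad = y - rate - Pen @ q             # gradient of Eq. (15)
        H = (-sp.diags(rate) - Pen).tocsc()   # negative definite Hessian
        q -= spsolve(H, grad)                 # Newton step: q <- q - H \ grad
    return q.reshape(T, I, order="F")
```

Here spsolve is used purely for illustration; a sparse Cholesky factorization with a good fill-reducing ordering is what achieves the O(N^{3/2}) scaling cited above for these Poisson-equation-like systems.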

5 Conclusion

Since the groundbreaking work of Brown et al. (1998), state-space methods have been recognized as a key paradigm in neuroscience data analysis. These methods have been particularly dominant in the context of online decoding analyses (Brown et al. 1998; Brockwell et al. 2004; Truccolo et al. 2005; Shoham et al. 2005; Wu et al. 2006; Srinivasan et al. 2006; Kulkarni and Paninski 2008) and in the analysis of plasticity and nonstationary tuning properties (Brown et al. 2001; Frank et al. 2002; Eden et al. 2004; Smith et al. 2004; Czanner et al. 2008; Lewi et al. 2009; Rahnama et al. 2009), where the need for statistically rigorous and computationally efficient methods for tracking a dynamic "moving target" given noisy, indirect spike train observations has been particularly acute.

The direct optimization viewpoint discussed here (and previously in Fahrmeir and Kaufmann 1991, Fahrmeir and Tutz 1994, Bell 1994, Davis and Rodriguez-Yam 2005, Jungbacker and Koopman 2007, Koyama and Paninski 2009) opens up a number of additional interesting applications in neuroscience, and has a couple of advantages, as we have emphasized. Pragmatically, this method is perhaps a bit conceptually simpler and easier to code, thanks to the efficient sparse matrix methods built into Matlab and other modern numerical software packages. The joint optimization approach makes the important extension to problems of constrained optimization quite transparent, as we saw in Section 2.4. We also saw that the direct techniques outlined in Section 3 can provide a much more efficient algorithm for parameter estimation than the standard Expectation-Maximization strategy. In addition, the direct optimization approach makes the connections to other standard statistical methods (spline smoothing, penalized maximum likelihood, isotonic regression, etc.) quite clear, and can also serve as a quick initialization for more computationally-intensive methods that might require fewer model assumptions (e.g., on the concavity of the log-likelihoods p(y_t|q_t)). Finally, this direct optimization setting may be generalized significantly: we have mentioned extensions of the basic idea to constrained problems, MCMC methods, and spatial smoothing applications, all of which amply illustrate the flexibility of this approach. We anticipate that these state-space techniques will continue to develop in the near future, with widespread and diverse applications to the analysis of neural data.

Acknowledgements We thank J. Pillow for sharing the data used in Figs. 2 and 5, G. Czanner for sharing the data used in Fig. 6, and B. Babadi and Q. Huys for many helpful discussions. LP is supported by NIH grant R01 EY018003, an NSF CAREER award, and a McKnight Scholar award; YA by a Patterson Trust Postdoctoral Fellowship; DGF by the Gulbenkian PhD Program in Computational Biology, Fundacao para a Ciencia e Tecnologia PhD Grant ref. SFRH/BD/33202/2007; SK by NIH grants R01 MH064537, R01 EB005847 and R01 NS050256; JV by NIDCD DC00109.

References

Ahmadian, Y., Pillow, J., & Paninski, L. (2009a). Efficient Markov chain Monte Carlo methods for decoding population spike trains. Neural Computation (under review).

Ahmadian, Y., Pillow, J., Shlens, J., Chichilnisky, E., Simoncelli, E., & Paninski, L. (2009b). A decoder-based spike train metric for analyzing the neural code in the retina. COSYNE09.

Araya, R., Jiang, J., Eisenthal, K. B., & Yuste, R. (2006). The spine neck filters membrane potentials. PNAS, 103(47), 17961–17966.

Asif, A., & Moura, J. (2005). Block matrices with L-block banded inverse: Inversion algorithms. IEEE Transactions on Signal Processing, 53, 630–642.

Bell, B. M. (1994). The iterated Kalman smoother as a Gauss–Newton method. SIAM Journal on Optimization, 4, 626–636.

Borg-Graham, L., Monier, C., & Fregnac, Y. (1996). Voltage-clamp measurements of visually-evoked conductances with whole-cell patch recordings in primary visual cortex. Journal of Physiology (Paris), 90, 185–188.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Oxford: Oxford University Press.

Brockwell, A., Rojas, A., & Kass, R. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 1899–1907.

Brown, E., Frank, L., Tang, D., Quirk, M., & Wilson, M. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411–7425.

Brown, E., Kass, R., & Mitra, P. (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nature Neuroscience, 7, 456–461.

Brown, E., Nguyen, D., Frank, L., Wilson, M., & Solo, V. (2001). An analysis of neural receptive field plasticity by point process adaptive filtering. PNAS, 98, 12261–12266.

Chornoboy, E., Schramm, L., & Karr, A. (1988). Maximum likelihood identification of neural point process systems. Biological Cybernetics, 59, 265–275.

Coleman, T., & Sarma, S. (2007). A computationally efficient method for modeling neural spiking activity with point processes nonparametrically. IEEE Conference on Decision and Control.

Cossart, R., Aronov, D., & Yuste, R. (2003). Attractor dynamics of network up states in the neocortex. Nature, 423, 283–288.

Cox, D. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society, Series B, 17, 129–164.

Cunningham, J. P., Shenoy, K. V., & Sahani, M. (2008). Fast Gaussian process methods for point process intensity estimation. ICML, 192–199.

Czanner, G., Eden, U., Wirth, S., Yanike, M., Suzuki, W., & Brown, E. (2008). Analysis of between-trial and within-trial neural spiking dynamics. Journal of Neurophysiology, 99, 2672–2693.

Davis, R., & Rodriguez-Yam, G. (2005). Estimation for state-space models: An approximate likelihood approach. Statistica Sinica, 15, 381–406.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

DiMatteo, I., Genovese, C., & Kass, R. (2001). Bayesian curve fitting with free-knot splines. Biometrika, 88, 1055–1073.

Djurisic, M., Popovic, M., Carnevale, N., & Zecevic, D. (2008). Functional structure of the mitral cell dendritic tuft in the rat olfactory bulb. Journal of Neuroscience, 28(15), 4057–4068.

Donoghue, J. (2002). Connecting cortex to machines: Recent advances in brain interfaces. Nature Neuroscience, 5, 1085–1088.

Doucet, A., de Freitas, N., & Gordon, N. (Eds.) (2001). Sequential Monte Carlo in practice. New York: Springer.

Durbin, J., & Koopman, S. (2001). Time series analysis by state space methods. Oxford: Oxford University Press.

Eden, U. T., Frank, L. M., Barbieri, R., Solo, V., & Brown, E. N. (2004). Dynamic analyses of neural encoding by point process adaptive filtering. Neural Computation, 16, 971–998.

Ergun, A., Barbieri, R., Eden, U., Wilson, M., & Brown, E. (2007). Construction of point process adaptive filter algorithms for neural systems using sequential Monte Carlo methods. IEEE Transactions on Biomedical Engineering, 54, 419–428.

Escola, S., & Paninski, L. (2009). Hidden Markov models applied toward the inference of neural states and the improved estimation of linear receptive fields. Neural Computation (under review).

Fahrmeir, L., & Kaufmann, H. (1991). On Kalman filtering, posterior mode estimation and Fisher scoring in dynamic exponential family regression. Metrika, 38, 37–60.

Fahrmeir, L., & Tutz, G. (1994). Multivariate statistical modelling based on generalized linear models. New York: Springer.

Frank, L., Eden, U., Solo, V., Wilson, M., & Brown, E. (2002). Contrasting patterns of receptive field plasticity in the hippocampus and the entorhinal cortex: An adaptive filtering approach. Journal of Neuroscience, 22(9), 3817–3830.


Gao, Y., Black, M., Bienenstock, E., Shoham, S., & Donoghue, J. (2002). Probabilistic inference of arm motion from neural activity in motor cortex. NIPS, 14, 221–228.

Gat, I., Tishby, N., & Abeles, M. (1997). Hidden Markov modeling of simultaneously recorded cells in the associative cortex of behaving monkeys. Network: Computation in Neural Systems, 8, 297–322.

Godsill, S., Doucet, A., & West, M. (2004). Monte Carlo smoothing for non-linear time series. Journal of the American Statistical Association, 99, 156–168.

Green, P., & Silverman, B. (1994). Nonparametric regression and generalized linear models. Boca Raton: CRC.

Hawkes, A. (2004). Stochastic modelling of single ion channels. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach (pp. 131–158). Boca Raton: CRC.

Herbst, J. A., Gammeter, S., Ferrero, D., & Hahnloser, R. H. (2008). Spike sorting with hidden Markov models. Journal of Neuroscience Methods, 174(1), 126–134.

Huys, Q., Ahrens, M., & Paninski, L. (2006). Efficient estimation of detailed single-neuron models. Journal of Neurophysiology, 96, 872–890.

Huys, Q., & Paninski, L. (2009). Model-based smoothing of, and parameter estimation from, noisy biophysical recordings. PLOS Computational Biology, 5, e1000379.

Iyengar, S. (2001). The analysis of multiple neural spike trains. In Advances in methodological and applied aspects of probability and statistics (pp. 507–524). New York: Gordon and Breach.

Jones, L. M., Fontanini, A., Sadacca, B. F., Miller, P., & Katz, D. B. (2007). Natural stimuli evoke dynamic sequences of states in sensory cortical ensembles. Proceedings of the National Academy of Sciences, 104, 18772–18777.

Julier, S., & Uhlmann, J. (1997). A new extension of the Kalman filter to nonlinear systems. In Int. Symp. Aerospace/Defense Sensing, Simul. and Controls. Orlando, FL.

Jungbacker, B., & Koopman, S. (2007). Monte Carlo estimation for nonlinear non-Gaussian state space models. Biometrika, 94, 827–839.

Kass, R., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.

Kass, R., Ventura, V., & Cai, C. (2003). Statistical smoothing of neuronal data. Network: Computation in Neural Systems, 14, 5–15.

Kass, R. E., Ventura, V., & Brown, E. N. (2005). Statistical issues in the analysis of neuronal data. Journal of Neurophysiology, 94, 8–25.

Kelly, R., & Lee, T. (2004). Decoding V1 neuronal activity using particle filtering with Volterra kernels. Advances in Neural Information Processing Systems, 15, 1359–1366.

Kemere, C., Santhanam, G., Yu, B. M., Afshar, A., Ryu, S. I., Meng, T. H., et al. (2008). Detecting neural-state transitions using hidden Markov models for motor cortical prostheses. Journal of Neurophysiology, 100, 2441–2452.

Khuc-Trong, P., & Rieke, F. (2008). Origin of correlated activity between parasol retinal ganglion cells. Nature Neuroscience, 11, 1343–1351.

Kitagawa, G., & Gersch, W. (1996). Smoothness priors analysis of time series. Lecture notes in statistics (Vol. 116). New York: Springer.

Koch, C. (1999). Biophysics of computation. Oxford: Oxford University Press.

Koyama, S., & Paninski, L. (2009). Efficient computation of the maximum a posteriori path and parameter estimation in integrate-and-fire and more general state-space models. Journal of Computational Neuroscience. doi:10.1007/s10827-009-0150-x.

Kulkarni, J., & Paninski, L. (2007). Common-input models for multiple neural spike-train data. Network: Computation in Neural Systems, 18, 375–407.

Kulkarni, J., & Paninski, L. (2008). Efficient analytic computational methods for state-space decoding of goal-directed movements. IEEE Signal Processing Magazine, 25 (special issue on brain-computer interfaces), 78–86.

Lewi, J., Butera, R., & Paninski, L. (2009). Sequential optimal design of neurophysiology experiments. Neural Computation, 21, 619–687.

Litke, A., Bezayiff, N., Chichilnisky, E., Cunningham, W., Dabrowski, W., Grillo, A., et al. (2004). What does the eye tell the brain? Development of a system for the large scale recording of retinal output activity. IEEE Transactions on Nuclear Science, 1434–1440.

Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neuro-statistics of cell assemblies. Neural Computation, 12, 2621–2653.

Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86(416), 899–909.

Minka, T. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT.

Moeller, J., Syversveen, A., & Waagepetersen, R. (1998). Log-Gaussian Cox processes. Scandinavian Journal of Statistics, 25, 451–482.

Moeller, J., & Waagepetersen, R. (2004). Statistical inference and simulation for spatial point processes. London: Chapman Hall.

Murphy, G., & Rieke, F. (2006). Network variability limits stimulus-evoked spike timing precision in retinal ganglion cells. Neuron, 52, 511–524.

Neal, R., & Hinton, G. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan (Ed.), Learning in graphical models (pp. 355–368). Cambridge: MIT.

Nicolelis, M., Dimitrov, D., Carmena, J., Crist, R., Lehew, G., Kralik, J., et al. (2003). Chronic, multisite, multielectrode recordings in macaque monkeys. PNAS, 100, 11041–11046.

Nikolenko, V., Watson, B., Araya, R., Woodruff, A., Peterka, D., & Yuste, R. (2008). SLM microscopy: Scanless two-photon imaging and photostimulation using spatial light modulators. Frontiers in Neural Circuits, 2, 5.

Nykamp, D. (2005). Revealing pairwise coupling in linear-nonlinear networks. SIAM Journal on Applied Mathematics, 65, 2005–2032.

Nykamp, D. (2007). A mathematical framework for inferring connectivity in probabilistic neuronal networks. Mathematical Biosciences, 205, 204–251.

Ohki, K., Chung, S., Ch'ng, Y., Kara, P., & Reid, C. (2005). Functional imaging with cellular resolution reveals precise micro-architecture in visual cortex. Nature, 433, 597–603.

Olsson, R. K., Petersen, K. B., & Lehn-Schioler, T. (2007). State-space models: From the EM algorithm to a gradient approach. Neural Computation, 19, 1097–1111.

Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15, 243–262.

Paninski, L. (2005). Log-concavity results on Gaussian process methods for supervised and unsupervised learning. Advances in Neural Information Processing Systems, 17.


Paninski, L. (2009). Inferring synaptic inputs given a noisy voltage trace via sequential Monte Carlo methods. Journal of Computational Neuroscience (under review).

Paninski, L., Fellows, M., Shoham, S., Hatsopoulos, N., & Donoghue, J. (2004). Superlinear population encoding of dynamic hand trajectory in primary motor cortex. Journal of Neuroscience, 24, 8551–8561.

Paninski, L., & Ferreira, D. (2008). State-space methods for inferring synaptic inputs and weights. COSYNE.

Peña, J.-L., & Konishi, M. (2000). Cellular mechanisms for resolving phase ambiguity in the owl's inferior colliculus. Proceedings of the National Academy of Sciences of the United States of America, 97, 11787–11792.

Penny, W., Ghahramani, Z., & Friston, K. (2005). Bilinear dynamical systems. Philosophical Transactions of the Royal Society of London, 360, 983–993.

Pillow, J., Ahmadian, Y., & Paninski, L. (2009). Model-based decoding, information estimation, and change-point detection in multi-neuron spike trains. Neural Computation (under review).

Pillow, J., Shlens, J., Paninski, L., Sher, A., Litke, A., Chichilnisky, E., et al. (2008). Spatiotemporal correlations and visual signaling in a complete neuronal population. Nature, 454, 995–999.

Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.

Priebe, N., & Ferster, D. (2005). Direction selectivity of excitation and inhibition in simple cells of the cat primary visual cortex. Neuron, 45, 133–145.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.

Rahnama Rad, K., & Paninski, L. (2009). Efficient estimation of two-dimensional firing rate surfaces via Gaussian process methods. Network (under review).

Rasmussen, C., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge: MIT.

Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge: MIT.

Robert, C., & Casella, G. (2005). Monte Carlo statistical methods. New York: Springer.

Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11, 305–345.

Rybicki, G., & Hummer, D. (1991). An accelerated lambda iteration method for multilevel radiative transfer, Appendix B: Fast solution for the diagonal elements of the inverse of a tridiagonal matrix. Astronomy and Astrophysics, 245, 171.

Rybicki, G. B., & Press, W. H. (1995). Class of fast methods for processing irregularly sampled or otherwise inhomogeneous one-dimensional data. Physical Review Letters, 74(7), 1060–1063.

Salakhutdinov, R., Roweis, S. T., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. International Conference on Machine Learning, 20, 672–679.

Schneidman, E., Berry, M., Segev, R., & Bialek, W. (2006). Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440, 1007–1012.

Schnitzer, M., & Meister, M. (2003). Multineuronal firing patterns in the signal from eye to brain. Neuron, 37, 499–511.

Shlens, J., Field, G. D., Gauthier, J. L., Grivich, M. I., Petrusca, D., Sher, A., et al. (2006). The structure of multi-neuron firing patterns in primate retina. Journal of Neuroscience, 26, 8254–8266.

Shlens, J., Field, G. D., Gauthier, J. L., Greschner, M., Sher, A., Litke, A. M., et al. (2009). The structure of large-scale synchronized firing in primate retina. Journal of Neuroscience, 29, 5022–5031.

Shoham, S., Paninski, L., Fellows, M., Hatsopoulos, N., Donoghue, J., & Normann, R. (2005). Optimal decoding for a primary motor cortical brain-computer interface. IEEE Transactions on Biomedical Engineering, 52, 1312–1322.

Shumway, R., & Stoffer, D. (2006). Time series analysis and its applications. New York: Springer.

Silvapulle, M., & Sen, P. (2004). Constrained statistical inference: Inequality, order, and shape restrictions. New York: Wiley-Interscience.

Smith, A., & Brown, E. (2003). Estimating a state-space model from point process observations. Neural Computation, 15, 965–991.

Smith, A. C., Frank, L. M., Wirth, S., Yanike, M., Hu, D., Kubota, Y., et al. (2004). Dynamic analysis of learning in behavioral experiments. Journal of Neuroscience, 24(2), 447–461.

Smith, A. C., Stefani, M. R., Moghaddam, B., & Brown, E. N. (2005). Analysis and design of behavioral experiments to characterize population learning. Journal of Neurophysiology, 93(3), 1776–1792.

Snyder, D., & Miller, M. (1991). Random point processes in time and space. New York: Springer.

Srinivasan, L., Eden, U., Willsky, A., & Brown, E. (2006). A state-space analysis for reconstruction of goal-directed movements using neural signals. Neural Computation, 18, 2465–2494.

Suzuki, W. A., & Brown, E. N. (2005). Behavioral and neurophysiological analyses of dynamic learning processes. Behavioral & Cognitive Neuroscience Reviews, 4(2), 67–95.

Truccolo, W., Eden, U., Fellows, M., Donoghue, J., & Brown, E. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble and extrinsic covariate effects. Journal of Neurophysiology, 93, 1074–1089.

Utikal, K. (1997). A new method for detecting neural interconnectivity. Biological Cybernetics, 76, 459–470.

Vidne, M., Kulkarni, J., Ahmadian, Y., Pillow, J., Shlens, J., Chichilnisky, E., et al. (2009). Inferring functional connectivity in an ensemble of retinal ganglion cells sharing a common input. COSYNE.

Vogelstein, J., Babadi, B., Watson, B., Yuste, R., & Paninski, L. (2008). Fast nonnegative deconvolution via tridiagonal interior-point methods, applied to calcium fluorescence data. Statistical analysis of neural data (SAND) conference.

Vogelstein, J., Watson, B., Packer, A., Jedynak, B., Yuste, R., & Paninski, L. (2009). Model-based optimal inference of spike times and calcium dynamics given noisy and intermittent calcium-fluorescence imaging. Biophysical Journal. http://www.stat.columbia.edu/liam/research/abstracts/vogelsteinbj08-abs.html.

Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Wang, X., Wei, Y., Vaingankar, V., Wang, Q., Koepsell, K., Sommer, F., & Hirsch, J. (2007). Feedforward excitation and inhibition evoke dual modes of firing in the cat's visual thalamus during naturalistic viewing. Neuron, 55, 465–478.

Warland, D., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350.


Wehr, M., & Zador, A. (2003). Balanced inhibition underlies tuning and sharpens spike timing in auditory cortex. Nature, 426, 442–446.

West, M., & Harrison, P. (1997). Bayesian forecasting and dynamic models. New York: Springer.

Wu, W., Gao, Y., Bienenstock, E., Donoghue, J. P., & Black, M. J. (2006). Bayesian population coding of motor cortical activity using a Kalman filter. Neural Computation, 18, 80–118.

Wu, W., Kulkarni, J., Hatsopoulos, N., & Paninski, L. (2009). Neural decoding of goal-directed movements using a linear state-space model with hidden states. IEEE Transactions on Biomedical Engineering (in press).

Xie, R., Gittelman, J. X., & Pollak, G. D. (2007). Rethinking tuning: In vivo whole-cell recordings of the inferior colliculus in awake bats. Journal of Neuroscience, 27(35), 9469–9481.

Ypma, A., & Heskes, T. (2003). Iterated extended Kalman smoothing with expectation-propagation. Neural Networks for Signal Processing, 2003, 219–228.

Yu, B., Afshar, A., Santhanam, G., Ryu, S., Shenoy, K., & Sahani, M. (2006). Extracting dynamical structure embedded in neural activity. NIPS.

Yu, B. M., Cunningham, J. P., Shenoy, K. V., & Sahani, M. (2007). Neural decoding of movements: From linear to nonlinear trajectory models. ICONIP, 586–595.

Yu, B. M., Kemere, C., Santhanam, G., Afshar, A., Ryu, S. I., Meng, T. H., et al. (2007). Mixture of trajectory models for neural decoding of goal-directed movements. Journal of Neurophysiology, 97(5), 3763–3780.

Yu, B. M., Shenoy, K. V., & Sahani, M. (2006). Expectation propagation for inference in non-linear dynamical models with Poisson observations. In Proceedings of the nonlinear statistical signal processing workshop (pp. 83–86). Piscataway: IEEE.

Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.