
The Ising decoder: reading out the activity of large neural ensembles

Michael T Schaub and Simon R Schultz

Department of Bioengineering, Imperial College London, South Kensington, London SW7 2AZ, UK. [email protected]

May 24, 2011

Abstract

The Ising Model has recently received much attention for the statistical description of neural spike train data. In this paper, we propose and demonstrate its use for building decoders capable of predicting, on a millisecond timescale, the stimulus represented by a pattern of neural activity. After fitting to a training dataset, the Ising decoder can be applied “online” for instantaneous decoding of test data. While such models can be fit exactly using Boltzmann learning, this approach rapidly becomes computationally intractable as neural ensemble size increases. We show that several approaches, including the Thouless-Anderson-Palmer (TAP) mean field approach from statistical physics, and the recently developed Minimum Probability Flow Learning (MPFL) algorithm, can be used for rapid inference of model parameters in large-scale neural ensembles. Use of the Ising model for decoding, unlike other problems such as functional connectivity estimation, requires estimation of the partition function. As this involves summation over all possible responses, this step can be limiting. Mean field approaches avoid this problem by providing an analytical expression for the partition function. We demonstrate these decoding techniques by applying them to simulated neural ensemble responses from a mouse visual cortex model, finding an improvement in decoder performance for a model with heterogeneous as opposed to homogeneous neural tuning and response properties. Our results demonstrate the practicality of using the Ising model to read out, or decode, spatial patterns of activity comprised of many hundreds of neurons.

1 Introduction

Interpreting the patterns of activity fired by populations of neurons is one of the central challenges of modern systems neuroscience. The design of decoding algorithms capable of millisecond-by-millisecond readout of sensory or behavioural correlates of neuronal activity patterns would be a valuable step in this direction. Such decoding algorithms, as well as helping us to understand the neural code, may have further practical application, as the basis of communication neural prostheses for severely disabled patients such as those with “Locked In” syndrome.

At the heart of such a decoding algorithm must lie - whether explicit or implicit - a description of the conditional probability distribution of activity patterns given stimuli or behaviours. Making this description is nontrivial, as the brain, like other biological systems, exhibits enormous complexity. This results in a very large number of possible states or configurations exhibited by the system, making the description of such systems by simply measuring the probabilities of each state unfeasible. Except for very small patterns, a model-based approach of some kind is essential.

New technologies in neuroscience such as high-density multi-electrode array recording and multi-photon calcium imaging now make it possible to monitor the activity of large numbers of neurons simultaneously. Analysis tools for such high dimensional data have however lagged behind the experimental technology, as most approaches are limited to very small population sizes. While considerable advances have been made in the use of information-theoretic approaches to characterise the statistical structure of small neural ensembles (Gawne et al. 1996, Panzeri, Schultz, Treves & Rolls 1999, Schultz & Panzeri 2001, Panzeri & Schultz 2001, Reich et al. 2001, Petersen et al. 2001, Pola et al. 2003, Montani et al. 2007), finite sampling limitations have made results for larger ensembles much more difficult to obtain.

For the statistical description of multivariate neural spike train data, parametric models able to capture most of the interesting features of real data while still being of empirically accessible dimensionality are highly desirable. One promising approach has emerged from statistical mechanics: the use of Ising (or Ising-like) models, exploiting an analogy between populations of spike trains and ensembles of interacting magnetic spins (Shlens et al. 2006, Shlens et al. 2009, Schneidman et al. 2006).

Our aim here is to devise an algorithm for “millisecond-by-millisecond” neural decoding, on the basis that information processing in the nervous system appears to make use of such fine temporal scales (Carr 1993, Bair & Koch 1996). The timescale of the “symbols” used in information processing is thus likely to be somewhere between 1 and 20 ms for most purposes (Butts et al. 2007). For time bins on this scale, neural spike trains are effectively binarized, and the simplest binary model (in the maximum entropy sense) that captures pairwise correlations is the Ising model. The Ising model is thus a natural way to describe the statistics of neural spike patterns at the timescale of interest. Fitting such a model to the observed neural data has the advantage that it does not implicitly assume some non-measured structure in the data, i.e. maximum entropy models express the most uncertainty about the modelled data given the (explicit) chosen constraints (e.g. that certain moments of the measured distribution agree with the model distribution) (Jaynes 1957). It can be shown that this is mathematically equivalent to maximizing the likelihood of the model parameters to explain the observed data (Berger et al. 1996). By using this approach to fit a model to the conditional activity pattern distribution, in conjunction with maximum a posteriori decoding (Foldiak 1993, Oram et al. 1998), it is possible to train a decoder which takes as its input a pattern of spiking activity, and gives as its output the stimulus that it determines to have elicited that spike pattern.

A major obstacle to the use of Ising models for neural decoding is that, in general, it is necessary to compute a partition function (or normalization factor), involving a sum over all possible states. This can be numerically challenging, and for large numbers of neurons, unfeasible. In the present study, we adopted several approaches for circumventing this problem. Firstly, we make use of mean field approximations, including both the ‘naive’ mean field approximation and the Thouless et al. (1977) (TAP) extension to it, following Roudi, Tyrcha & Hertz (2009). Secondly, we compare this with the recently proposed Minimum Probability Flow method (Sohl-Dickstein et al. 2009) for learning model parameters. To assess the relative performance of these approaches in the context of a discrete decoding problem, we simulated the activity of a population of neurons in layer V of the mouse visual cortex during an experiment in which a discrete set of orientation stimuli were presented. Using this simulation, we evaluated the relative performance characteristics of the different decoding algorithms in the face of limited data, exploring decoding regimes with up to 1000 neurons. We demonstrate, for the first time, the use of the Ising model to effectively decode discrete stimulus states from large-scale simulated neural population activity.

2 Methods

2.1 Ising models of neural spike trains

Activity states in an Ising model are Boltzmann distributed, i.e. they are distributed according to the negative exponential of the “energy” associated with each state. This distribution, $P \propto e^{-\sum_\mu \lambda_\mu f_\mu}$, is the maximum entropy distribution subject to the set of constraints imposed by Lagrange multipliers $\lambda_\mu$ on variables $f_\mu$. Imposing these constraints upon firing rates and pairwise correlations gives

$$p_{\mathrm{Ising}}(\mathbf{r}|s) = \frac{1}{Z(s)} \exp\left( \sum_i h_i(s)\, r_i + \frac{1}{2} \sum_{i \neq j} J_{ij}(s)\, r_i r_j \right), \qquad (1)$$


where $\mathbf{r} = (r_1, r_2, \ldots, r_C)^T$ and each binary response variable $r_i \in \{0, 1\}$ indicates the firing/not firing of neuron $i$ in the observed time interval. The parameters $h_i$ (known in statistical physics as ‘external fields’) and $J_{ij}$ (‘pairwise couplings’) have to be fit to the data such that the model displays the same means and pairwise correlations as the data:

$$\langle r_i \rangle_{\mathrm{Ising}} = \langle r_i \rangle_{\mathrm{Data}}, \qquad (2a)$$
$$\langle r_i r_j \rangle_{\mathrm{Ising}} = \langle r_i r_j \rangle_{\mathrm{Data}}, \qquad (2b)$$

where $\langle \cdot \rangle_{\mathrm{model}}$ denotes expectation with respect to the specified distribution. $Z(s)$ is the partition function, which acts as a normalisation factor, i.e.:

$$Z(s) = \sum_{\mathbf{r} \in \mathcal{R}} \exp\left( \sum_i h_i(s)\, r_i + \frac{1}{2} \sum_{i \neq j} J_{ij}(s)\, r_i r_j \right). \qquad (3)$$

Note that the first sum is over all possible (as opposed to observed) responses, given by the set $\mathcal{R}$.
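Since the decoder must evaluate $Z(s)$ explicitly, it is worth seeing concretely how this step scales. The following minimal Python sketch (our own illustration, not the authors' code; function names are hypothetical) evaluates Equation 3 by brute-force enumeration, which is feasible only for small $C$ since the state space has $2^C$ members:

    import itertools
    import numpy as np

    def ising_log_partition(h, J):
        """Brute-force log Z for the {0,1} Ising model of Equation 3.
        h: fields (length C); J: symmetric couplings with zero diagonal.
        Feasible only for small C: the state space has 2**C members."""
        exponents = []
        for r in itertools.product([0, 1], repeat=len(h)):
            r = np.array(r, dtype=float)
            exponents.append(h @ r + 0.5 * r @ J @ r)
        exponents = np.array(exponents)
        m = exponents.max()                      # log-sum-exp for stability
        return m + np.log(np.exp(exponents - m).sum())

    def ising_log_prob(r, h, J, log_z):
        """Log-probability of one binary pattern r under Equation 1."""
        return h @ r + 0.5 * r @ J @ r - log_z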

In statistical physics it is more common to use a symmetric representation $\sigma_i \in \{-1, 1\}$ for the ‘spins’ that describe the activation of neuron $i$ (with $-1$ indicating ‘no spike’ and $1$ indicating ‘spike’), which simply corresponds to a change of variables $\sigma_i = 2r_i - 1$. Accordingly the fields, couplings and partition function change. As it is occasionally more convenient to work in one or the other representation, we will denote the fields and couplings in the spin representation by $\tilde{h}_i$ and $\tilde{J}_{ij}$.
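For reference, the mapping between the two parameterisations can be made explicit (a short derivation of our own from Equation 1, with additive constants absorbed into $Z$; the tilde notation is ours): substituting $r_i = (\sigma_i + 1)/2$ into $\sum_i h_i r_i + \tfrac{1}{2}\sum_{i \neq j} J_{ij} r_i r_j$ and collecting terms in $\sigma$ gives

$$\tilde{J}_{ij} = \frac{J_{ij}}{4}, \qquad \tilde{h}_i = \frac{h_i}{2} + \frac{1}{4} \sum_{j \neq i} J_{ij}.$$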

Standard Monte Carlo techniques for fitting these model parameters, such as Boltzmann learning, which can in principle provide an exact solution - given that the number of samples is high enough - become computationally very expensive, if not intractable, as the number of cells increases. We found in previous work (Seiler et al. 2009) that the Boltzmann learning approach becomes computationally too expensive in our case for ensemble sizes larger than 30 cells. This poor scaling behaviour is mainly due to the exponentially increasing number of states with the number of cells. Speeding up the model fitting process is hence an essential requirement for utilizing Ising models in studies with large ensembles of neurons. Solutions to speed up the “classical” Boltzmann learning approach have been suggested (Broderick et al. 2007). However, these are still associated with a high computational cost.

2.2 Neural Decoding

In this paper we consider the problem of decoding which of a number of different stimuli has elicited a neural spike pattern. This can be seen as a discrete classification task: we have a set of $S$ stimuli $s \in \mathcal{S} = \{s_1, s_2, \ldots, s_S\}$. Decoding in this scenario means that we have to provide a decision rule that estimates which stimulus has been the input to the system, given an observed spike pattern $\mathbf{r}^{\mathrm{obs}}$. The particular example to which we apply this is a simulation of the spike pattern responses elicited by visual stimuli across the receptive field of a visual cortical neuron: in this case each stimulus $s_i$ represents a different orientation of a sinusoidal grating. Our main aim with this simulation was to validate our methodology in a neurophysiologically realistic coding regime, relevant to datasets to which our methodology might be applied. As a supplementary goal, we hoped to gain some insight into whether some aspects of the model affect decoding performance - such as heterogeneity of tuning, observed in real neural recordings but often ignored in population coding models.

For decoding, we use the maximum a posteriori (MAP) rule:

$$\hat{s} = \arg\max_s p(s|\mathbf{r}^{\mathrm{obs}}) \qquad (4)$$
$$= \arg\max_s \frac{p(\mathbf{r}^{\mathrm{obs}}|s)\, p(s)}{p(\mathbf{r}^{\mathrm{obs}})} \qquad (5)$$
$$= \arg\max_s p(\mathbf{r}^{\mathrm{obs}}|s)\, p(s), \qquad (6)$$

where the second step is the application of Bayes’ theorem and the third equality holds because $p(\mathbf{r}^{\mathrm{obs}})$ is independent of $s$ and is hence irrelevant for maximising the given expression, i.e. just a constant factor with respect to $s$ that scales the maximum accordingly. In the case we examine here we assume we are in control of the stimulus distribution $p(s)$, and thus we can choose it to be uniform, i.e. to exhibit the same constant probability for each stimulus and therefore be independent of $s$ as well. Hence our decoding rule simplifies further to the maximum likelihood (ML) rule:

$$\hat{s} = \arg\max_s p(\mathbf{r}^{\mathrm{obs}}|s). \qquad (7)$$

Within this setting, the task of creating a neural decoder reduces to the modelling of the stimulus dependent distributions $p(\mathbf{r}|s)$. Once these are obtained, we can apply our ML decoding rule (Equation 7) to estimate the given input stimulus $s$.

We have used two different statistical models to fit the observed spike patterns for each stimulus. Firstly, we have used an Ising model for $p(\mathbf{r}|s)$, i.e. we assume that for each stimulus, the spike pattern distribution can be described by a (different) Ising model. Secondly, we have used an independent model distribution $p_{\mathrm{ind}}$, assuming that given a stimulus, each cell is independent of the others:

$$p_{\mathrm{ind}}(\mathbf{r}|s) = \prod_{i=1}^{C} p(r_i|s). \qquad (8)$$

The independent model is the binary maximum entropy model of first order, i.e. it takes into account only the first order moments (the constraints on the means given by Equation 2a) and is therefore a natural comparison for the Ising model. As it is very easy to fit the independent model, we used this as a control method, to test whether the more complex Ising model could enhance decoding performance. Note that the numerical values for the probabilities can get very small for large cell ensembles, and therefore to evade finite precision problems we use in this case an equivalent log-likelihood decoding rule instead of the ML rule, i.e. we maximise the logarithm of the likelihood instead of maximising the likelihood directly.
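To make the decision rule concrete, here is a minimal Python sketch (our own illustration; function names are hypothetical, not from the paper) of log-likelihood ML decoding, together with the independent model of Equation 8:

    import numpy as np

    def decode_ml(r_obs, log_likelihoods):
        """Maximum likelihood decoding (Equation 7), in log space to avoid
        underflow. log_likelihoods: one function per stimulus, mapping a
        binary response vector to log p(r|s). Returns the winning index."""
        return int(np.argmax([ll(r_obs) for ll in log_likelihoods]))

    def independent_log_likelihood(p_fire):
        """Log-likelihood under the independent model (Equation 8).
        p_fire[i] is the estimated firing probability of cell i for this
        stimulus; clipping guards against log(0)."""
        p = np.clip(p_fire, 1e-12, 1 - 1e-12)
        return lambda r: float(np.sum(r * np.log(p) + (1 - r) * np.log(1 - p)))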

2.3 Training and Testing

To train and test the decoders, we proceed as follows (a code sketch of the data splitting follows the list):

1. For each stimulus we simulate a set of possible response vectors. The details of the simulation are described in the following subsection.

2. We separate the simulated response patterns into training data, which is used to fit the model, and test data, which we use to evaluate the decoding performance of the obtained models.

3. The whole testing procedure is performed with 10-fold cross-validation, i.e. we divide the whole data for each stimulus into 10 equally sized parts. We then use 9 parts of the data to train our model and the remaining one for testing. We repeat this process with all 10 possible test/training data set combinations of this kind to reveal whether our results generalize to the whole dataset.
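A minimal Python sketch of the splitting step (ours; names are hypothetical):

    import numpy as np

    def ten_fold_splits(responses, n_folds=10, seed=0):
        """Yield (train, test) splits for one stimulus. responses: array of
        shape (n_trials, C). Trials are shuffled once, partitioned into
        n_folds equal parts; each part serves as the test set exactly once."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(responses)), n_folds)
        for k in range(n_folds):
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            yield responses[train], responses[folds[k]]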

2.4 Simulation of Evoked Spike Patterns from Mouse Primary Visual Cortex

We simulated the transient response patterns of activity evoked by visual (orientation) stimuli in layer V pyramidal neurons of the anaesthetized mouse visual cortex. The orientation directions were chosen to be $n \cdot 180/S$, where $S$ is the number of stimuli and $n \in \{0, 1, \ldots, S-1\}$. The properties of our simulation are motivated by the results reported by Niell & Stryker (2008). We simulated different models by augmenting a basic model with mostly homogeneous response characteristics with some controlled heterogeneous characteristics.

Our model is defined as follows, with parameters specified in Table 1. The spontaneous activity of each neuron was set to 1.7 spikes per second, corresponding to the reported median value for layer V neurons in (Niell & Stryker 2008). We assumed that neuronal direction preferences were uniformly spaced around the circle. Each neuron’s tuning curve was defined by a von Mises function (circular Gaussian) with half width at half maximum (HWHM) fit to experimental data (Niell & Stryker 2008). The direction selectivity index (DSI) was set to 0.1 for all layer 5 neurons in our model. Sustained firing rates were fit to the distributions reported in (Niell & Stryker 2008). To reflect that we are considering a situation in which a stimulus is decoded from a short time window (20 ms) of data, we multiplied these evoked rates by a fixed transient-to-sustained ratio of 1.5, taken to reflect the onset of the neuron’s response to a flashed stimulus. As our model is fit to data from directional (drifting grating) stimuli, we took the arithmetic mean of the two corresponding diametrically opposite directions for each neuron to compute the model response to a flashed orientation.

The characteristics of the basic model were modulated via two inhomogeneity parameters γ, ξ ∈ [0, 1], to introduce heterogeneous distributions of the firing rates and tuning widths respectively, as described in Table 1. We neglected inhomogeneity in other parameters. Thus the parameter γ regulates firing rate heterogeneity, where γ = 0 corresponded to the basic homogeneous model, and γ = 1 to the most heterogeneous firing rate setting. The parameter ξ was used analogously to regulate heterogeneity in the tuning widths of the neurons. The effects of the two heterogeneity parameters γ, ξ are illustrated in Figure 1.


Parameter | Value | Comments
Preferred tuning direction | uniformly spaced 0°-360° | -
Tuning width (HWHM) | 38 + ξΦ_i | ξ ∈ [0, 1]; Φ_i normally distributed, fitted to Fig. 4f of Niell & Stryker (2008); truncated to a minimum of 10
Direction Selectivity Index | 0.1 | fixed for all neurons; Niell & Stryker (2008), Fig. 5c
Spontaneous firing rate | 1.7 + γX_i (Hz) | γ ∈ [0, 1]; X_i normally distributed about zero, distribution fitted to Fig. 8d of Niell & Stryker (2008); truncated to a minimum of 0.3
Sustained evoked firing rate | 7 + γY_i (Hz) | γ ∈ [0, 1]; Y_i normally distributed, fitted to data in Fig. 8 and supp. fig. S3 of Niell & Stryker (2008); truncated to a minimum of zero
Transient-to-sustained ratio | 1.5 | fixed for all neurons

Table 1: Parameters of the mouse visual cortical model simulated. γ and ξ determine the extent of heterogeneity in the model, with γ = ξ = 0 setting the model properties to homogeneous.

Patterns of spikes fired by the neural population were simulated using a dichotomized Gaussian approach (Macke et al. 2009). Since we cannot estimate covariance matrices from experimental data directly, and not every positive definite symmetric matrix can be used as the covariance matrix of a multivariate binary distribution, we adopted the following approach. First we compute upper and lower covariance bounds for each pair of neurons, according to (Macke et al. 2009)

$$\max\{-pq, -(1-p)(1-q)\} \le \mathrm{Cov}(r_i, r_j) \le \min\{(1-q)p, (1-p)q\}, \qquad (9)$$

where $p$ and $q$ are the means (mean spiking probabilities) of neurons $i$ and $j$, respectively. We then choose a random symmetric matrix $A$ that lies between these bounds. As in general this choice does not result in a permissible correlation matrix for the underlying Gaussian, a Higham (2002) correction is applied to find the closest correlation matrix possible for the latent Gaussian (cf. Macke et al. (2009)), to which we finally arithmetically add a random correlation matrix with uniformly distributed eigenvalues to adjust the mean correlation strength. Having established a dichotomized Gaussian model we can thus draw samples with high efficiency.
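The following Python sketch (ours; a simplification of the pipeline above that uses eigenvalue clipping as a crude stand-in for the full Higham (2002) correction, and omits the bound-checking and correlation-strength adjustment steps) illustrates the core dichotomized Gaussian sampling idea:

    import numpy as np
    from scipy.stats import norm

    def nearest_correlation(A):
        """Crude stand-in for the Higham (2002) correction used in the paper:
        clip negative eigenvalues, then renormalise to unit diagonal."""
        w, V = np.linalg.eigh((A + A.T) / 2)
        B = V @ np.diag(np.clip(w, 1e-6, None)) @ V.T
        d = np.sqrt(np.diag(B))
        return B / np.outer(d, d)

    def sample_dichotomized_gaussian(p, R, n_samples, seed=0):
        """Draw binary patterns via a dichotomized Gaussian (Macke et al. 2009).
        p: target firing probabilities; R: latent correlation matrix. A latent
        sample z ~ N(0, R) is thresholded so neuron i fires with probability p_i."""
        rng = np.random.default_rng(seed)
        thresholds = norm.ppf(1 - np.asarray(p))     # fire iff z_i > threshold_i
        L = np.linalg.cholesky(nearest_correlation(R))
        z = rng.standard_normal((n_samples, len(p))) @ L.T
        return (z > thresholds).astype(int)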

Where not otherwise stated in the text, 10000 trials per stimulus were simulated, allowing 9000 training samples and 1000 test samples with 10-fold cross-validation. In the absence of a detailed characterization of the correlation structure of neural responses in the mouse visual cortex, we assumed that the correlation in firing between each pair of neurons was weak and positive. Our simulation results in a mean correlation of 0.11 and a standard deviation of 0.040 (measured with 100000 samples for different ensemble sizes) for our basic model, and similar levels of correlation for nonzero γ, ξ. Due to limitations of the dichotomized Gaussian simulation, we were not able to specify the correlations between the individual spike trains exactly; thus all reported correlations are measured and may be prone to small variations. However, such limitations would be inherent to any simulation approach, as i) the covariance structure of a multivariate binary distribution is always constrained by the firing probabilities of the cells and cannot be chosen independently of these firing rates (Macke et al. 2009), and ii) finite sampling effects will always affect the simulated data, resulting in fluctuations in the correlation structure.

We were able to vary the (measured) mean absolute correlation level in some simulations, allowing an assessment of the relative effects of correlation strength for decoding. To do this we proceeded as follows: having established the correlation matrix of the latent Gaussian as described before allows us to sample in a regime with correlations around 0.1. Likewise, a latent Gaussian with a correlation matrix given by the identity matrix would correspond to the case where there are no correlations in the latent Gaussian, and thus none in the simulated spike trains either, which means that the simulated spike patterns for each neuron are independent (note that due to the model used in Macke et al. (2009) the correlation matrix and the covariance matrix of the latent Gaussian are the same). Finite sampling effects might still introduce some nonzero measured correlations in the spike patterns, but the underlying distribution would still be one with independent neurons. Thus, by interpolating between these two cases we can reduce the effective correlations in the data in a controlled way, where at one end we recover our original model and at the other end we obtain a set of independent neurons.

2.5 Fitting the Ising Model Parameters

For fitting the model parameters in the Ising model case we use two different strategies: mean field approximations and minimum probability flow learning. In earlier work we used Boltzmann learning (Seiler et al. 2009); however, as this becomes rapidly computationally intractable with an increasing number of neurons, we have not reported it here.

2.5.1 Mean Field Methods

The suitability of different mean field approaches for fitting the parameters of an Ising model has recently been assessed by Roudi, Tyrcha & Hertz (2009). In their work, Roudi et al. compared the model parameters as inferred by Boltzmann learning with the parameters inferred by a number of different approximative algorithms. Here we examine the utility for decoding of using successively higher order mean field approximations. Tanaka (1998) demonstrated how to systematically obtain mean field approximations of increasing order based on a Plefka series expansion (Plefka 2006) of the Gibbs free energy. By truncating this series to terms up to n-th order, and using the linear response correction, it is possible to derive an n-th order mean field approximation, yielding $C$ equations for the external fields, and $C(C-1)/2$ equations for the pairwise couplings (cf. Tanaka (1998)). These equations can then be solved with respect to $\tilde{J}_{ij}$ and $\tilde{h}_i$. For higher order approximations, these equations can have more than one solution. This problem can be resolved by considering that the Plefka series expansion is effectively a Taylor series expansion, and continuity of the solutions is expected when higher order terms are gradually increased (Tanaka 1998).

Tanaka further provided an explanation for the “diagonal weight trick” as used by Kappen & Rodríguez (1998).

With this trick one introduces $C$ extra equations for the pairwise “self-couplings” $\tilde{J}_{ii}$, which can be used to refine the respective approximations. The success of this trick can be explained (Tanaka 1998) by considering that using the diagonal weight trick in an n-th order method effectively incorporates the dominant terms of the next higher, (n+1)-th, order expansion.

For fitting the parameters, we have compared different mean field approximations from zeroth to second order, with and without the diagonal weight trick, thus incorporating with our highest approximations all second order terms and the dominant third order terms of the free energy expansion. The zeroth order method is thereby equivalent to the independent model, thus providing another way of thinking about the independent decoder.

First order methods (naive mean field approximation). For the first order or naive mean field approximation, the equations for the external fields and pairwise couplings become:

$$\tilde{J}_{ij} = -(\mathbf{C}^{-1})_{ij} \quad (i \neq j), \qquad (10a)$$
$$\tilde{h}_i = \tanh^{-1} m_i - \sum_{j \neq i} \tilde{J}_{ij} m_j, \qquad (10b)$$

where $\tilde{\mathbf{J}}$ is the estimated coupling matrix with elements $\tilde{J}_{ij}$, the ‘magnetization’ $m_i = \langle \sigma_i \rangle$, and the covariance matrix $\mathbf{C}$ is defined by $C_{ij} = \langle \sigma_i \sigma_j \rangle - m_i m_j$. In the following we will denote this naive mean field method by nMF.

By incorporating the diagonal weight trick the above equations change slightly into the following:

$$\tilde{J}_{ij} = (\mathbf{P}^{-1})_{ij} - (\mathbf{C}^{-1})_{ij}, \qquad (11a)$$
$$\tilde{h}_i = \tanh^{-1} m_i - \sum_{j} \tilde{J}_{ij} m_j, \qquad (11b)$$

where in addition to the matrices defined above, we have defined the diagonal matrix $P_{ij} = (1 - m_i^2)\,\delta_{ij}$, with $\delta_{ij}$ being the Kronecker delta. Note that in the nomenclature of Roudi, Tyrcha & Hertz (2009), this has been called the “naive mean field method”. However, we will in the following refer to it as the naive mean field method with diagonal weight trick (nMFwd).
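For concreteness, a minimal Python sketch of both first order fits (ours, not the authors' code; it assumes enough samples that the covariance matrix is invertible, and that no magnetization reaches ±1):

    import numpy as np

    def fit_nmf(spins, diagonal_trick=False):
        """First order (nMF/nMFwd) inverse Ising fit, {-1,1} representation
        (Equations 10-11). spins: array (n_samples, C) of -1/+1 entries."""
        m = spins.mean(axis=0)                    # magnetizations <sigma_i>
        C_inv = np.linalg.inv(np.cov(spins, rowvar=False))
        J = -C_inv
        np.fill_diagonal(J, 0.0)                  # nMF: off-diagonal couplings only
        if diagonal_trick:
            # nMFwd: J = P^{-1} - C^{-1} with P_ii = 1 - m_i^2
            np.fill_diagonal(J, 1.0 / (1.0 - m**2) - np.diag(C_inv))
        h = np.arctanh(m) - J @ m                 # Equations 10b / 11b
        return J, h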


Second order methods (TAP approximation). The equations of the second order method are also known as the TAP equations (Thouless et al. 1977) and can be seen as a correction to the naive mean field methods. Using the TAP approach the equations for the model parameters read:

$$2\tilde{J}_{ij}^2 m_i m_j + \tilde{J}_{ij} + (\mathbf{C}^{-1})_{ij} = 0 \quad (i \neq j), \qquad (12a)$$
$$\tilde{h}_i = \tanh^{-1} m_i - \sum_{j \neq i} \tilde{J}_{ij} m_j + m_i \sum_{j \neq i} \tilde{J}_{ij}^2 (1 - m_j^2), \qquad (12b)$$

where the first equation can be solved for the pairwise couplings $\tilde{J}_{ij}$. As mentioned previously, the correct solution has to be chosen according to the continuity conditions outlined in Tanaka (1998), from which the external fields $\tilde{h}_i$ can then be computed. More precisely, if $m_i m_j (\mathbf{C}^{-1})_{ij} < 0$ we choose the solution which is closer to the original first order mean field solution. If $m_i m_j (\mathbf{C}^{-1})_{ij} > 0$ we use the first order mean field solution directly. We use this procedure as it avoids pairwise couplings becoming complex, and respects the continuity of the inverse Ising problem for $\tilde{J}_{ij}$ as a function of $(\mathbf{C}^{-1})_{ij}$. In the following this method is denoted by TAP.
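The root selection can be sketched as follows (our reading of the procedure just described; helper names are ours, and the fallback branch mirrors the rule of using the first order solution when $m_i m_j (\mathbf{C}^{-1})_{ij} > 0$):

    import numpy as np

    def tap_couplings(m, C_inv):
        """Solve the TAP equation (12a), 2 J^2 m_i m_j + J + (C^-1)_ij = 0,
        for each pair of cells, choosing the root closer to the nMF value."""
        C = len(m)
        J = np.zeros((C, C))
        for i in range(C):
            for j in range(i + 1, C):
                a, c = 2.0 * m[i] * m[j], C_inv[i, j]
                nmf = -c                          # first order (nMF) solution
                disc = 1.0 - 4.0 * a * c          # discriminant of the quadratic
                if a * c < 0 and abs(a) > 1e-12 and disc > 0:
                    roots = np.array([(-1.0 + np.sqrt(disc)) / (2 * a),
                                      (-1.0 - np.sqrt(disc)) / (2 * a)])
                    J[i, j] = roots[np.argmin(np.abs(roots - nmf))]
                else:
                    J[i, j] = nmf                 # fall back to the nMF solution
                J[j, i] = J[i, j]
        return J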

Incorporating third order terms. The second order method can also be augmented by the diagonal weight trick, hence incorporating the leading third order terms of the free energy expansion. The equations for the TAP method with diagonal weight trick (TAPwd) are given by:

$$2\tilde{J}_{ij}^2 m_i m_j + \tilde{J}_{ij} + (\mathbf{C}^{-1})_{ij} = 0 \quad (i \neq j), \qquad (13a)$$
$$\tilde{J}_{ii} = \frac{1}{1 - m_i^2} - (\mathbf{C}^{-1})_{ii}, \qquad (13b)$$
$$\tilde{h}_i = \tanh^{-1} m_i - \sum_{j} \tilde{J}_{ij} m_j, \qquad (13c)$$

where we have solved the equations in an analogous fashion to that described for the normal TAP approach.

2.5.2 Minimum Probability Flow Learning

Sohl-Dickstein et al. (2009) recently proposed the Minimum Probability Flow Learning (MPFL) technique, which provides a general framework for learning model parameters. As this technique is also applicable to the Ising model, we have used it to learn external fields and pairwise couplings for our model. However, as the sampling regime usually feasible in neurophysiological experiments dictates a small number of samples compared to the number of parameters in the model (which is $O(C^2)$ with $C$ cells), the learning problem for the parameters becomes under-constrained already at intermediate neural ensemble sizes, i.e. we are likely to have more parameters to fit than there are samples.

We therefore introduced a regularization term into their original objective function to penalize model parameters growing to large values, i.e. to avoid overfitting. Given the original objective function $K(\theta)$, with $\theta$ being the parameters of our model, our regularized objective function reads:

$$K_{\mathrm{reg}}(\theta) = K(\theta) + \frac{\lambda}{2} \|\theta\|_2^2, \qquad (14)$$

where $\| \cdot \|_2$ is the $L_2$ norm, which is a common choice of regularization term (Bishop 2007). So for the Ising model case we have:

$$\|\theta\|_2^2 = \sum_i h_i^2 + \sum_{i \neq j} J_{ij}^2.$$

For the present work, the regularization parameter was set to $\lambda = 0.0127$, after systematically assessing different settings for an ensemble size of 100 cells via cross-validation. We refer to this learning algorithm as rMPFL (regularized MPFL) in the rest of the manuscript.
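As an illustration (our sketch, not the authors' code: we write the single-bit-flip MPF objective for the Ising model of Equation 1, following the general recipe of Sohl-Dickstein et al. (2009), plus the penalty of Equation 14; in practice this objective and its gradient would be handed to a generic optimizer):

    import numpy as np

    def k_mpf_regularized(h, J, R, lam):
        """Regularized MPF objective (Equation 14) for the {0,1} Ising model,
        with the standard single-bit-flip neighbourhood. R: observed binary
        patterns, shape (n_samples, C); J symmetric with zero diagonal.
        Flipping bit i of pattern r changes the exponent of Equation 1 by
        (1 - 2 r_i) * (h_i + (J r)_i), and MPF sums exp(delta / 2) over all
        such flips, averaged over samples."""
        fields = h + R @ J                        # h_i + sum_j J_ij r_j per sample
        delta = (1.0 - 2.0 * R) * fields
        K = np.mean(np.sum(np.exp(delta / 2.0), axis=1))
        reg = 0.5 * lam * (np.sum(h**2) + np.sum(J**2))
        return K + reg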

Other choices for the regularization term are possible and might even result in better performance for decoding purposes, e.g. two independent penalty terms for the external fields and the pairwise couplings. However, an extensive assessment of different parameter settings would be very time consuming due to the cost of calculating the invoked partition function. We therefore have not performed an exhaustive analysis of regularization. In a real experiment the regularization term should, however, be adapted to yield the best possible performance for the specific number of cells. We note that while it is not necessary to compute the partition function for some applications of MPFL (e.g. if learning the $J_{ij}$ parameters is an end in itself), it is required for decoding.


2.6 Partition Function Estimation

Estimating the partition function is a computationally expensive task, since the set of possible responses $\mathcal{R}$ grows exponentially with the number of cells $C$, rendering an analytical computation (Equation 3) intractable for large neural ensembles.

As MPFL learning does not provide an estimate of the partition function, we used the Ogata-Tanemura partition function estimator (Ogata & Tanemura 1984, Huang & Ogata 2001), which is based on Markov Chain Monte Carlo (MCMC) techniques. However, MCMC is still a very time consuming technique. To speed up the model fitting process, we estimated the partition function for each stimulus only once in a 10-fold cross-validation run when using MPFL, as ideally all samples for a specific stimulus should come from the same distribution, thus approximately sharing the same partition function. We have examined the effect of this approximation (see Results).

When fitting the model parameters with mean field theoretic approaches, we computed the (true) partition function $Z(s)$ in the mean field approximation, as reported in Kappen & Rodríguez (1998), Thouless et al. (1977), and Tanaka (1998).

For the first order mean field approach this yields:

$$\log Z = \sum_i \log\left( 2 \cosh(\tilde{h}_i + W_i) \right) - \sum_i W_i m_i + \sum_{i<j} \tilde{J}_{ij} m_i m_j, \qquad (15)$$

with

$$W_i = \sum_{j \neq i} \tilde{J}_{ij} m_j.$$

Here each of the parameters is actually a function of the stimulus $s$, which we omit for clarity.
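A direct transcription of Equation 15 into Python (our sketch; it assumes the spin representation with symmetric couplings and zero diagonal):

    import numpy as np

    def log_z_nmf(h, J, m):
        """First order mean field log Z (Equation 15), {-1,1} representation.
        m: magnetizations; J: symmetric couplings with zero diagonal."""
        W = J @ m                                 # W_i = sum_{j != i} J_ij m_j
        pair = 0.5 * m @ J @ m                    # equals sum_{i<j} J_ij m_i m_j
        return np.sum(np.log(2.0 * np.cosh(h + W))) - W @ m + pair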

For the second order methods, the corresponding equation becomes:

$$\log Z = \sum_i \log\left( 2 \cosh(\tilde{h}_i + L_i) \right) - \sum_i L_i m_i + \sum_{i<j} \tilde{J}_{ij} m_i m_j + \frac{1}{2} \sum_{i<j} \tilde{J}_{ij}^2 (1 - m_i^2)(1 - m_j^2), \qquad (16)$$

with

$$L_i = \sum_{j \neq i} \tilde{J}_{ij} m_j - m_i \sum_{j \neq i} \tilde{J}_{ij}^2 (1 - m_j^2).$$

2.7 Performance evaluation

The fraction of correctly decoded trials was the principal method used to assess decoding performance. However, as the fraction correct or accuracy does not by itself provide a complete description of decoder performance, we sought additional performance measures. The performance of a decoder is fully described by its confusion matrix, and we show how directly examining this matrix can yield insight into its behaviour. However, it is advantageous to be able to reduce this to a single number in many circumstances. We therefore additionally computed the mutual information between the encoded and decoded stimulus (Panzeri, Treves, Schultz & Rolls 1999) to characterise the performance further. This provides a compact summary of the information content of the decoding confusion matrix.

We can write the mutual information (measured in bits) as:

$$I(s, \hat{s}) = H(s) - H(s|\hat{s}), \qquad (17)$$

where $H(s)$ is the entropy of the encoded stimulus,

$$H(s) = -\sum_{s \in \mathcal{S}} p(s) \log_2 p(s), \qquad (18)$$

and $H(s|\hat{s})$ is the conditional entropy describing the distribution of stimuli $s$ that have been observed to elicit each decoded state $\hat{s}$,

$$H(s|\hat{s}) = -\sum_{s, \hat{s} \in \mathcal{S}} p(s, \hat{s}) \log_2 p(s|\hat{s}). \qquad (19)$$

Since we have in the current study opted for a uniform stimulus distribution, the entropy $H(s)$ is simply given by

$$H(s) = \log_2 S. \qquad (20)$$

In general the conditional entropy $H(s|\hat{s})$ has to be computed from the confusion matrix. We note that if we were to assume that the correctly decoded stimuli and errors are uniformly distributed for all stimuli, i.e. that the conditional distribution $p(s|\hat{s})$ is of the form

$$p(s|\hat{s}) = \begin{cases} f_c & \text{for } s = \hat{s} \\ \dfrac{1 - f_c}{S - 1} & \text{for } s \neq \hat{s}, \end{cases}$$

then the conditional entropy simplifies to

$$H(s|\hat{s}) = -f_c \log_2 f_c - (1 - f_c) \log_2\left( \frac{1 - f_c}{S - 1} \right).$$

This simplified expression has been used to characterize decoder performance in the Brain Computer Interface literature (Wolpaw et al. 2002). Here, we present this equation only to make apparent the scaling behaviour, and compute the decoded information using the more general expression (Eqn. 17).
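For completeness, computing the decoded information from a confusion matrix takes only a few lines (our sketch; the epsilon guard against log of zero is ours):

    import numpy as np

    def decoded_information(confusion, eps=1e-12):
        """Mutual information (bits) between encoded and decoded stimulus,
        from a confusion matrix (Equation 17). confusion[s, s_hat] counts
        trials with true stimulus s decoded as s_hat."""
        p_joint = confusion / confusion.sum()     # p(s, s_hat)
        p_s = p_joint.sum(axis=1)                 # p(s)
        p_shat = p_joint.sum(axis=0)              # p(s_hat)
        H_s = -np.sum(p_s * np.log2(p_s + eps))
        p_cond = p_joint / (p_shat + eps)         # p(s | s_hat), column-wise
        H_cond = -np.sum(p_joint * np.log2(p_cond + eps))
        return H_s - H_cond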

3 Results

We performed computer simulations as described in Methods, to generate datasets for training and testing decoding algorithms. For all settings, 20 simulations were performed with different random number seeds, in order to characterize decoding performance. A number of metrics, including the fraction of correct decodings (accuracy) and the mutual information between decoded and presented stimulus distributions, were used in order to characterize and compare decoding performance.

Basic Model

The Ising model based decoders show better performance than the independent decoder in nearly all cases (Fig. 3a), in terms of the average fraction of correctly decoded stimuli. The performance of the standard (non-regularized) MPFL technique, however, in our hands falls away relatively quickly as the number of cells increases, failing to better the independent decoder after approximately 100 cells. This behaviour can be explained by considering that the problem of parameter estimation becomes more and more underconstrained as the number of cells increases while holding the number of training samples fixed. Falsely learned model parameters moreover affect the decoding performance by influencing the estimated partition function and thus worsen the decoder performance. As we have only estimated the partition function once per stimulus when using MPFL, large fluctuations in the training dataset can potentially have a big effect. To compensate for this behaviour, a regularization term can be included, which can stabilize the performance up to a significantly larger number of neurons (as described in Methods). Our regularized version of MPFL still decreases in performance after about 110 cells. However, we have used a fixed value for the regularization parameter for all simulations here, whereas ideally the regularization term should be adapted to the number of cells and number of samples in the specific setting. Using such an approach would most likely result in better performance for larger numbers of cells. As an example, we tested an optimized λ parameter for 150 cells and found that we could increase performance to 0.874 in this case (average over 10 trials, not shown in figure).

One of the assumptions made for much of this paper is that the partition function can be estimated only once in each 10-fold cross-validation run, thus speeding up training. We studied the effect of this approximation on performance, finding, as shown in Figure 4, that the effect is marginal, at least under our operating conditions. No statistically significant difference between the two approaches was observed. In a decoding regime where the rMPFL method starts to infer the wrong model parameters (e.g. due to overfitting) one would expect these effects to become more pronounced. However, in such a scenario, where the rMPFL method becomes unreliable, a different regularization term or a different inference method should be considered.

The decoded information analysis reveals that the difference in the decoding performance of the independent and Ising (as exemplified by TAP, TAPwd and rMPFL) models becomes more pronounced as the number of stimuli is increased, as shown in Fig. 3b,c. As the number of stimuli increases, the independent and Ising decoder curves separate, indicating not only a difference in the accuracy of both decoders, but also a difference in the confusion matrices, i.e. in the distribution of errors between the two approaches. This is considered in more detail in section 3.2.

Figure 4: Decoding performance effects of reduced partition function estimation for the rMPFL method, shown as a box-whisker plot of the fraction correct. The interquartile range is indicated by a blue box, the median value by a red line. Whiskers visualize the spread of the remaining data for values at most within 1.5 times the interquartile range from the ends of the box. Left: results when computing the partition function for each stimulus every round of 10-fold cross-validation. Right: results when computing the partition function only in the first round for each stimulus. Data from 20 simulation runs, 70 cells, basic model, 10000 samples simulated per stimulus.

A further interesting behavior of the Ising decoder is apparent in Fig. 3c: as the number of stimuli increases, the relative decoding gain η (which we define as the ratio between the “actual” and “chance” fraction correct) keeps increasing for the Ising model case, whereas it saturates for the independent decoder. This effect can be explained as follows: the independent decoder relies on the fact that two different stimuli will result in different firing rates for each cell. With an increasing number of stimuli, however, the difference between two adjacent stimuli becomes smaller and smaller, and thus the difference in the firing rates discriminating two adjacent stimuli becomes smaller and smaller. Therefore the performance of the independent decoder rapidly decreases as the number of stimuli increases. As two adjacent stimuli can result in neural responses that have quite different correlation structure despite having very close firing rates, the Ising decoder can additionally make use of this information in the data and thus yield better performance. This suggests that the Ising decoder may be particularly advantageous when the decoding problem becomes more difficult and stimuli are not easily discriminable in terms of the neural firing rates. Real world performance will of course be dependent upon the level of systematic difference in correlation structure induced by stimuli in any given dataset.

Performance of the Ising decoder is strongly dependent on the number of training trials available (Fig. 3d i). Here we found, for 70 neurons, that around 400 training samples were required to allow the Ising decoder to outperform independent decoding. The independent decoder will necessarily have better sampling performance, as it relies only upon lower order response statistics. (It is worth recalling that a “full” decoder, which made use of all aspects of spike pattern structure, would have far worse scaling behaviour than either.) Another way of looking at this is to examine the estimated covariance matrix from the simulated data (Fig. 3d ii). We define the mean relative error of the covariance matrix as

$$E = \frac{1}{S} \sum_{s \in \mathcal{S}} \frac{1}{C^2} \sum_{i,j} \left| \frac{C^{\mathrm{data}}_{ij,s} - C^{\mathrm{as}}_{ij,s}}{C^{\mathrm{as}}_{ij,s}} \right|, \qquad (21)$$

where $C^{\mathrm{data}}_{ij,s}$ is the estimated covariance between units $i$ and $j$ under stimulus $s$ from the (finite) simulated data (normally 10000 samples), and $C^{\mathrm{as}}_{ij,s}$ is the asymptotic covariance, defined in the same way but computed with 100,000 samples. This error provides a measure of the finite sampling bias we encounter in fitting the model.
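A direct transcription of Equation 21 (our sketch; it assumes the asymptotic covariances are bounded away from zero):

    import numpy as np

    def mean_relative_cov_error(cov_data, cov_asym):
        """Mean relative error of the covariance matrix (Equation 21).
        cov_data, cov_asym: arrays of shape (S, C, C) with per-stimulus
        covariances from the finite and the 100,000-sample estimates."""
        S, C, _ = cov_data.shape
        return np.abs((cov_data - cov_asym) / cov_asym).sum() / (S * C**2)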

Heterogeneous Models

The performance characteristics in a heterogeneous scenario, i.e. with nonzero ξ, γ, are for the most part broadly similar to the homogeneous case, so we report here only on the observed differences. The overall classification performance, both in terms of fraction correct and mutual information, is slightly improved for both independent and Ising decoders with heterogeneous neural ensembles. As an example, the accuracy as a function of ensemble size is compared for the basic model and the “fully heterogeneous” (γ = ξ = 1) model in Fig. 5, for both TAPwd and independent decoders. The greater performance for nonzero γ, ξ can be explained by the greater variability of cell properties, allowing more specific response patterns than in the homogeneous scenario. Such a scenario is presumably relevant to many real-world decoding problems, suggesting that decoder performance analysed with homogeneous test data may slightly under-represent real-world performance.

Figure 5: Decoder performance is enhanced in the heterogeneous scenario (ξ = γ = 1). Comparison of fraction correct vs. number of cells for the basic model and the fully heterogeneous model. The TAPwd algorithm was used to train the decoder.

We also assessed the relative influence of the individual parameters γ, ξ on the decoding behavior, as shown in Figure 6. Our analysis shows that, while heterogeneity in the firing rates as specified by γ has a positive effect on decoding performance, increased tuning width heterogeneity slightly decreases the performance of the decoder. However, the relative influence of the parameter ξ is small.

3.1 Dependence on level of correlation

As the advantage of the Ising decoder over the independent decoder stems from its ability to take advantage of information contained in pairwise correlations, we examined the dependence of this advantage on the average strength of correlation. Although we have set the average level of correlation to what has traditionally been thought to be a reasonable level for cells in the same neighbourhood (Zohary & Shadlen 1994), there is an ongoing debate about the level and stimulus dependence of correlation relevant for cortical function (Renart et al. 2010, Bair et al. 2001, Nase et al. 2003, Ecker et al. 2010). This is of course critical for the performance of the Ising decoder. If there were no (noise) correlations present in the data, i.e. if an independent decoder were the correct model for the data, there would be no benefit to using any decoder that includes correlations, such as the Ising decoder. In fact, any decoder including correlations would most likely perform worse in practice than an independent decoder, as due to finite sampling effects one would most likely falsely estimate some small correlations in the data. Overfitting to these falsely learned correlations would then result in a performance decrease compared to an independent decoder, which by construction does not include any correlations, i.e. would implement the correct model.

Figure 6: Decoder performance dependence on parameters ξ, γ. With increasing γ the decoding performance is enhanced. The tuning width heterogeneity as specified by ξ reduces the performance of the decoder (values from upper left to lower right curve: ξ = {0, 0.2, 0.4, 0.6, 0.8, 1}). Data for 70 cells, TAPwd algorithm used to train the decoder.

This effect can be seen in Figure 7, where we have varied the mean absolute correlation strength as outlined in the Methods section. It can be seen that as the level of correlation increases for a specified spike count, the independent decoder loses some discriminative capacity. The Ising decoders, however, take advantage of the higher level and spread of correlation values for discrimination between stimuli. As described in Methods, at the lower end of this curve the underlying latent Gaussian has an identity correlation matrix, and thus the individual neurons are actually independent, although the average measured absolute correlation does not completely decrease to zero. This provides an explanation of why the Ising decoder fails to beat the independent decoder here: by assuming the measured correlations are due to the true distribution, it overfits to these correlations, and thus performs worse compared to the (by construction correct) independent model. It should be noted, however, that the performance drop of the Ising decoder compared to the independent decoder is small even in a regime where cells are effectively independent.

Figure 7: The Ising decoder benefits from a correlated regime. Upper panel: Fraction correct as a function of the measured average absolute Pearson correlation coefficient for four decoding algorithms. Lower panel: Average standard deviation of the Pearson correlation coefficient vs. average absolute Pearson correlation coefficient. All averages taken over 20 simulations; 70 cells; basic model; correlations measured with 10000 samples.

To mimic the richer correlation structures potentially found in real neural recordings, we performed further testing. We simulated larger ensembles of neurons while keeping the number of observed cells fixed, thus effectively creating “hidden” neurons. The visible neurons were simulated as described in Methods, while for every unobserved cell the preferred tuning direction was chosen randomly according to a uniform distribution. By using this scheme we could ensure that the observed $C_{\mathrm{visible}}$ cells always had the same characteristics except for the correlation structure, which was effectively changed by the introduction of $C_{\mathrm{hidden}}$ unobserved neurons. It should be noted that introducing hidden cells alters not only the second order statistics but also higher order correlations in the data, thus providing a much richer statistical structure in the data.

Figure 8: Decoder robustness to correlations due to hidden units. Fraction correct $f_c$ for 100 cells, normalized to the fraction correct $f_{c0}$ where 100 out of 100 cells are visible, is displayed as a function of the ratio between $C_{\mathrm{hidden}}$ hidden cells and $C_{\mathrm{visible}}$ observed cells, for the Ind, TAP, TAPwd and rMPFL decoders. The basic model was used for simulations.

The different decoding (and training) methods vary in their robustness to the introduction of such higher order correlation structure (Fig. 8). While the independent decoder is relatively unaffected by the introduction of additional unobserved cells, the Ising decoder model is more sensitive to such changes. However, significant differences between the different training strategies can be observed: in our case the rMPFL method showed the smallest performance drop (about 10%) in a scenario where approximately 100 out of 500 cells were observed, while the fraction correct for the standard TAP approach drops by more than 15% in this scenario. TAPwd is less affected than TAP, consistent with its inclusion of the leading terms of the next order expansion.


3.2 Confusion Matrix Analysis

The confusion matrix provides complete information about the decoding error distribution. Confusion matrices for the Ising decoder and the independent decoder are shown in Fig. 9. The Ising decoder has higher diagonal terms, corresponding to better decoding accuracy. The overall appearance of the Ising decoder confusion matrix is fairly similar to that of the independent decoder. However, by comparing the two confusion matrices it can be seen that the Ising decoder mainly gains its performance benefit over the independent decoder by avoiding confusion of adjacent stimuli. This shows that as the difference between adjacent stimulus directions becomes smaller with an increasing number of stimuli, the Ising model decoder can utilize the correlation patterns to enhance the decoding accuracy, i.e. to distinguish between adjacent stimulus directions more precisely. This effect of course depends on the correlation model used - here the correlation between each pair of neurons was resampled for each stimulus in the simulation. If (noise) correlations were not at all stimulus-dependent, or if the model were quite different, then the Ising decoder might not be able to realise this potential performance advantage.

3.3 Comparison with linear decoding

While most work in the Brain-Machine Interface literature has focused on continuous decoders, there has been some work on the discrete decoding problem, although to date with a focus on the analysis of either continuous data, such as electroencephalographic (EEG) data (Wolpaw et al. 2002), or longer time windows of multi-electrode array data, in which spike counts are far from binary (Santhanam et al. 2009). As the discrete decoding problem can be viewed as a classification problem (in the same sense as the continuous decoding problem can be seen as a regression problem), it is of interest to compare the performance of our approach with traditional classification approaches such as the Optimal Linear Classifier (OLC).

Following Bishop (2007), each stimulus class can be described by its own linear model, so that

$$y_s = \mathbf{w}_s^T \vec{\sigma} + w_{s0}, \qquad (22)$$

where $s = 1, \ldots, S$. Using a 1-of-$S$ binary coding scheme (i.e. we denote stimulus class $s_i$ by a “target” column vector $\mathbf{t}$ with all zeros except the $i$-th entry, which is one), the weights $\mathbf{w}_s$ can be trained so as to minimize a sum-of-squares error function for the target stimulus vector.
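A sketch of this classifier (ours, following the standard least-squares recipe in Bishop (2007); function names are hypothetical):

    import numpy as np

    def fit_olc(X, labels, S):
        """Least squares linear classifier with 1-of-S target coding
        (Equation 22; cf. Bishop 2007). X: (n_trials, C) response patterns;
        labels: integers in 0..S-1. Returns W of shape (C + 1, S), the last
        row holding the biases."""
        n = len(labels)
        T = np.zeros((n, S))
        T[np.arange(n), labels] = 1.0             # 1-of-S targets
        Xb = np.hstack([X, np.ones((n, 1))])      # append bias column
        W, *_ = np.linalg.lstsq(Xb, T, rcond=None)
        return W

    def predict_olc(W, x):
        """Decode by taking the class with the largest linear output."""
        return int(np.argmax(np.append(x, 1.0) @ W))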

This comparison is shown in Fig. 10. It can be seen that, under the conditions we test here (10000 trials, simulated data as described previously for the basic model), the OLC underperforms both the Ising (as exemplified by the TAPwd approach) and independent classifiers. The former is not unexpected, but it may seem initially counter-intuitive that the OLC does not yield identical performance to the independent decoder, as the latter is in effect performing a linear classification.

However, the following differences between these two algorithms must be noted. As their implementation details differ, they may have markedly different sensitivity to limited sampling - the independent decoder, as we have constructed it here (product of marginals), has remarkable sampling efficiency. Most importantly, however, the OLC decoder assumes by construction a Gaussian error in the stimulus class target vectors, i.e. it corresponds to maximum likelihood decoding under the assumption that the target vectors follow a Gaussian conditional distribution (Bishop 2007). This assumption is clearly not valid in the case of binary target vectors. Therefore the failure of the OLC decoder should not be surprising.

Figure 10: Comparison of the TAPwd Ising decoder and the independent decoder with the Optimal Linear Classifier (OLC). Fraction correct vs. number of cells.

4 Discussion

We have demonstrated, for the first time, the use of the Ising model to decode discrete stimulus states from simulated large-scale neural population activity. To do this, we have had to overcome several technical obstacles, namely the poor scaling properties of previously used algorithms for learning model parameters, and similarly the poor scaling behavior of methods for estimating the partition function, which, although not necessary for some applications of the Ising model in neuroscience, is required for decoding. The Ising model has one particular advantage over a simpler independent decoding algorithm: it can take advantage of stimulus dependence in the correlation structure of neuronal responses, where it exists. With the aid of a statistical simulation of neuronal ensemble spiking responses in the mouse visual cortex, we have demonstrated that correlational information can be taken advantage of for decoding the activity of neuronal ensembles numbering in the hundreds, by several algorithms, including mean field methods from statistical physics and the rMPFL algorithm.

Ising models have recently attracted much attention in neuroscience as descriptions of the spike train statistics of neural ensembles (Schneidman et al. 2006, Shlens et al. 2006, Shlens et al. 2009). However, these findings have largely been made only in relatively small neural ensembles (typically a few tens of cells), from which an extrapolation to larger ensemble sizes might not be wise (Roudi, Nirenberg & Latham 2009, Roudi, Aurell & Hertz 2009). The principal reason for this limit has been the poor scaling of the computational load of fitting the Ising model parameters when algorithms such as Boltzmann learning are used. Moreover, new findings suggest that pairwise correlations (and thus Ising models) might not be sufficient to predict spike patterns of small scale local clusters of neurons (< 300 µm apart), which have been observed to provide evidence of higher order interactions (Ohiorhenuan et al. 2010). While the formalism for higher order models may be similar, their scaling properties are guaranteed to be even worse. There is thus a pressing need to develop better algorithms for learning the parameters of Ising and Ising-like models. It should be noted that while Ising models might not necessarily provide an exact description of neural spike train statistics, for decoding purposes we only require that we can approximate the response distribution well enough to achieve good decoding performance.

The development of discrete neural population decoding algorithms has two motivations. The first motivation is the desire to develop brain-computer communication devices for cognitively intact patients with severe motor disabilities (Mak & Wolpaw 2009). In this type of application, an algorithm such as those we describe could be used together with multi-electrode brain recordings to allow the user to select one of a number of options (for instance a letter from a virtual keyboard), or even, in the longer term, to communicate sequences of symbols from an optimized code directly into a computer system or communications protocol. Given the short timescale to which the Ising decoder can be applied (we have fixed this at 20 ms here), and sufficiently large recorded ensembles to saturate decoder performance, very high bit rates could potentially be achieved.

The second motivation is more scientific: to use such decoding algorithms to probe the organization and mechanisms of information processing in neural systems. It should be immediately apparent that the Ising decoder and related models can be used to ask questions about the neural representation of sensory stimuli, motor states or other behavioural correlates, by comparing decoding performance under different sets of assumptions (for instance, by changing the constraints in an Ising model to exclude correlations, include only correlations within 50 µm, and so on). This is essentially the same use to which Shannon information theory has been applied in neuroscience, commonly referred to as the "encoding problem" (see e.g. Schultz et al. (2009) for a recent example), with simply a different summary measure: decoding performance may be an intuitively convenient way to ask such questions, but it is still asking exactly the same question. However, there are other uses to which such algorithms can be applied. For instance, combining sensory and learning/memory experimental paradigms, once a decoder has been trained it could subsequently be used to read out activity patterns in different brain states such as sleep, or following some period of time (for instance, to "read out memories" by decoding the patterns of activity that represent them). The decoding approach may thus have much to offer the study of information processing in neural circuits.

Our results show that decoding performance is critically dependent on the sample size used for training the decoder, as a relatively precise characterization of pairwise correlations is needed to fit a model that matches the statistical structure of as-yet-unobserved data well. In the "encoding problem", such finite sampling constraints result in a biased estimate of the entropy of the system. For the decoding problem considered here, finite sampling leads to overfitting of the model to the observed training data, so that it generalizes poorly to unobserved data and accordingly fails to predict stimulus classes correctly during test trials. Such finite sampling constraints mean that below a particular sample size (which we found to be 400 trials for 70 neurons in one particular example we studied) there is no point in using a model which attempts to fit pairwise (or higher order) correlations; one may as well use an independent model. This has implications for experimental design. It should be noted, however, that the real brain has no such limitation: in effect, many thousands or millions of trials are available over development, and so a biological system should certainly be capable of learning the correct correlations from the data (Bi & Poo 2001), and may thus well be able to operate in a regime where decoding benefits substantially from known correlation structure.
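
In practice, whether a given training-set size supports a pairwise model can be checked directly by held-out validation. The sketch below assumes hypothetical fit(words, labels) and decode(model, words) callables (for instance an independent or a TAP-fitted Ising decoder) operating on NumPy arrays; it is illustrative rather than a prescription:

import numpy as np

def heldout_accuracy(fit, decode, words, labels, n_folds=5, seed=0):
    # Cross-validated decoding accuracy for one candidate model family.
    # fit(words, labels) -> model; decode(model, words) -> predicted labels.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(labels)), n_folds)
    accuracies = []
    for k, test in enumerate(folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        model = fit(words[train], labels[train])
        accuracies.append(np.mean(decode(model, words[test]) == labels[test]))
    return float(np.mean(accuracies))

If the pairwise decoder fails to beat the independent decoder on held-out folds, its additional parameters are not being estimated reliably at that sample size and the simpler model should be preferred.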

We have shown that incorporating correlations in the decoding process may be especially relevant for 'hard' decoding problems, i.e. multi-class discrimination problems in which stimuli are not easily distinguishable from individual neuronal firing rates alone. In this scenario, including correlations can enhance the precision of the decoding process by increasing the discriminability between adjacent or similar stimuli. Including correlations could also make the pattern distribution flatter, or more uniform, with low firing rates, leading to greater energy efficiency of information coding (Baddeley et al. 1997).

We must insert a note of caution concerning the presence of higher order correlations in data to be fitted. Such higher order correlations might arise simply from the presence of a large number of "hidden" neurons whose activity has not been recorded, but which have a substantial influence on the activity of those neurons recorded. This is likely to occur frequently in real neurophysiological situations. The performance of pairwise correlation decoder models, such as the Ising model, is necessarily affected detrimentally by such effects, as shown in Fig. 8. Interestingly, an independent decoder is less affected (although it may also be capturing less information anyway). Obviously, it is possible to make use of higher order models to alleviate this problem, but a penalty then has to be paid in sampling terms.

We note that the primary use of the Ising model in neuroscience so far has been to model the empirical statistics of neural spike train ensembles (Schneidman et al. 2006, Shlens et al. 2006). There is of course no requirement that a good decoder is also a good model of neural spike train statistics; what matters is its performance on the test dataset. Nevertheless, knowing how well the model captures spike train pattern structure may help to build better decoders, and may be of particular value when those decoders are used to study neural information processing (as opposed to being used for a practical purpose such as brain-machine interface development). Unfortunately, a direct comparison of empirical to model spike pattern probabilities is not experimentally feasible for large ensemble sizes: while sampling the 2^15 (roughly 33,000) possible patterns of 15 cells is relatively simple, the 2^500 patterns of a 500-cell ensemble put this far beyond reach. Further work is needed to determine how best to evaluate the performance of decoders at predicting empirical spike patterns when decoding very large neural ensembles.
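
For small ensembles, the standard check (e.g. Schneidman et al. 2006) plots empirical pattern probabilities against those predicted by the model; estimating the empirical side is only feasible while 2^N remains small. A minimal sketch of the empirical side (the interface is ours, for illustration):

import numpy as np
from collections import Counter

def empirical_pattern_probs(words):
    # Empirical probability of each observed binary word (rows of `words`).
    # Feasible only for small N: there are 2**N possible patterns to sample.
    counts = Counter(map(bytes, np.asarray(words, dtype=np.uint8)))
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}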

One caveat concerning the advantage of the Ising over the Independent decoder is that it depends entirely upon the extent to which correlations are found to depend upon the stimulus variables of interest. While a previous study at longer timescales found correlations to improve neural decoding (Chen et al. 2006), the jury is still out on the prevalence of stimulus dependence of pairwise and higher order correlations in the mammalian cortex. Stimulus-dependent correlations have been found in mouse (Nase et al. 2003), cat (Das & Gilbert 1999, Samonds & Bonds 2005) and monkey (Kohn & Smith 2005, Kohn et al. 2009), where they have been shown to contribute to information encoding (Montani et al. 2007), but most recordings to date have sampled relatively sparsely from the local cortical circuit, owing to limitations in multi-electrode array hardware. It is possible that were one able to record from a greater proportion of neurons in the local circuit, stronger stimulus-dependent correlations might be observable.

In an experimental setting, performance could also be enhanced by choosing an optimal time-bin width, a question that we have not addressed in this paper. However, care is needed in choosing the bin width. As shown by Roudi et al. (Roudi, Aurell & Hertz 2009, Roudi, Nirenberg & Latham 2009), a small bin width is likely to yield a good fit, but too small a bin width invalidates the underlying assumption of uncorrelated time bins. Too large a time bin, on the other hand, makes it harder to find a good fit for the model, and may additionally violate the assumption of binary responses.
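
For concreteness, a minimal sketch of the binning step (spike times assumed to be in seconds; the names and interface are ours for illustration). Clipping counts to one is only benign when the bin is short enough that multiple spikes per neuron per bin are rare:

import numpy as np

def binarize(spike_times_per_cell, t_start, t_stop, bin_width_s=0.020):
    # Convert per-neuron arrays of spike times into binary response words:
    # one row per time bin, one column per neuron.
    edges = np.arange(t_start, t_stop + bin_width_s, bin_width_s)
    words = np.zeros((len(edges) - 1, len(spike_times_per_cell)), dtype=np.int8)
    for j, spikes in enumerate(spike_times_per_cell):
        counts, _ = np.histogram(spikes, bins=edges)
        words[:, j] = np.clip(counts, 0, 1)  # >1 spike per bin breaks binarity
    return words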

A number of avenues present themselves for future development of decoding algorithms. Firstly, algorithms for reducing model dimensionality without losing discriminatory power may prove advantageous. These may include graph and hypergraph theoretic techniques (Aghagolzadeh et al. 2010) for pruning out uninformative dimensions (edges and nodes), and factor analysis methods for modeling conditional dependencies (Santhanam et al. 2009). Such an approach may be particularly advantageous when experimental trials are limited, as the dimensionality of the parameter set is the main reason for Ising model performance not exceeding that of the Independent model for limited trials. One difficulty with the use of graph pruning approaches is that the usual pairwise correlation matrix of neural recordings, unlike the graph in many network analysis problems, tends not to be sparse. It is of course a functional, as opposed to synaptic, connectivity matrix, and one reason for this lack of sparseness is its symmetric nature. It has recently been proposed that the symmetry property of the Jij matrix can be relaxed in the context of the (non-equilibrium) kinetic Ising model, which also provides a convenient way to take into account space-time dependencies, or causal relationships (Hertz et al. 2010, Roudi & Hertz 2011). Use of the kinetic Ising model framework for decoding would appear to be an interesting future direction to pursue.
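
In one common formulation of the kinetic Ising model (see e.g. Roudi & Hertz 2011; the notation below is ours and is only a sketch), spins $s_i \in \{-1,+1\}$ are updated given the previous state via Glauber dynamics,
\[
P\big(s_i(t+1) \mid \mathbf{s}(t)\big) = \frac{\exp\big(s_i(t+1)\, H_i(t)\big)}{2\cosh H_i(t)},
\qquad
H_i(t) = h_i + \sum_j J_{ij}\, s_j(t),
\]
with no symmetry required of $J_{ij}$. Since each conditional is normalized locally (per neuron and per time step), no global partition function over all $2^N$ states arises.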

New experimental technologies are yielding increasingly high dimensional multivariate neurophysiological datasets, usually without concomitant increases in the duration of data that can be collected. However, there is some reason for optimism that we will be able to develop new data analysis methods capable of taking advantage of these data. Maximum entropy approaches to the fitting of structured parametric models such as the Ising model and its extensions would appear to be one approach likely to yield progress.

Acknowledgements

We thank Phil Bream, Helene Seiler and Yang Zhang for their contributions to earlier work leading up to that reported here, and Aman Saleem for useful discussions and comments on this manuscript. We also thank Yasser Roudi for useful comments on the TAP equations, and Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese for useful MATLAB code and discussion of the MPFL technique. This work was supported by EPSRC grant EP/E002331/1 to SRS.

References

Aghagolzadeh, M., Eldawlatly, S. & Oweiss, K. (2010), 'Synergistic Coding by Cortical Neural Ensembles', IEEE Trans Inf Theory 56(2), 875–899.

Baddeley, R., Abbott, L. F., Booth, M. C., Sengpiel, F., Freeman, T., Wakeman, E. A. & Rolls, E. T. (1997), 'Responses of neurons in primary and inferior temporal visual cortices to natural scenes', Proc Biol Sci 264(1389), 1775–83.

Bair, W. & Koch, C. (1996), 'Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey', Neural Comput 8(6), 1185–202.

Bair, W., Zohary, E. & Newsome, W. T. (2001), 'Correlated firing in macaque visual area MT: time scales and relationship to behavior', J Neurosci 21(5), 1676–97.

Berger, A. L., Pietra, V. J. D. & Pietra, S. A. D. (1996), 'A maximum entropy approach to natural language processing', Comput. Linguist. 22(1), 39–71.

Bi, G.-q. & Poo, M.-m. (2001), 'Synaptic modification by correlated activity: Hebb's postulate revisited', Annual Review of Neuroscience 24(1), 139–166.

Bishop, C. M. (2007), Pattern Recognition and Machine Learning (Information Science and Statistics), 1st ed. 2006, corr. 2nd printing edn, Springer.

Broderick, T., Dudík, M., Tkačik, G., Schapire, R. E. & Bialek, W. (2007), 'Faster solutions of the inverse pairwise Ising problem', arXiv:0712.2437v2.

Butts, D. A., Weng, C., Jin, J., Yeh, C.-I., Lesica, N. A., Alonso, J.-M. & Stanley, G. B. (2007), 'Temporal precision in the neural code and the timescales of natural vision', Nature 449(7158), 92–5.

Carr, C. E. (1993), 'Processing of temporal information in the brain', Annu Rev Neurosci 16, 223–43.

Chen, Y., Geisler, W. S. & Seidemann, E. (2006), 'Optimal decoding of correlated neural population responses in the primate visual cortex', Nat Neurosci 9(11), 1412–20.

Das, A. & Gilbert, C. D. (1999), 'Topography of contextual modulations mediated by short-range interactions in primary visual cortex', Nature 399, 643–4.

Ecker, A. S., Berens, P., Keliris, G. A., Bethge, M., Logothetis, N. K. & Tolias, A. S. (2010), 'Decorrelated Neuronal Firing in Cortical Microcircuits', Science 327(5965), 584–587.

Földiák, P. (1993), The 'ideal homunculus': statistical inference from neural population responses, in 'Computation and Neural Systems', Kluwer Academic Publishers, Norwell, MA, pp. 55–60.

Gawne, T. J., Kjaer, T. W., Hertz, J. A. & Richmond, B. J. (1996), 'Adjacent visual cortical complex cells share about 20% of their stimulus-related information', Cereb Cortex 6(3), 482–9.

Hertz, J., Roudi, Y., Thorning, A., Tyrcha, J., Aurell, E. & Zeng, H. L. (2010), 'Inferring network connectivity using kinetic Ising models', BMC Neuroscience 11(Suppl 1), P51.

Higham, N. J. (2002), 'Computing the nearest correlation matrix–a problem from finance', IMA Journal of Numerical Analysis 22(3), 329.

Huang, F. & Ogata, Y. (2001), 'Comparison of two methods for calculating the partition functions of various spatial statistical models', Australian & New Zealand Journal of Statistics 43(1), 47–65.

Jaynes, E. T. (1957), 'Information Theory and Statistical Mechanics', Phys. Rev. 106(4), 620–630.

Kappen, H. J. & Rodríguez, F. B. (1998), 'Efficient Learning in Boltzmann Machines Using Linear Response Theory', Neural Computation 10(5), 1137–1156.

Kohn, A. & Smith, M. A. (2005), 'Stimulus dependence of neuronal correlation in primary visual cortex of the macaque', J Neurosci 25(14), 3661–73.

Kohn, A., Zandvakili, A. & Smith, M. A. (2009), 'Correlations and brain states: from electrophysiology to functional imaging', Curr Opin Neurobiol 19(4), 434–8.

Macke, J. H., Berens, P., Ecker, A. S., Tolias, A. S. & Bethge, M. (2009), 'Generating Spike Trains with Specified Correlation Coefficients', Neural Computation 21(2), 397–423.

Mak, J. N. & Wolpaw, J. R. (2009), 'Clinical Applications of Brain-Computer Interfaces: Current State and Future Prospects', IEEE Rev Biomed Eng 2, 187–199.

Montani, F., Kohn, A., Smith, M. A. & Schultz, S. R. (2007), 'The Role of Correlations in Direction and Contrast Coding in the Primary Visual Cortex', J. Neurosci. 27(9), 2338–2348.

Nase, G., Singer, W., Monyer, H. & Engel, A. K. (2003), 'Features of Neuronal Synchrony in Mouse Visual Cortex', J Neurophysiol 90(2), 1115–1123.

Niell, C. M. & Stryker, M. P. (2008), 'Highly Selective Receptive Fields in Mouse Visual Cortex', J. Neurosci. 28(30), 7520–7536.

Ogata, Y. & Tanemura, M. (1984), 'Likelihood analysis of spatial point patterns', Journal of the Royal Statistical Society, Series B 46(3), 496–518.

Ohiorhenuan, I. E., Mechler, F., Purpura, K. P., Schmid, A. M., Hu, Q. & Victor, J. D. (2010), 'Sparse coding and high-order correlations in fine-scale cortical networks', Nature 466(7306), 617–621.

Oram, M. W., Földiák, P., Perrett, D. I. & Sengpiel, F. (1998), 'The "Ideal Homunculus": decoding neural population signals', Trends in Neurosciences 21(6), 259–265.

Panzeri, S. & Schultz, S. R. (2001), 'A unified approach to the study of temporal, correlational, and rate coding', Neural Comput 13(6), 1311–49.

Panzeri, S., Schultz, S. R., Treves, A. & Rolls, E. T. (1999), 'Correlations and the encoding of information in the nervous system', Proceedings of the Royal Society of London, Series B: Biological Sciences 266(1423), 1001–1012.

Panzeri, S., Treves, A., Schultz, S. & Rolls, E. T. (1999), 'On decoding the responses of a population of neurons from short time windows', Neural Comput 11(7), 1553–77.

Petersen, R. S., Panzeri, S. & Diamond, M. E. (2001), 'Population Coding of Stimulus Location in Rat Somatosensory Cortex', Neuron 32(3), 503–514.

Plefka, T. (2006), 'Expansion of the Gibbs potential for quantum many-body systems: General formalism with applications to the spin glass and the weakly nonideal Bose gas', Phys. Rev. E 73(1), 016129.

Pola, G., Thiele, A., Hoffmann, K. & Panzeri, S. (2003), 'An exact method to quantify the information transmitted by different mechanisms of correlational coding', Network: Computation in Neural Systems 14(1), 35–60.

Reich, D. S., Mechler, F. & Victor, J. D. (2001), 'Independent and Redundant Information in Nearby Cortical Neurons', Science 294(5551), 2566–2568.

Renart, A., de la Rocha, J., Bartho, P., Hollender, L., Parga, N., Reyes, A. & Harris, K. D. (2010), 'The asynchronous state in cortical circuits', Science 327(5965), 587–90.

Roudi, Y., Aurell, E. & Hertz, J. A. (2009), 'Statistical physics of pairwise probability models', Frontiers in Computational Neuroscience 3:22, 1–15.

Roudi, Y. & Hertz, J. (2011), 'Mean field theory for non-equilibrium network reconstruction', Physical Review Letters 106, 048702.

Roudi, Y., Nirenberg, S. & Latham, P. E. (2009), 'Pairwise Maximum Entropy Models for Studying Large Biological Systems: When They Can Work and When They Can't', PLoS Comput Biol 5(5), e1000380.

Roudi, Y., Tyrcha, J. & Hertz, J. (2009), 'Ising model for neural data: Model quality and approximate methods for extracting functional connectivity', Phys. Rev. E 79(5), 051915.

Samonds, J. M. & Bonds, A. B. (2005), 'Gamma oscillation maintains stimulus structure-dependent synchronization in cat visual cortex', J Neurophysiol 93(1), 223–36.

Santhanam, G., Yu, B. M., Gilja, V., Ryu, S. I., Afshar, A., Sahani, M. & Shenoy, K. V. (2009), 'Factor-analysis methods for higher-performance neural prostheses', J Neurophysiol 102(2), 1315–30.

Schneidman, E., Berry II, M. J., Segev, R. & Bialek, W. (2006), 'Weak pairwise correlations imply strongly correlated network states in a neural population', Nature 440(7087), 1007–1012.

Schultz, S. R., Kitamura, K., Post-Uiterweer, A., Krupic, J. & Häusser, M. (2009), 'Spatial Pattern Coding of Sensory Information by Climbing Fiber-Evoked Calcium Signals in Networks of Neighboring Cerebellar Purkinje Cells', J. Neurosci. 29(25), 8005–8015.

Schultz, S. R. & Panzeri, S. (2001), 'Temporal correlations and neural spike train entropy', Phys Rev Lett 86(25), 5823–6.

Seiler, H., Zhang, Y., Saleem, A., Bream, P., Apergis-Schoute, J. & Schultz, S. R. (2009), 'Maximum entropy decoding of multivariate neural spike trains', BMC Neuroscience 10(Suppl 1), P107.

Shlens, J., Field, G. D., Gauthier, J. L., Greschner, M., Sher, A., Litke, A. M. & Chichilnisky, E. J. (2009), 'The Structure of Large-Scale Synchronized Firing in Primate Retina', J. Neurosci. 29(15), 5022–5031.

Shlens, J., Field, G. D., Gauthier, J. L., Grivich, M. I., Petrusca, D., Sher, A., Litke, A. M. & Chichilnisky, E. J. (2006), 'The Structure of Multi-Neuron Firing Patterns in Primate Retina', J. Neurosci. 26(32), 8254–8266.

Sohl-Dickstein, J., Battaglino, P. & DeWeese, M. R. (2009), 'Minimum Probability Flow Learning', arXiv:0906.4779v2.

Tanaka, T. (1998), 'Mean-field theory of Boltzmann machine learning', Phys. Rev. E 58(2), 2302–2310.

Thouless, D. J., Anderson, P. W. & Palmer, R. G. (1977), 'Solution of solvable model of a spin glass', Philos Mag 35(3), 593–601.

Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G. & Vaughan, T. M. (2002), 'Brain-computer interfaces for communication and control', Clin Neurophysiol 113(6), 767–91.

Zohary, E. & Shadlen, M. (1994), 'Correlated neuronal discharge rate and its implications for psychophysical performance', Nature 370(6485), 140–143.


Figure 1: Tuning curves for a neural ensemble of 50 cells for different heterogeneity parameters γ, ξ. The spiking rates shown correspond to the transient rates. a Tuning curves for the basic model, γ = ξ = 0. b Tuning curves with heterogeneous firing rates (γ = 1), but fixed tuning widths (ξ = 0) (NB the tuning width is defined relative to the spontaneous activity). c Tuning curves with heterogeneous tuning widths (ξ = 1), but fixed firing rates (γ = 0). d Tuning curves for the fully heterogeneous scenario (γ = ξ = 1).



Figure 2: Neural ensemble responses simulated for the basic model (γ = ξ = 0) for 50 cells. a Cell-by-cell correlation matrices (Pearson coefficients) for 4 stimuli (computed with 100000 samples); diagonal terms set to zero for visualization purposes only. b Simulated population neural responses (cell versus trial) over 1000 trials for two different stimuli, with black indicating a spiking response from a neuron on a given trial.



Figure 3: Performance of decoding algorithms for the basic model. a Fraction of correct decodings versus neural ensemble size, for a training dataset size of 9000 samples per stimulus. b The relationship between fraction correct and mutual information I(s, ŝ) for varying stimulus set and ensemble sizes (C = {50, 80, 110, 140, 170, 200}, varying from bottom left to top right for each symbol type). Triangles denote the performance of the Independent decoder, squares the TAP Ising decoder with the diagonal weight trick. c The dependence of decoding performance on stimulus set size for 70 cells. i TAP, TAPwd, Independent and rMPFL decoders compared to random selection of stimuli. This is replotted in ii as the gain in fraction correct over chance performance, η = pdec/pguess, making the performance saturation for the Independent decoder as problem difficulty increases more apparent. d Dependence of decoder performance on training set sample size, for 4 stimuli and a neural ensemble size of 70. i Fraction correct as a function of the number of training samples; below 450 training samples the Ising decoder fails to better the Independent decoder. ii Relation between the mean relative covariance error E, as a measure of finite sampling effects, and fraction correct.



Figure 9: Decoder confusion matrices. The case illustrated is for 16 stimuli and a 200-neuron ensemble, under the basic model (γ = ξ = 0). Shown is the relative frequency, as an approximation to the conditional probability, with which each stimulus is decoded, for 10000 stimulus presentations. a Ising model decoder (TAPwd). b Independent decoder. c The difference between a and b.
