PHYSICAL REVIEW E    VOLUME 55, NUMBER 6    JUNE 1997

Noise and randomlike behavior of perceptrons: Theory and application to protein structure prediction

M. Compiani
Dipartimento di Scienze Chimiche, Università di Camerino, 62032 Camerino MC, Italy

P. Fariselli and R. Casadio
Laboratorio di Biofisica, Dipartimento di Biologia, Università di Bologna, 40126 Bologna, Italy

(Received 30 August 1995; revised manuscript received 27 September 1996)

In the first part of this paper we study the performance of a single-layer perceptron that is expected to classify patterns into classes in the case where the mapping to be learned is corrupted by noise. Extending previous results concerning the statistical behavior of perceptrons, we distinguish two mutually exclusive kinds of noise (I noise and R noise) and study their effect on the statistical information that can be drawn from the output. In the presence of I noise, the learning stage results in the convergence of the output to the probabilities that the input occurs in each class. R noise, on the contrary, perturbs the learning of probabilities to the extent that the performance of the perceptron deteriorates and the network becomes equivalent to a random predictor. We derive an analytical expression for the efficiency of classification of inputs affected by strong R noise. We argue that, from the standpoint of the efficiency score, the network is equivalent to a device performing biased random flights in the space of the weights, which are ruled by the statistical information stored by the network during the learning stage. The second part of the paper is devoted to the application of our model to the prediction of protein secondary structures, where one has to deal with the effects of R noise. Our results are shown to be consistent with data drawn from experiments and simulations of the folding process. In particular, the existence of coding and noncoding traits of the protein is properly rationalized in terms of R-noise intensity. In addition, our model provides a justification of the seeming existence of a relationship between the prediction efficiency and the amount of R noise in the sequence-to-structure mapping. Finally, we define an entropylike parameter that is useful as a measure of R noise. [S1063-651X(97)02004-7]

PACS number(s): 87.10.+e

I. INTRODUCTION

The present paper belongs to the mainstream of research that focuses on the statistical aspects of learning in neural networks (for a review see [1,2]); our main scope is to clarify their capability to detect statistical information in the learning set. Emphasis on these aspects is motivated by the consideration that in many cases of practical interest neural networks detect statistical features of the problem under study. We limit our investigation to the simplest feed-forward neural networks, viz., the single-layer perceptrons, on which analytical considerations can be carried out with relative ease. From now on, for simplicity, the single-layer perceptron will be referred to as the perceptron.

Our starting point is the notion that pattern overlap is simultaneously the strength and the weakness of perceptrons used as classifiers. In point of fact, overlap promotes the classification of never-seen-before inputs but, at the same time, generates noise [3] that limits the accuracy of classification. Generally, neural networks are used to build an artificial mapping linking the end states of processes that are too complex to be extensively simulated or theoretically investigated.

In this context, besides using the network as a black box, it may be desirable to fully exploit the information captured by the network during the training stage. In view of the application of the network to an unknown mapping, these pieces of information may help to gain better insight into the problem at hand. To this aim we devote the first part of the paper to an inquiry into the relationships between the statistical information extracted by the network and the characteristics of some test mappings. The very notion of a perceptron as a device sensitive to statistical features alludes to the capability of feed-forward nets to record information in the form of Bayesian probabilities [3–5]. However, the storage of probabilistic information can be perturbed by a kind of noise (R noise) that has not been taken into account so far. We suggest that for patterns substantially affected by R noise, the network can be likened to a random classifier as far as the efficiency of prediction is concerned. The performance can be evaluated by modeling the computation of each output neuron as a random walk in the space of the weights, where the probability of the individual steps is dictated by the statistical information stored in the weights converging onto the output neuron at hand. Under these conditions the perceptron is said to operate in randomlike mode. The main part of the paper is devoted to the full characterization of the randomlike behavior of a perceptron as it is faced with a noisy mapping.

A case in point is the primary-to-secondary structure mapping of proteins, which is studied by means of perceptrons in the second part of the present work. As a matter of fact, a more specific motivation for the present investigation is the urgent need for a clarification of the limiting factors that affect the efficiency of prediction of protein secondary structures [3,6–8]. Some puzzling problems arise in the context of this application and demand proper explanation: on the one hand, the finding that perceptrons are as effective as predictors based on statistical methods and, on the other hand, the sensitivity of the network to the number of examples in each class (i.e., approximately the frequency of occurrence of inputs in each class) rather than to specific patterns.

The plan of the paper is the following. In Sec. II we describe the mapping the network is expected to learn. In Sec. III we classify the sorts of noise that potentially affect the unknown mapping. We then explain the appearance of a randomlike component, which in Sec. IV is modeled as a set of concurrent random walks, and arrive at an analytical expression for the probability that the network predicts each class. In Sec. V we specialize our considerations to the prediction of secondary structures of proteins [7,9–11]. Data from our previous experiments in this area are then used to make a semiquantitative check of the theoretical predictions of Sec. IV. The results of our simulations suggest the introduction of an entropylike measure of the single-pattern ambiguity, as well as a global measure of the intensity of the noise affecting the sequence-to-structure mapping. Finally, in the concluding section, we point out the bearings of our model on the folding code and the folding mechanism of globular proteins. Moreover, the limited performance of the network is traced back to the intensity of noise and to the noise-induced randomlike behavior of the perceptron. This allows us to draw conclusions as to the optimal learning strategy in the presence of noise.

II. GENERAL FEATURES OF THE CLASSIFICATION TASK

The general task we are considering consists of the reconstruction of a mapping which classifies symbolic patterns into the appropriate class. The classification discriminates among n_cl classes, which will be labeled with Greek letters α, β, γ, . . . . More formally, the mapping M consists of the set of associations P_j → c_k from a space of objects (also referred to as input patterns) P = {P_j} to an n_cl-dimensional space of classes C = {α, β, γ, . . .}. The objects P_j are strings of W letters drawn from an alphabet A of N_S symbols; formally, P_j = {l_j^1, l_j^2, . . . , l_j^W}. By a subpattern of P_j we mean any subset of P_j; patterns P_k and P_l sharing subpatterns of any length, i.e., such that for some i, l_k^i = l_l^i, are said to be overlapping.

The class c_k refers to the letter falling in the central position within the input window of size W. The training set M is given in the form of two corresponding sequences S_P = {l_1, l_2, l_3, . . .} (l_j ∈ A) and S_C = {c_1, c_2, c_3, . . .}, c_i ∈ C. The input window is shifted letter by letter until the whole sequence S_P has been scanned. In any case the theory developed in the sequel is by no means restricted to this specific rule of pattern production. As far as the single-letter code is concerned, there exists a 1:1 correspondence between any letter l_i and the components of an N_S-dimensional binary vector, having all components set to zero but the one corresponding to the desired letter (orthonormal code). This code lends itself to creating an unbiased correspondence between the set of the possible letters and a set of labels, each having non-nil overlap only with itself. On the whole there are W × N_S input neurons that take discrete values {0,1} according to whether they are activated or not. Each class is represented by a single neuron in the output layer. Output neurons have continuous, real-valued activations ranging in ]0,1[; collectively they form an n_cl-dimensional output vector o = (o_1, o_2, . . . , o_ncl) (see the Appendix). We use the winner-take-all strategy (see the Appendix) to extract the final classification from the actual output of the network. The weights of the network are randomly initialized and iteratively corrected by the standard error backpropagation algorithm (see the Appendix). These specifics are sufficient to introduce the general arguments and the model of Secs. III and IV. For a more complete description of the architecture of the network within the context of the prediction of protein secondary structures, the reader is referred to the Appendix.
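For concreteness, the input coding and the decision rule just described can be sketched as follows (this sketch is ours, not part of the original paper; the 20-letter amino acid alphabet, the window of W = 17, and the class names are the illustrative choices of the application in Sec. V and the Appendix):

    import numpy as np

    # Orthonormal (one-hot) input code and winner-take-all readout.
    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"        # N_S = 20 letters
    NS, W = len(ALPHABET), 17

    def encode_window(window):
        """Map a W-letter window to a binary vector with W*NS components."""
        x = np.zeros(W * NS)
        for pos, letter in enumerate(window):
            x[pos * NS + ALPHABET.index(letter)] = 1.0   # one component per (position, letter)
        return x

    def winner_take_all(outputs, classes=("alpha", "beta", "coil")):
        """Assign the class whose output neuron has the highest activation."""
        return classes[int(np.argmax(outputs))]

    x = encode_window("AAAAAAAALVFFAEDVG")               # an arbitrary 17-letter window
    print(x.shape, int(x.sum()))                          # (340,) 17
    print(winner_take_all([0.21, 0.05, 0.64]))            # 'coil'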

III. CLASSIFICATION OF NOISE

To make this paper self-contained, it is convenient to summarize the results of our previous work. Following [3] we distinguish two kinds of noise, which we term intrinsic noise (I noise) and representational noise (R noise). I noise arises when the training set contains nonoverlapping ambiguous patterns, i.e., patterns that on distinct occurrences are classified in different classes. This kind of noise is due to the inherent ambiguity of the mapping M (the supervisor). I noise usually reflects the inadequate size of the input window, such that typical markers of some patterns are missed.

R noise is a side effect of the single-letter orthonormal input code, in that it arises as a consequence of the overlap among patterns associated with different classes. It therefore happens that the same subpattern is alternatively classified in different classes as it occurs within different input patterns. Concisely, R noise is none other than I noise that affects one or more subpatterns rather than the whole pattern.

It follows that on processing an input pattern affected by R noise (ambiguous pattern), the perceptron has to weigh the simultaneous contributions of subpatterns that are no longer unanimous in indicating the same class. This kind of ambiguity eventually gives rise to the randomlike behavior of the network that is described in Sec. III A. Clearly, a prerequisite for R noise to be present is the mixing of subpatterns; in the limit of increasing R-noise intensity, correlations among the letters forming the input patterns are progressively weakened (see, for instance, Sec. V). Let us remark that there may be overlap without R noise, when all the subpatterns point to the same class. Finally, it is worth noting that the occurrence of R noise depends on the input code: actually, one can always devise a new input code that nullifies overlap between the representations of any two patterns, which, consequently, might be affected only by I noise. However, it would be erroneous to conclude that it is desirable to get rid of R noise; overlap, in fact, is the very basis of the generalization capability of perceptrons (as well as of other sorts of neural networks). In the following, in the absence of any specification, we shall use the term noise to refer to the joint effect of I noise and R noise.
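The distinction can be made concrete on a toy training set. The sketch below is an illustration of ours (not from the paper); for simplicity it uses windows of length 3 and subpatterns of length 1, and flags I noise as a whole window assigned to more than one class and R noise as a letter that, at the same window position, occurs in windows assigned to different classes.

    from collections import defaultdict

    # No whole window is repeated, so there is no I noise; yet windows assigned to
    # different classes share letters at the same positions, which is R noise.
    training_set = [
        (("Q", "L", "K"), "alpha"),
        (("Q", "L", "V"), "coil"),    # overlaps with the first window at positions 0 and 1
        (("A", "G", "K"), "beta"),    # overlaps with the first window at position 2
    ]

    whole, sub = defaultdict(set), defaultdict(set)
    for window, cls in training_set:
        whole[window].add(cls)
        for pos, letter in enumerate(window):
            sub[(pos, letter)].add(cls)

    print("I-noisy windows:    ", [p for p, c in whole.items() if len(c) > 1])   # []
    print("R-noisy subpatterns:", [s for s, c in sub.items() if len(c) > 1])     # (0,'Q'), (1,'L'), (2,'K')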

A. Noise intensity and perceptron’s response

It is clear that were the mapping noiseless and were the weights initially set to zero, the unambiguous rule of association of any pattern P to class c_i would cause the perceptron to generate asymptotically a binary 0/1 output, the only output neuron with unit activation corresponding to class c_i. Distinguishing patterns as ambiguous or unambiguous is feasible if the mapping is known; yet this is usually not the case and, in addition, it may be desirable to replace the all-or-none distinction with a new continuous criterion. The idea is to rank patterns according to the distance of their corresponding outputs from the δ-like output of strictly unambiguous patterns. A reliability scale thus obtains which is useful to discriminate between reliable and unreliable patterns. Qualitatively, the output vector o = (o_1, o_2, . . .) of a reliable pattern is strongly peaked on some class c_i, i.e., o_k ≈ δ_ik, whereas the typical output of unreliable patterns exhibits a non-negligible spread of the activations over several output neurons. A measure of pattern reliability is introduced in Sec. V and generalized in Sec. V A.

Now we turn to characterizing the modes of operation of the perceptron as a function of the intensities of I noise and R noise. When the mapping is affected by I noise the meaning of the output o is susceptible to analytical investigation [4,5,12]. It turns out that under batch updating (see the Appendix) the backpropagation algorithm ensures the condition o_i^P → ν(c_i|P) ≈ p(c_i|P), where ν and p are the relative frequency and, respectively, the probability of finding pattern P in class c_i. In the rest of the paper we use the superscript P whenever it is necessary to emphasize the dependence of the parameter in question on the pattern P. When the corrections of the weights are accomplished according to the alternative pattern updating procedure (see the Appendix), the convergence theorem does not apply; nonetheless we have experimentally verified that the learning algorithm still reproduces pattern probabilities (see Fig. 1). Thus, in the event the network is faced with a mapping affected by I noise, we are allowed to define a first mode of operation of the perceptron that will be referred to as the Bayesian or pattern-sensitive mode, since the information stored by the network relates to the specific input pattern.
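This convergence under pattern updating can be reproduced with a minimal numerical experiment in the spirit of Fig. 1. The sketch below is our own (the learning rate, number of presentations, and class probabilities are arbitrary): one fixed input pattern is labeled "alpha" with probability 0.7 and "beta" with probability 0.3 (pure I noise), and the two outputs drift toward those frequencies.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out, lr = 4, 2, 0.02
    W = rng.uniform(-0.01, 0.01, (n_out, n_in))
    theta = np.zeros(n_out)
    x = np.array([1.0, 0.0, 0.0, 0.0])      # one fixed (ambiguous) input pattern
    p_true = np.array([0.7, 0.3])            # its labeling frequencies

    def forward(x):
        return 1.0 / (1.0 + np.exp(-(W @ x - theta)))     # sigmoid of the local field

    for _ in range(50_000):
        d = np.zeros(n_out)
        d[rng.choice(n_out, p=p_true)] = 1.0               # noisy supervisor
        o = forward(x)
        delta = (o - d) * o * (1.0 - o)                    # gradient of (o - d)^2
        W -= lr * np.outer(delta, x)
        theta += lr * delta

    print(forward(x))                                       # close to [0.7, 0.3]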

FIG. 1. Comparison of pattern frequencies with the outputs of a perceptron that has completed a 100-cycle training phase with orthonormal code and pattern updating. The plot shows that the activation levels o_α^P (open circles) of a generic output neuron α asymptotically approach the relative frequencies ν_α^P (filled circles) of each input pattern P. The training set comprises 20 different patterns that are conventionally indicated with letters on the abscissa. Analogous behavior is exhibited by all of the output neurons, irrespective of their number.

Pure R noise is also a cause of unreliable classifications, although the probabilistic meaning of the network's output can no longer be deciphered so easily, since R noise interferes with the storage of Bayesian probabilities. As a matter of fact, it has been noted that the estimation of Bayesian probabilities is better when one output dominates over the others [5] (this is the case of reliable patterns). The perturbing effect of R noise can be seen most clearly in the limiting case of strong R noise. Expectedly, p(c_i|P) → p(c_i), where p(c_i) is the probability that class c_i is predicted oblivious of the pattern P.

We can simulate this case by resorting to a random mapping that is defined by associating S_P (see Sec. II) with a random sequence S_C. Randomization has the ultimate effect of maximizing the intensity of R noise. A realization of this experiment on a version of the general problem described in Sec. II is illustrated in Fig. 2, with C = {α, β, γ}. S_P is the set of amino acid sequences introduced in Sec. V, where it is argued that it provides a well mixed set of symbolic subpatterns. S_C is generated by sampling the set C of the possible classes according to assigned a priori probabilities p_α^A, p_β^A, p_γ^A. First, we order the input patterns sequentially by means of a label λ and plot the outputs o_i^λ as a function of λ. If we now smooth out the resulting rugged curve by calculating local averages ⟨o_i⟩, we get the plot of Fig. 2. The interesting outcome is that although the network is no longer able to record any statistical regularities per individual pattern, the ⟨o_i⟩ reflect the extant piece of information carried by M, i.e., the relative abundance of examples per each class in the training set, p_α^A, p_β^A, p_γ^A.

FIG. 2. Dependence of the average outputs on the composition in classes of the training set. The plot shows the output of a three-output perceptron upon presentation of the patterns of the training set, after completion of the learning phase on a random mapping (following the usual procedure described in the Appendix). To perform this test we broke up the original letter sequence S_P into 62 traits (indicated with numerals on the abscissa) and calculated the average activation ⟨o_i⟩, i = α, β, γ, of the output neurons over each trait (marked, respectively, as open circles, open triangles, and filled circles). The traits correspond to the 62 proteins composing the training set L62 described in the Appendix. The plots show small fluctuations around mean values that closely reflect the probabilities p_i^A with which the random assignment to each class has been made in building up the training set. In this example, p_α^A = 0.25, p_β^A = 0.22, p_γ^A = 0.53. The average values and standard deviations (evaluated over the 62 traits of the training set) turn out to be ⟨o_α⟩ = 0.24 ± 0.01, ⟨o_β⟩ = 0.23 ± 0.01, ⟨o_γ⟩ = 0.52 ± 0.01.


The limiting case just discussed illustrates a mode of classification (randomlike mode) which is antithetical to the Bayesian mode. The most striking departure from the Bayesian mode consists in the network exhibiting, in a sense, insensitivity to the input pattern and a critical sensitivity to the composition in classes of the training set. To define the randomlike mode it is convenient to introduce the notion of a random predictor, that is, a device which makes random classifications according to a probability density function (PDF) independent of the current input. The PDF specifies the probability p_i^H, i ∈ [1, n_cl], with which the predictor associates the input pattern with class c_i.

A consistent definition of the efficiency of the network as a random predictor is the probability P that the random prediction is correct:

    P = \sum_{i=1}^{n_{cl}} p_i^A p_i^H ,    (1)

where the p_i^A, i ∈ [1, n_cl], specify the actual distribution of structures in the test set. Clearly, Eq. (1) provides an alternative expression of the prediction score Q whose general definition (see the Appendix), at any rate, applies to any kind of predictor.

Following [3] we argue that the perceptron working on a random mapping is equivalent to a random predictor, in some average sense to be discussed shortly. Strictly speaking, this equivalence is nonsense since the network is deterministic; instead, it makes sense with the proviso that it holds on average and from the point of view of the success score Q [expressed as in Eq. (1)].

To make this equivalence clear, let us suppose that many different training sets are formed by changing the individual patterns but keeping fixed their composition p_α^A, p_β^A, p_γ^A and the total number of patterns N. Now we train the perceptron on them separately and measure the efficiency Q on the same testing set. Changing the training set hardly affects the Q value of the perceptron, provided the learning set is sufficiently large [5,7]. Otherwise stated, Q will undergo minor fluctuations around an average value; however, and herein lies the insensitivity to the input pattern, the unreliable patterns that are properly classified or misclassified vary upon changing the training set. The reason for this is that R noise has an unpredictable (i.e., training-set-dependent) influence on the classification of the unreliable patterns. In conclusion, only Q and the number and identity of the reliable patterns are invariant with respect to the particular choice of the training set. On the contrary, as far as the unreliable patterns are concerned, only the number of the correctly classified inputs is constant on average. This implies that the exact course of the unaveraged and rugged curve from which Fig. 2 has been derived is unpredictable, whereas the smoothed curve preserves the same global information irrespective of the training set. For a perceptron the probabilities p_i^H are not given a priori but are the outcome of the learning process. The experiment on the random mapping suggests that there should be some nonlinear function F_i relating p_i^H to the composition of the training set, i.e., p_i^H = F_i(p_α^A, p_β^A, p_γ^A). In Sec. IV we make assumptions on the randomlike mode of the perceptron in order to estimate F_i.

The experiment illustrated in Fig. 2 makes it apparent that the perceptron trained on the random mapping exhibits a purely randomlike behavior and that its efficiency turns out to conform to Eq. (1). Under these conditions we have ascertained that not only ⟨o_γ⟩ > ⟨o_α⟩ and ⟨o_γ⟩ > ⟨o_β⟩ but also o_γ^P > o_α^P and o_γ^P > o_β^P for every pattern P. This implies that the perceptron systematically classifies the input in the most frequent class γ, that is to say p_γ^H = 1; from Eq. (1) it ensues that P = p_γ^A = 0.53. On the other hand, the same value obtains if we calculate the value of Q as the fraction of correct guesses (see the Appendix).

For mappings that are partially affected by R noise it is reasonable to think that the overall behavior of the network is a hybrid of the Bayesian and the randomlike components. The Bayesian mode prevails in the classification of the N_r reliable patterns, on which noise has only a minor effect, while the randomlike mode takes over when the perceptron deals with the N_u unreliable patterns. Clearly N = N_r + N_u, where N is the total number of patterns in the test set. With the aim of splitting the efficiency Q into the contributions corresponding to the two modes, we slightly modify the previous partition N = N_r + N_u of the N patterns into a new partition N = N_1 + N_2. N_1 and N_2 are defined as the patterns that are correctly predicted with unit probability and, respectively, with probability P < 1. The two partitions are quite closely related to each other and would exactly coincide only in the limiting case of zero noise. This is suggested by the finding that patterns with increasing reliability are more and more surely assigned to the correct class, as is well exemplified below, in Table I. We take it that posing N_1 ≈ N_r is quite a good approximation, since the discrepancies between the two partitions involve a tiny fraction of the whole set of patterns. By way of example, R noise may lead to the inclusion of incorrect reliable patterns among the N_r reliable ones; this is likely to occur when the said test patterns have significant overlap with patterns of the training set belonging to incorrect structures. In the new partition this small set of patterns is removed from N_r and categorized in N_2. Similarly, the set of unreliable and correct patterns is shifted from the set of the N_u to the set of the N_1 patterns. With the aid of the new partition we represent the behavior of the network as a statistical mixture of the two modes:

    Q = \frac{N_1 + N_2 P}{N} ,    (2)

where P is the probability that the N_2 patterns are correctly classified when the perceptron operates as a randomlike predictor. Equation (2) shows that both the randomlike component and the Bayesian component contribute to the efficiency of the perceptron in varying proportion according to the level of noise. With maximum R noise, as in the conditions of the random mapping, the Bayesian component is entirely supplanted by the random component [Q → P as N_1 → 0 in Eq. (2)], whereas at zero R noise the Bayesian component dominates (Q → 1 as N_1 → N). It is clear that for Eq. (2) to be strictly valid we must consider the partition N_1/N_2; anyhow, in the sequel, we find it more convenient to use the approximation N_1 ≈ N_r, since dealing directly with reliable or unreliable patterns makes it easier to reason on the mechanism of computation in the presence of noise.


IV. A STOCHASTIC MODEL FOR THE QUASIRANDOM MODE OF THE PERCEPTRON

The PDF {p_i^H} characterizing the randomlike component of the perceptron establishes itself by successive approximations during the learning stage. The main goal of this section is the calculation of the p_i^H. This will be done by proposing a stochastic mechanism that simulates the randomlike mode of the perceptron as it classifies unreliable patterns.

We maintain that the N_u patterns are exclusively contributed by the strong mixing of subpatterns. The ground for this is that under the typical working conditions, where the generalization capabilities of perceptrons are stressed, the input window size W is much larger than the correlation length among subpatterns. Then we are allowed to think of the unreliable input patterns as being made up of noncorrelated subpatterns. Accordingly, upon presentation of an unreliable pattern, we consider the n weights w_ij (n = W) contributing to the activation of the ith output neuron as a random sample, where the probability that w_ij is picked up is specified by a distribution function f_i(w_ij); the distributions f_i and f_k (i ≠ k) are assumed to be independent. Thus the input-dependent part of the local field of each output neuron (see the Appendix) can be conceived as a random flight in the space of the weights terminating on the said neuron. The independent random walks therefore provide an effective mechanism for the synthesis of the N_u patterns and, to a good approximation, of the N_2 patterns.

The probabilities f_i(w_ij) are approximated by the histograms of the weights w_ij linking the jth input neuron to the ith output neuron. This approximation is mitigated by the consideration that the precise analytical form of the functions f_i(w_ij) is immaterial to our model since, as we shall see below, only the first two moments enter explicitly the final expression of {p_i^H}. The next step is the calculation of the sum F_i of the n weights for each class,

    F_i = \sum_j w_{ij} ,    (3)

which, in virtue of the binary input, represents the local field of the ith output neuron plus the threshold θ_i [see Eq. (A1) in the Appendix]. To this aim we think of F_i as a random walk of n steps, whose magnitudes are chosen according to the probability f_i(w_ij), with average ⟨w_ij⟩ and variance σ_i^2. Then the pertinent expression for the probability that the variable F_i takes on the value W_i is [13]

    \mathrm{Prob}\{F_i \in [W_i, W_i + dW_i]\} = (2\pi n\sigma_i^2)^{-1/2} \exp\left[-\frac{(W_i - n\langle w_{ij}\rangle)^2}{2n\sigma_i^2}\right] .    (4)

Therefore, assuming mutual independence of the variables W_i, the joint probability that the n weights sum to W_i (i ∈ [1, n_cl]) is

    p_0(\vec{W}) = \prod_{i=1}^{n_{cl}} (2\pi n\sigma_i^2)^{-1/2} \exp\left[-\frac{(W_i - n\langle w_{ij}\rangle)^2}{2n\sigma_i^2}\right] .    (5)


For a three-output perceptron with output labels i = α, β, γ, the probability that the network selects, say, the output α according to the winner-take-all rule (Appendix) is

    p_\alpha^H = \mathrm{Prob}\{W_\alpha - \theta_\alpha > W_\beta - \theta_\beta ,\; W_\alpha - \theta_\alpha > W_\gamma - \theta_\gamma\}
               = \int_{-\infty}^{+\infty} dW_\alpha \int_{-\infty}^{W_\alpha - \theta_\alpha + \theta_\beta} dW_\beta \int_{-\infty}^{W_\alpha - \theta_\alpha + \theta_\gamma} p_0(\vec{W})\, dW_\gamma .    (6)

Using Eq. (5), Eq. (6) can be cast in the form

    p_\alpha^H = \sqrt{\frac{q_\alpha q_\beta q_\gamma}{\pi^3}} \int_{-\infty}^{+\infty} dW_\alpha\, e^{-q_\alpha (W_\alpha - A)^2} \int_{-\infty}^{W_\alpha - \theta_\alpha + \theta_\beta} dW_\beta\, e^{-q_\beta (W_\beta - B)^2} \int_{-\infty}^{W_\alpha - \theta_\alpha + \theta_\gamma} dW_\gamma\, e^{-q_\gamma (W_\gamma - C)^2} ,    (7)

where A = n⟨w_α⟩, B = n⟨w_β⟩, C = n⟨w_γ⟩, and q_i = (2nσ_i^2)^{-1}.

A more convenient form of Eq. (7) can be obtained by means of the transformation W_α − A = λ, (W_β − B)√q_β = ξ, (W_γ − C)√q_γ = η:

    p_\alpha^H = \sqrt{\frac{q_\alpha}{\pi^3}} \int_{-\infty}^{+\infty} d\lambda\, e^{-q_\alpha \lambda^2} \left[\frac{\sqrt{\pi}}{2}\,\mathrm{erf}\big[(\lambda + A - B - \theta_\alpha + \theta_\beta)\sqrt{q_\beta}\big] + \int_{-\infty}^{0} e^{-\xi^2} d\xi\right] \left[\frac{\sqrt{\pi}}{2}\,\mathrm{erf}\big[(\lambda + A - C - \theta_\alpha + \theta_\gamma)\sqrt{q_\gamma}\big] + \int_{-\infty}^{0} e^{-\eta^2} d\eta\right] .    (8)

The calculations for p_β^H and p_γ^H run exactly in the same way, with a trivial permutation of the indices.

V. THE CASE OF PROTEIN STRUCTURE PREDICTION

We now apply the general considerations made so far to the task of predicting the secondary structures of proteins. The sequences S_P and S_C defining the mapping to be learned are the sequence of protein primary structures (residue sequences) and, respectively, the sequence of secondary structures assigned to each residue. The input window can arrange W letters (residues) at a time and is shifted along S_P as prescribed in Sec. II. Additional details concerning the protein structure task, other than those included in Sec. II, are relegated to the Appendix.

For the purpose of predicting protein secondary structures, the size of the input window, usually including 10–20 residues, is normally insufficient to account for long-range interactions, which play a significant role in determining the structure of unreliable subpatterns [14,15]. Moreover, we can assume that I noise is negligible, because the probability that the same pattern occurs twice is negligible in the available data sets [16].

R noise is well documented in the literature on prediction methods of protein structures (see [17], and references therein) and is connected to the pseudorandom distribution of several properties along the chain [16,18–20]. Hints to the quasirandom characteristics of protein sequences are to be found in the notion that real sequences have been described as "slightly edited random sequences" [20]. But quasirandomicity of sequences (caused by strong subpattern mixing) is only a prerequisite for R noise. The next ingredient is the substantial amount of unreliable patterns that, consistently with the definition given in Sec. III, are able to take different conformations depending on the particular environment. Characterization and counting of unreliable patterns, according to different criteria, are to be found, for example, in [14,16]. In [21] experimental evidence is presented that short equivocal sequences exhibit an intrinsic preference for a given secondary structure that may be overridden by environmental influences. Such interactions may also be provided by the solvent or by distant residues that turn out to have proximity relationships with the trait at hand within the three-dimensional structure. This is the case of the strong modulation of the β-sheet propensity by the tertiary context [22]. Finally, R noise is signalled indirectly by the finding that the protein structure mapping is characterized by weak correlations. This is suggested by the phenomenological evidence that multilayer perceptrons are no better than single-layer perceptrons at classifying protein secondary structures [23,24].

Different degrees of unreliability of the patterns can be distinguished by means of the reliability index r introduced in [6] in the framework of a neural network approach to protein secondary structure prediction. For each pattern P it is a real number ranging in the interval [0, 1[ and is defined as the absolute value of the difference between the two highest outputs o_1^P and o_2^P:

    r = |o_1^P - o_2^P| .    (9)

In this section our aim is to proceed to a semiquantitative test of the model of Sec. IV by using our predictions of α helices, β sheets, and random coils [7]. The weights of the perceptron after completion of the training stage (under the working conditions described in the Appendix) are used to construct three histograms, each one collecting the weights converging on one output neuron (Fig. 3).

From the histograms the parameters of the probability functions f_i(w_ij) used in Eq. (4) are easily derived: ⟨w_α⟩ = 0.10, σ_α^2 = 0.37; ⟨w_β⟩ = 0.01, σ_β^2 = 0.33; ⟨w_c⟩ = −0.16, σ_c^2 = 0.35. The thresholds θ_i have the following values: θ_α = 3.16, θ_β = 2.16, and θ_c = −2.77. Inserting these values in Eq. (8) and its analogs for the other structures, and fixing n = W = 17, we get p_α^H = 0.223, p_β^H = 0.116, p_c^H = 0.661. By means of Eq. (1) and the composition in structures p_α^A, p_β^A, and p_c^A of the learning (L) and test (T) sets (see the Appendix), we eventually estimate P_L = 0.432 and P_T = 0.417. Finally, by using the experimental value [7] of the performance, Q = 0.658 for the N = 11 361 residues of L62 (see the first row of Table I), we calculate N_r = 4520 and N_r/N = 39.8% from Eq. (2). The same procedure with Q = 0.628 for the N = 6634 residues of T33 leads to N_r = 2400 and N_r/N = 36.2%.
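These figures can be retraced step by step. The short check below is ours (not the authors' code): it plugs the p_i^H and the class compositions quoted above into Eq. (1) and inverts Eq. (2), with N_1 ≈ N_r, recovering values close to the N_r estimates just given.

    pH = {"alpha": 0.223, "beta": 0.116, "coil": 0.661}     # from Eq. (8), n = W = 17

    def random_score(pA):                                   # Eq. (1): P = sum_i pA_i * pH_i
        return sum(pA[c] * pH[c] for c in pH)

    def n_reliable(Q, P, N):                                # Eq. (2) with N2 = N - N1
        return N * (Q - P) / (1.0 - P)

    pA_L62 = {"alpha": 0.25, "beta": 0.22, "coil": 0.53}
    pA_T33 = {"alpha": 0.28, "beta": 0.22, "coil": 0.50}

    P_L, P_T = random_score(pA_L62), random_score(pA_T33)
    print(P_L, n_reliable(0.658, P_L, 11361))               # about 0.43 and 4500
    print(P_T, n_reliable(0.628, P_T, 6634))                # about 0.42 and 2400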

At this stage a semiquantitative check of our model of the random component can be made by comparing the above figures with the N_r value obtained by direct counting. In Table I we have sorted the patterns of the training set according to their reliability index r defined in Eq. (9) and computed the number of patterns N_k with r ≥ k (k = 0, 0.1, 0.2, . . . , 0.9). Then we have measured the efficiency Q_k by using the perceptron trained on the whole L62 to predict the N_k patterns.

Now, it has been argued [25] that there exists an upper bound Q ≈ 0.88 ± 0.09 beyond which there is no point in further improving on the efficiency of secondary structure prediction, since identically folded conformations leave room for about 12% fluctuations in the secondary structures. A reasonable conjecture is that those patterns that are predicted by the network with accuracy ≥ 0.88 are sufficiently reliable for them to have a strong intrinsic propensity toward the right structure. This property makes them eligible to correspond to the sought-for N_r reliable patterns. Posing N_r ≈ N(Q = 0.88) in Table I, we get for L62 a value N_r ≈ 2000 and for T33 N_r ≈ 700, which are in the same range as the above theoretical estimate. Note that the agreement is better if we take into account that perceptrons tested on this same task tend to miss about 5% in performance, so that the above threshold for Q is to be placed around 0.83 [23].

A. Noise and entropy of the protein structure mapping

In Sec. III A we have shown that the exact probabilistic meaning of the perceptron's output is corrupted by R noise, although some remnant of probabilistic information can still be found even in the limiting case of the randomized mapping. Though one expects that the output deviates from p(c_i|P), our experiments on real proteins show that the sum of the activities of the output neurons is very close to unity.

TABLE I. The patterns of a typical learning set L62 and test set T33 are here grouped according to their reliability index r [see Eq. (9)]. N indicates the number of patterns with r not smaller than the value in the first column. Q is the efficiency of prediction of each subset of patterns when the perceptron is trained on L62.

            L62                 T33
    r       Q        N          Q        N
    0       0.658    11361      0.628    6634
    0.1     0.695     9608      0.660    5610
    0.2     0.730     8043      0.690    4717
    0.3     0.763     6600      0.719    3909
    0.4     0.792     5311      0.739    3193
    0.5     0.820     4140      0.769    2449
    0.6     0.847     3010      0.788    1805
    0.7     0.872     1985      0.815    1219
    0.8     0.909     1126      0.869     688
    0.9     0.940      361      0.942     224


FIG. 3. Normalized histograms of the weights converging on each of the three output neurons of the perceptron used to predict α, β, and coil protein structures.

The same kind of regularity has been reported in [5] for quite different task domains and network architectures. This is remarkable since there is no explicit constraint for this to occur. The averaged sums of the outputs for our standard training and test sets L62 and T33 (see the Appendix) are, respectively, 1.05 ± 0.11 and 1.09 ± 0.06 (see Fig. 4).

This property helps in devising an entropylike measure for each pattern P, since the set of corresponding outputs can be viewed approximately as a scheme (in information-theoretic parlance [26]). The usual expression

    S^P = -\sum_{i=\alpha,\beta,\gamma} o_i^P \ln o_i^P    (10)

provides a measure of reliability per single pattern. Averaging S^P over the set of N patterns gives an average entropy of the mapping,

    S = \sum_P S^P / N .    (11)
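Both quantities are immediate to evaluate from the network outputs; the sketch below (ours, with made-up output vectors) computes S^P of Eq. (10) and its average, Eq. (11).

    import numpy as np

    def pattern_entropy(o):
        o = np.clip(np.asarray(o, dtype=float), 1e-12, None)
        return float(-np.sum(o * np.log(o)))                            # Eq. (10)

    def mapping_entropy(outputs):
        return float(np.mean([pattern_entropy(o) for o in outputs]))    # Eq. (11)

    print(pattern_entropy([0.90, 0.05, 0.05]))   # reliable pattern, about 0.39
    print(pattern_entropy([0.34, 0.33, 0.33]))   # unreliable pattern, about 1.10 (near ln 3)
    print(mapping_entropy([[0.90, 0.05, 0.05], [0.34, 0.33, 0.33]]))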

S can be viewed as a measure of noise intensity that, in our special condition of nearly zero I noise (Sec. III), can be used as an indicator of R-noise intensity. As a matter of fact, comparing the S values of the artificial random mapping and of the real protein mapping [7], we verify that S_random = 0.96, while S_L = 0.67 and S_T = 0.69 for L62 and T33, respectively. Our entropylike measure of the mapping reconstructed by the perceptron is also a useful indicator of the information extracted by the network during the training stage. Actually, it decreases from the initial value S = 0.96 (of the perceptron with randomly initialized weights) to the final values 0.67–0.69 of S_L and S_T mentioned above.

FIG. 4. The activations of the output units practically sum to one even in the presence of R noise. The plot shows the sum of the outputs (open circles) for the first 4000 patterns of a three-output perceptron performing on the test set of protein sequences T33 (see the Appendix).

Consistently with its meaning, S takes on smaller values for subsets with increasing values of r. For instance, the small subset with r = 0.9 (see Table I) has S = 0.30.

VI. DISCUSSION

The present investigation focuses on the behavior of perceptrons used to study noisy mappings within the general framework specified in Sec. II. Our main scope is to make more quantitative the general notion that noise blurs the information carried by each pattern and eventually leads to a decline of the efficiency of classification. We have explored the effect on the performance of the kind of noise that arises when patterns belonging to different classes exhibit significant overlap with pronounced mixing of subpatterns. The major result of the present paper is the definition of a randomlike regime of the perceptron, which is useful to estimate the performance as the network is confronted with noisy (unreliable) patterns. As far as the prediction score is concerned, we have argued that in the presence of high noise the perceptron works in randomlike mode and the deterministic computation performed by each output neuron is equivalent to a one-dimensional random walk on the available connections. The ideal conditions for observing the stochastic regime of the network are reproduced in the experiment of Fig. 2, where a perceptron has been trained on a randomized mapping.

A suitable benchmark for the stochastic model proposed in Sec. IV is provided in Sec. V by our experiments on protein structure prediction. The effect of noise on the performance Q is most clearly seen by comparing the scores as the perceptron is tested on the same task but with different noise intensities. Actually, the performance in the task of predicting real protein structures (see Table I) is Q_real ≈ 0.63–0.66, whereas in the prediction of the random mapping of Fig. 2 it reduces to Q_random = 0.53. The introduction of an average entropylike estimate of noise intensity in Sec. V A is a first step toward a quantitative correlation of noise and performance. The average entropy S is derived from the pattern entropy S^P, which is reminiscent of similar structural entropies used in the literature (see, for example, [14]). The interesting novelty is that S^P is directly evaluated from the output of the perceptron. The reduction from Q_real to Q_random correlates with the increase of noise from S_real = 0.67–0.69 to S_random = 0.96. It is apparent that the unreliable patterns are mostly responsible for the amount of noise of the mapping; accordingly, they correspond to noncoding patterns with high pattern entropy S^P and cause the folding code to be fuzzy and nonlocal. It is a distinctive feature of noncoding patterns that local interactions are of minor importance and that their structure is essentially determined by the tertiary context.

In Eq. (2) the overall performance of the perceptron is split into two contributions, and we have taken advantage of this representation to evaluate the number N_r of reliable patterns (by virtue of the approximation N_r ≈ N_1). Before discussing our estimates, let us posit that we feel confident of our results, although it might seem questionable that we draw conclusions about the sequence-to-structure mapping on the ground of speculations made on the simplest possible feed-forward neural network. However, we maintain that the inferences made in the present paper are not seriously limited in generality, since the perceptron provides the maximum possible information that can be extracted from the single-residue sequence. Actually, in the structure classification task, multilayer perceptrons, either hand designed or starting from scratch, and statistical approaches do not improve on the performance of the perceptron [8,23,24].

Some comments are in order as to the result of Sec. V, N_r/N = 35–40%, as well as to the role of reliable patterns in protein folding. First we note that it is reasonable to identify the reliable patterns with the coding residues of a protein that have been searched for in different works on the folding code. In [14] it has been estimated that 60–70% of the four-residue patterns are coding. The apparent discrepancy with our range is mainly due to the larger probability that noise affects the pattern owing to our larger input window. A further comparison can be made with the findings of simulations of the folding process [27]. The estimated value of 25% for the fraction of residues that presumably stabilize the folding nucleus compares satisfactorily to our estimate; this might hint at the possible involvement of reliable patterns in the formation of the folding nucleus, to the extent that this accompanies fast formation of elements of secondary structure [28]. This is, however, an open problem which needs more work to be definitely answered [28], although the event that reliable patterns take part in the early stages of folding is consistent with their strong intrinsic propensities for the native secondary structure.

We are now in a position to answer the question whether the essential uncertainty that affects the prediction of protein secondary structures is due to the inadequacy of the neural network approach or to intrinsic deficiencies of the mapping under study. The above considerations about noise suggest that there is an unavoidable upper bound to the prediction of structures that is inherent to the mapping based on local information and cannot be overcome by merely enlarging the data set. Our estimates of the p_i^H indicate that the randomlike component of the perceptron is polarized towards the most abundant class (random coil). This reproduces the experimental observation that perceptrons working on the protein structure task overestimate the most abundant class (random coil) and underestimate the less frequent classes (α helix and β sheet) [7]. Thus it appears that the randomlike component of a perceptron trained on a real data set, similarly to the perceptron trained on a noisy mapping, is driven by general pieces of information such as the composition in structures of the training set.

Finally, our analysis gives useful suggestions as to the best conditions for studying noisy mappings. In particular, we are able to rationalize the common strategy of using balanced training sets [7,25]. Actually, using training sets with quite similar fractions of the different structures amounts to reducing the effectiveness of the randomlike component and to favoring the emergence of reliable patterns. The overall efficiency Q may happen to decline slightly, while the number of correct guesses is much more uniformly distributed among the classes. Correspondingly, the overestimation of the most abundant class and the underestimation of the less abundant structures are sensibly reduced [7]. This improves in particular the prediction of the underrepresented β patterns.


APPENDIX

In this appendix we give the full details concerning the perceptron architecture as well as the data base used to cope with the particular task of protein structure prediction illustrated in Sec. V. In this task domain the mapping M goes from the space of primary sequences (residue sequences) to the space of secondary structures described in terms of the three major classes α helix, β sheet, and random coil (n_cl = 3). The input code now envisages one letter for each residue of the primary structure. For proteins the natural choice for the orthonormal binary code (see Sec. II) is N_S = 20, each position within the 20-tuple corresponding to one of the 20 possible amino acid residues. The input layer is designed to read patterns of 17 residues at a time (W = 17) and contains 17 × 20 = 340 input neurons.

The relevant information on the 62 proteins of the training set L62 is drawn from the Brookhaven Protein Data Bank. For the sake of having results as homogeneous as possible with those reported in the literature, the set has substantial overlap with the sets used in most of the papers devoted to the prediction of protein structures. The labels of the individual proteins are listed in [7]. The pertinent information on each protein includes the residue sequence and the atomic coordinates of the crystallized protein. Assignment of the secondary structure of each protein is done by processing the amino acid sequence with the DSSP program described in [29].

The learning set L62 has 11 361 residues and the following composition in structures: p_α^A = 0.25, p_β^A = 0.22, and p_coil^A = 0.53. The testing set T33 comprises 6634 residues, with p_α^A = 0.28, p_β^A = 0.22, and p_coil^A = 0.50. The feed-forward network we have used is a perceptron with a single layer of adjustable weights. The output layer has three real-valued neurons with activation values in the range ]0,1[, each coding for one of the three structures predicted. The nonlinear transfer function associated with each output neuron c is (1 + e^{-a_c^P})^{-1}, where a_c^P is the local field defined by

    a_c^P = \sum_{i=1}^{W N_S} w_{ci} I_i^P - \theta_c ,    (A1)

I_i^P being the activation value (either 1 or 0) of the ith input neuron corresponding to pattern P, and θ_c the adjustable threshold of the cth output neuron. The output turns out to be o_c^P = f(a_c^P).

The output is interpreted according to the winner-take-all rule; this amounts to identifying the prediction of the network with the class corresponding to the output neuron with the highest activation. The learning algorithm is the standard backpropagation [30], which is operated with a learning rate δ = 0.01. Initial values of the weights are chosen randomly from the uniform distribution in the range [-10^{-2}, 10^{-2}].

Two alternative learning strategies are usually envisaged. The first scheme (batch updating) prescribes that the weights undergo a one-shot change each time the network has scanned the whole training set. The function to minimize is the cumulative deviation from the desired output d_c^P,

    E = \sum_P E(P) = \sum_P \sum_c (o_c^P - d_c^P)^2 .    (A2)

The second scheme dictates that the weights are corrected after each pattern presentation (pattern updating), the error function being the function E(P) of Eq. (A2). The latter procedure was used in the predictions of protein structures mentioned in Sec. V. The learning process is stopped when the fractional change of the error function per cycle is less than 5 × 10^{-4}. Throughout this paper we measure the performance of the network in terms of Q, defined as the ratio of the correct guesses to the total number N of patterns classified.


[1] T. L. H. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys. 65, 500 (1993).
[2] B. Cheng and D. M. Titterington, Stat. Sci. 9, 2 (1994).
[3] M. Compiani, P. Fariselli, and R. Casadio, Int. J. Neural Syst. Suppl. 6, 195 (1995).
[4] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and B. W. Suter, IEEE Trans. Neural Netw. 1, 296 (1990).
[5] M. D. Richard and R. P. Lippmann, Neural Comput. 3, 461 (1991).
[6] B. Rost and C. Sander, J. Mol. Biol. 232, 584 (1993).
[7] P. Fariselli, M. Compiani, and R. Casadio, Eur. Biophys. J. 22, 41 (1993).
[8] R. C. Yu and T. Head-Gordon, Phys. Rev. E 51, 3619 (1995).
[9] M. Compiani, P. Fariselli, and R. Casadio, in Proceedings of the Workshop on Parallel Architectures and Neural Networks, edited by E. R. Caianiello (World Scientific, Singapore, 1991), pp. 227–237.
[10] J. D. Hirst and M. J. E. Sternberg, Biochemistry 31, 7211 (1992).
[11] M. Compiani, P. Fariselli, and R. Casadio, in Neural Networks in Biomedicine, edited by F. Masulli, P. G. Morasso, and A. Schenone (World Scientific, Singapore, 1994), pp. 313–332.
[12] M. Compiani, P. Fariselli, and R. Casadio, in Applications and Science of Artificial Neural Networks II, edited by S. K. Rogers and D. W. Ruck, SPIE Proc. Vol. 2760 (SPIE, Bellingham, WA, 1996), pp. 597–607.
[13] S. Chandrasekhar, Rev. Mod. Phys. 15, 1 (1943).
[14] S. Rackovsky, Proc. Natl. Acad. Sci. USA 90, 644 (1993).
[15] N. D. Socci, W. S. Bialek, and J. N. Onuchic, Phys. Rev. E 49, 3440 (1994).
[16] M. J. Rooman and S. J. Wodak, Nature (London) 335, 45 (1988).
[17] M. J. Sippl, J. Mol. Biol. 213, 859 (1990).
[18] P. Zielenkiewicz, D. Plochocka, and A. Rabczenko, Biophys. Chem. 31, 139 (1988).
[19] S. H. White and R. E. Jacobs, Biophys. J. 57, 911 (1990).
[20] O. B. Ptitsyn and M. V. Volkenstein, J. Biomol. Struct. Dynamics 4, 137 (1986).
[21] L. Zhong and W. C. Johnson, Jr., Proc. Natl. Acad. Sci. USA 89, 4462 (1992).
[22] D. L. Minor, Jr. and P. S. Kim, Nature (London) 371, 264 (1994).
[23] N. Qian and T. J. Sejnowski, J. Mol. Biol. 202, 865 (1988).
[24] F. Vivarelli, G. Giusti, N. Villani, R. Campanini, P. Fariselli, M. Compiani, and R. Casadio, Comput. Appl. Biosci. 11, 253 (1995).
[25] B. Rost, C. Sander, and R. Schneider, J. Mol. Biol. 235, 13 (1994).
[26] A. I. Khinchin, Mathematical Foundations of Information Theory (Dover, New York, 1957).
[27] V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich, Biochemistry 33, 10 026 (1994).
[28] A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich, Biochemistry 34, 3066 (1995).
[29] W. Kabsch and C. Sander, Biopolymers 22, 2577 (1983).
[30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature (London) 323, 533 (1986).