Improving protein identification from peptide mass fingerprinting through a parameterized...

Abstract

We have developed a new algorithm to identify proteins by means of peptide mass fingerprinting. Starting from the MALDI-TOF spectra and environmental data such as species, isoelectric point and molecular weight, as well as chemical modifications or number of missed cleavages of a protein, the program performs a fully automated identification of the protein. The first step is a peak detection algorithm, which allows precise and fast determination of peptide masses, even if the peaks are small or if they overlap. In the second step the masses and environmental data are used by the identification algorithm to search in protein sequence databases (SWISS-PROT and/or TrEMBL) for protein entries that match the input data. For that, a list of candidate proteins is selected from the database, and a score calculation provides a ranking according to the quality of the match. In order to define the most discriminant scoring calculation we analyzed the respective role of each parameter following two directions. The first one is based on filtering and exploratory effects, while the second direction focuses on the levels where the parameters intervene in the identification process. Thus, according to these two directions, all input parameters contribute to the score, however with different weights. Since it is difficult to estimate the weights in advance, they have been computed with a genetic algorithm, using a training set of 91 protein spectra with their environmental data. We tested the resulting scoring calculation on a test set of 10 proteins and compared the identification results with those of other peptide mass fingerprinting programs.

Improving protein identification from peptide mass fingerprinting through a parameterized multi-level scoring algorithm and an optimized peak detection

Robin Gras1, Markus Müller1, Elisabeth Gasteiger1, Steven Gay1, Pierre-Alain Binz1, William Bienvenut2, Christine Hoogland1, Jean-Charles Sanchez2, Amos Bairoch1, Denis F. Hochstrasser2, Ron D. Appel1

1 Swiss Institute of Bioinformatics, University Medical Center, Geneva, Switzerland
2 Central Clinical Chemistry Laboratory, Geneva University Hospital, Geneva, Switzerland

Correspondence: Dr. Robin Gras, Swiss Institute of Bioinformatics, 1, rue Michel-Servet, CH-1211 Geneva 4, Switzerland. E-mail: [email protected]

Keywords: Mass spectrometry, peak detection, peptide mass fingerprinting, identification

Abbreviations: TOF, time-of-flight

Running title: Improving Protein Identification from Peptide Mass Fingerprinting

1 Introduction

One of the tasks of proteomics is to identify the proteins expressed by an organism or tissue [1]. This requires several steps. The proteins are first isolated and some protein-specific attributes are measured. A protein sequence database is then screened in order to retrieve the protein or proteins that best match these attributes. Until recently, the attributes were most commonly determined by chemically extracting amino acid sequence information [2]. While these methods are very reliable and can be fully automated, they are slow and do not allow for high throughput identification. Hence new techniques for protein identification had to be developed. A major impulse came from mass spectrometry of large molecules. New methods such as matrix assisted laser desorption and ionization (MALDI) [3] [4] and electrospray ionization (ESI) [5], as well as new spectrometers [6], became available and made it possible to analyse proteins in small concentrations in a short time.
Among the various spectrometric methods are Fourier transform mass spectrometry (FTMS) [7], which provides a high mass resolution, and quadrupole time-of-flight (QTOF) [8], where ions of a small mass range are selected by a quadrupole ion trap and then transferred to a collision chamber before their fragments are analyzed. Further, there are reflectron time-of-flight
spectrometers (MALDI-TOF and ESI-TOF), which can measure masses over a large range with sufficient precision.

The currently most common method to identify proteins is to first enzymatically digest the proteins, then to determine the masses of the resulting peptides by peak detection on a MALDI-TOF or ESI-TOF spectrum, and finally to use the peptide mass fingerprints to search protein sequence databases for correct matches. Optimizing the peak detection and database search algorithms is thus the key to improving protein identification from peptide mass fingerprints.

1.1 Peak detection

Peak detection is an important step in the identification process. As sometimes only a few experimental peptide masses in the fingerprint match the theoretical masses in a database, failure to detect one peak can hinder the correct identification of a protein. On the other hand, if too many false peaks are considered, this may lead to erroneous database matches. Furthermore, it is important to precisely determine the peptide masses. The peak detection algorithm must also be able to correct calibration errors of the mass spectrometers. Finally, the process of peak detection should be fast and fully automated in order to allow for high throughput data handling.

1.2 Identification

The principle of protein identification using peptide mass fingerprinting is based on the comparison of the list of experimental masses with a database containing the theoretical peptide masses of known proteins. The goal is to find the protein or proteins whose peptide masses provide the best match with the experimental fingerprint. It is worth mentioning that several other attributes of proteins may be useful to characterize the likeness between the protein under investigation and identification candidates from the database [9]. Information about the species, the molecular mass or the isoelectric point of the whole protein can be very helpful to select the right protein.
Chemical modifications caused by biochemical mechanisms in the living cell or during the preparation of the experiment modify the peptide masses and also have to be taken into account while parsing the database.

Several programs exist that perform this kind of protein identification. They all use some of the available attributes and search various protein sequence databases. The critical question of this approach is to present the user with a ranking of the proteins that match the protein under investigation, which considerably facilitates the interpretation of the identification results. Most programs show scores associated with each protein, thus giving a degree of confidence in the matching protein. The simplest scoring method is to count the number of matching peptide masses. This is applied by the PeptideSearch program (http://www.mann.embl-heidelberg.de/Services/PeptideSearch/FR_PeptideSearchForm.html) which searches the nrdb database, as well as by the PeptIdent program [10] (http://www.expasy.ch/tools/peptident.html) which searches the SWISS-PROT and TrEMBL databases [11]. In addition, PeptIdent uses some of the annotations from SWISS-PROT to refine its search, taking into account known protein modifications (post-translational and processing of precursor molecules into mature chains and peptides). The Mowse program (http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse) [12] determines a score by considering the frequency of each peptide mass in its database (OWL), in order to give a stronger weight to the rarest peptides. This score also takes into account the presence of missed cleavages in matched peptides, modifying their weight in the score by a fixed factor (pFactor). The MS-Fit program (http://prospector.ucsf.edu/ucsfhtml3.2/msfit.htm) uses the same scoring method on the (non-redundant) NCBInr database.
ProFound [13] (http://prowl.rockefeller.edu/cgi-bin/ProFound) calculates a probability for the identification of the right protein, given by a Bayesian formula, and uses the distance between experimental and theoretical masses obtained from the NCBInr database.
Finally, the MassProfile program, included in the Darwin library (http://cbrg.inf.ethz.ch/) [14], also determines an identification score based on the probability of randomly obtaining a match of n experimental masses with n theoretical masses, given the interval of possible masses and the maximum allowed distance of masses accepted in this match.

All these algorithms utilize the various attributes presented above to control the number of proteins considered for the identification. However, they make little use of this information in their scoring calculation, since they use at most one or two of the attributes (distance between masses, presence of missed cleavages, mass distribution in the database, ...). These represent only a small part of the parameters that could influence the quality of identification. In order to better understand their respective role, we carried out a systematic study of the importance of each attribute in the identification process. This led to the definition of a new scoring scheme that takes into account maximal information from each attribute, thus allowing for a better discrimination of candidate proteins and facilitating the identification of the right protein.

This paper first presents an optimized automated peak detection algorithm and then details a new protein identification method, as well as its associated scoring procedure.

2 Materials and methods

2.1 Materials

2.1.1 Chemicals

SDS-PAGE molecular weight standards were purchased from Bio-Rad Laboratories (Hercules, CA, USA). Sequencing grade modified trypsin was purchased from Promega (Madison, WI, USA). Trifluoroacetic acid (TFA) and α-cyano-4-hydroxy-trans-cinnamic acid (ACCA) were purchased from Sigma (St. Louis, MO, USA). Acetonitrile (AcCN) HPLC grade was purchased from Fluka (Buchs, Switzerland). Methanol puriss. p.a. and sodium bicarbonate were purchased from Merck (Darmstadt, Germany).
Immobilized pH gradient strips were purchased from Amersham-Pharmacia-Biotech (Uppsala, Sweden).

2.1.2 Sample preparation

Bio-Rad molecular weight standards were separated by 1D-SDS electrophoresis [15]. The other protein samples were separated by two-dimensional gel electrophoresis (2-DE) [16]. Gels were stained with Coomassie Brilliant Blue (CBB) R250 (0.1% w/v), methanol (30% v/v) and acetic acid (10% v/v) for 30 minutes and were destained with repeated washes of methanol (40% v/v) and acetic acid (10% v/v) solution. Protein spots were excised and destained with 100 µl of a 50 mM ammonium bicarbonate solution at 37 °C for 45 minutes. Destaining solution was removed and the gel pieces were dried under vacuum. The gel pieces were generally reswollen with 20 µl of 20 mM ammonium bicarbonate and 4 µl of 0.1 mg/ml of trypsin. Usually after overnight incubation at room temperature, the gel was dried to evaporate solvent and volatile salts. Then 20 µl of 50% AcCN, 0.1% TFA were added for 10 minutes with sonication to extract peptides from the gel.

2.1.3 Mass spectrometry

Mass spectrometric measurements were performed on a Voyager™ Elite MALDI-TOF mass spectrometer (PerSeptive Biosystems, Framingham, MA, USA) equipped with a 337 nm nitrogen laser. The analyser was used in the reflectron mode at an accelerating voltage of 18 to 20 kV and a delay of extraction of 100 to 140 ns. Laser power was generally set about 20% above threshold
for matrix molecular ion production. Spectra were accumulated between 10 and 256 times. The matrix solution used was 4 mg/ml ACCA in 30% to 50% AcCN, 0.1% TFA.

2.1.4 Computer hardware

The programs are written in ANSI C++ and run on Unix and Windows systems. We also developed a Perl script that runs the peak detection on a Windows PC and the database search on a Unix server.

2.2 Peak Detection in MALDI-TOF Mass Spectra

2.2.1 Introduction

A MALDI-TOF spectrum is a signal, i.e. an array of floating point values that consists of a trend, noise and peaks. The trend is the signal produced by the electronics of the mass spectrometer that one would obtain if no material entered the mass spectrometer and in the absence of noise. It does not vary over small mass ranges (about 10 Dalton). The noise, which is caused by electronic disturbances and fragments of material, varies over small mass ranges (< 1 Dalton) with little correlation, i.e. each array value varies randomly and almost independently of its neighboring values. Peaks have a more or less predefined shape (see below) and are therefore strongly correlated. The notion of a peak may be misleading, because one "peak" actually consists of several so-called isotopic peaks [6] (Fig. 2 and 5).

Determining the monoisotopic mass of peaks is an old task [17] [18], and software accomplishing this task is usually delivered together with the spectrometer hardware. Unfortunately, the software we had at hand lacked the necessary flexibility and accuracy. A custom peak detection program therefore had to be conceived. It has been designed so as to yield a precise and fast localization of all the peaks, even if these are small or if several peaks overlap.

In order to detect peaks in a spectrum, we can apply a regression algorithm [19] [20]. Let $t_\theta$ be a template for a peak, where $\theta$ is a set of parameters (such as height and width). Then a peak is detected in a spectrum if its template fits a part of that spectrum, i.e.
if the distance between the spectrum and the template $|s - t_\theta|$ has a local minimum and is smaller than some threshold value.

The choice of $t_\theta$ is crucial. Three conditions must be fulfilled: 1) the match should be clear (high signal-to-noise ratio), 2) it should be precise (low deviation due to noise) and 3) it should be unique (local minima not too close to each other). The first two conditions can be solved analytically. They yield $t_\theta = p_\theta$, where $p_\theta$ denotes an ideal peak as it would appear in a noiseless signal. The third condition results in blurring the template $t_\theta$, i.e. reducing its high frequency part and thereby smoothing the error function. Hence it competes with the other two conditions. Canny [21] developed this theory for the case where only one pattern with a fixed shape is present. But as we will see below, we have to deal with peaks of variable shape, and the template has to adapt its form to the signal.

2.2.2 Application to MALDI-TOF Mass Spectra

Since multiply charged peptides are rarely observed in MALDI-TOF spectra, we can assume that all peptides carry a single charge, so the spacing between isotopic peaks is 1 Dalton. As the height of the peaks varies significantly, some disappear in the noise. At least in the case where the spectrum contains a lot of peaks, the noise is partially made of small peaks that overlap and can therefore no longer be detected (Fig. 1).

An isotopic distribution defines the probabilities that a molecule carries additional neutrons. For peptides of the same mass, there are several possible isotopic distributions. They depend on the
atomic composition and particularly on the number of sulfur atoms [6]. However, the differences are scarcely visible in a mass spectrum, because the atomic compositions of peptides with the same mass are similar. Thus we only consider an average isotopic distribution calculated by averaging all peptides with a mass in $[m, m+1]$ that are obtained from in silico digestion of the proteins in the SWISS-PROT database [22]. Let $p^{Iso}_m(i)$, $i \in [0..\infty[$, denote the probability of having $i$ additional neutrons in this average distribution. The probability of the monoisotopic peak $p^{Iso}_m(0)$ decreases with increasing mass $m$ (Fig. 2). This feature is crucial for a correct peak detection in different peptide mass ranges.

Figure 1: Mass spectrum of the Escherichia coli protein Prolyl-tRNA synthetase (SYP_ECOLI, P16659) digested with trypsin. The spectrum was acquired with a Voyager™ Elite MALDI-TOF mass spectrometer. Note that the mass unit is M/Z, where M is the mass in Dalton and Z the number of unit charges of a peptide. Here we consider only peptides with Z = 1. a) shows a part of the spectrum with a lot of peaks, while b) shows a noisy region. The dashed line shows the trend, while the upper and lower solid lines show the trend plus and minus the noise, respectively. Both panels plot counts against m (M/Z).

A template can now be obtained from $p^{Iso}_m$:

$$t_\theta(m') = t_{m,h_o,h,\sigma}(m') = h_o + h \cdot \sum_{i=0}^{\infty} p^{Iso}_m(i) \cdot e^{-\frac{(m'-m-i)^2}{2\sigma}}, \qquad (1)$$

where $\theta$ comprises the offset $h_o$, the height $h$, the monoisotopic mass $m$ and the width $\sigma$. The offset $h_o$ is necessary to correct errors in the trend estimation (see section 2.2.3).

Let us now define the error function $e$:

$$e(m) = \frac{1}{h^2 \cdot (m_1 + m_2)} \min_{h_o,h,\sigma} \sum_{m - m_1 \le m_i \le m + m_2} \left(s(m_i) - t_{m,h_o,h,\sigma}(m_i)\right)^2, \qquad (2)$$

where $m_i$ are the sampling values of the spectrum and $m_1 = 1$ and $m_2 = 5$ define the window in which equation (2) is evaluated. It must be large enough to contain the bulk of $t_{m,h_o,h,\sigma}(m')$. However, if it is too large, the $\sum s^2(m_i)$ term would dominate, and the contribution of $t_{m,h_o,h,\sigma}(m')$ to $e(m)$ would become marginal. The division of the sum by $h^2 \cdot (m_1 + m_2)$ normalizes $e$ with respect to the height and the size of the template. We normalize with $h^2$ because the shape of the template is an average of all possible shapes, and thus need not match a peak exactly, even in the absence of noise. This deviation grows linearly with the height, and we reduce its effect by the normalization. The task is now to find conditions for the error function $e$ that characterize the peaks. The first condition is $\partial e / \partial m = 0$, i.e. $e$ has a local minimum with respect to $m$. Usually, several local minima $m_i$ of equation (2) are found in a neighborhood (Fig. 3), and we accept $m_p$ as a possible peak if $e(m_p) \le e(m_i)$, $\forall m_i \in [m_p - (m_1 + m_2)/2,\; m_p + (m_1 + m_2)/2]$.

Figure 2: Average isotopic distributions $t_{m,h_o,h,\sigma}(m')$. The monoisotopic part is marked with the label 'mono'. a) For m = 1000 Dalton, b) m = 2000 Dalton and c) m = 3000 Dalton. $\sigma = 0.28$, $h_o = 0$ and $h = 1$. The horizontal axis is m' (M/Z).

From these we select those as real peaks which satisfy the following conditions:

$$e(m_p) < e_{max}, \qquad h(m_p) > h_{min} \cdot n, \qquad (3)$$

where $e_{max}$ and $h_{min}$ are thresholds and $n$ is the estimation of the noise around the peak. Since the quality and height of the peaks vary a lot, perfect values for $e_{max}$ and $h_{min}$ that are able to distinguish all 'true' peaks from noise do not exist. If the values are restrictive, i.e. if $e_{max}$ is small and $h_{min}$ is large, no 'false' peaks are detected, but we also lose some 'true' ones. Conversely, by increasing $e_{max}$ and decreasing $h_{min}$, more and more 'false' peaks appear (Fig. 6).

The values of $e_{max}$ and $h_{min}$ are linked to the values of the parameters one has to choose for the database search. For example, if the search values are restrictive, i.e.
the mass tolerance is low and the minimal number of peptide masses that must match is high, the values of $e_{max}$ and $h_{min}$ must be less restrictive in order for all 'true' peaks to be taken into consideration. The 'false' peaks do not change the result if they are not too abundant, because the probability that several
of them match the same protein, thus giving rise to a high score for this irrelevant protein, is low. On the other hand, if the database search parameters are less restrictive, $e_{max}$ and $h_{min}$ must not allow many 'false' peaks, because the probability of a false match is now higher and the result may change qualitatively.

Figure 3: Error function (2) in the vicinity of a peak at 1572.835 Dalton. The lowest minimum corresponds to the monoisotopic peak. Note the logarithmic scale of e. Upper panel: counts against m (M/Z); lower panel: lg(e) against m (M/Z).

2.2.3 Implementation

The first step is to remove the trend because the height of the peaks is only defined relative to this trend, and because in this way the numerical accuracy is improved. Ripley [20] describes several methods. The crucial point is the robustness of a method, i.e. the trend must not follow localized deviations, like peaks. Here we choose a simple approach to attain a robust fit. The spectrum is split into several small windows $s_i$ (width about 40 Daltons), and $s^{med}_i$ and $s^{low}_i$ are calculated in each window, where $s^{med}_i$ and $s^{low}_i$ are defined as follows: 50% of the values in $s_i$ are larger than $s^{med}_i$ and 50% are lower, while 95% of the values in $s_i$ are larger than $s^{low}_i$ and 5% are lower. Then we define the noise as $n_i = 2 \cdot (s^{med}_i - s^{low}_i)$. Finally, $s^{med}_i$ and $n_i$ are interpolated using cubic splines [19] to obtain the continuous trend $s^{med}(m)$ and noise $n(m)$ (Fig. 1).

If $m$ and $\sigma$ are given, it is straightforward to calculate $h_o$ and $h$ that minimize equation (2). Hence we have to seek the minima $e_i$ only in $m$ and $\sigma$. Since this algorithm is also used for high throughput processes such as the molecular scanner [23], execution time is crucial. A direct evaluation of equation (2) is too slow, and it is therefore necessary to perform a fast first search to find starting points for a more extended search.
This first search is done by fixing $\sigma = 0.2$ (this value neither produces too many minima, nor does it blur $e$ too much (see section 2.2.1)) and evaluating equation (2) for masses where the signal exceeds the noise. Because the template $t$ varies slowly with the mass, it does not have to be evaluated for each mass. We then calculate the minima of equation (2) and use the resulting masses as starting points for a more precise fit, where both $m$ and $\sigma$ vary.

It is possible that two peptides have similar masses, so that their peaks overlap, even if they are clearly distinguishable from the noise. In this case, the method described above may fail to detect both peaks. To solve this problem, all detected peaks are subtracted from the spectrum, and the algorithm is applied a second time.
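The fit for fixed $m$ and $\sigma$ can be sketched as follows. This is a minimal illustration, not the authors' C++ implementation: the function names, the pure-Python least-squares solution and the isotope probabilities used in the usage note are assumptions made for the example. Because the template of equation (1) is linear in $h_o$ and $h$ once $m$ and $\sigma$ are fixed, both follow from a 2x2 system of normal equations, and $e(m)$ of equation (2) is the normalized residual.

```python
import math


def isotope_shape(mp, m, sigma, p_iso):
    """Isotope sum of equation (1), without offset and height.

    p_iso[i] is the probability of i additional neutrons (truncated list).
    The exponent uses 2*sigma, as equation (1) is printed.
    """
    return sum(p * math.exp(-((mp - m - i) ** 2) / (2.0 * sigma))
               for i, p in enumerate(p_iso))


def template(mp, m, ho, h, sigma, p_iso):
    """Equation (1): offset plus height times the isotope sum."""
    return ho + h * isotope_shape(mp, m, sigma, p_iso)


def fit_error(masses, signal, m, sigma, p_iso, m1=1.0, m2=5.0):
    """Equation (2) for fixed m and sigma.

    Returns (e, ho, h) with ho and h chosen by linear least squares,
    or None when the window is too small or the fitted height is not positive.
    """
    pairs = [(isotope_shape(x, m, sigma, p_iso), s)
             for x, s in zip(masses, signal) if m - m1 <= x <= m + m2]
    n = len(pairs)
    if n < 3:
        return None
    sg = sum(g for g, _ in pairs)
    ss = sum(s for _, s in pairs)
    sgg = sum(g * g for g, _ in pairs)
    sgs = sum(g * s for g, s in pairs)
    det = n * sgg - sg * sg
    if abs(det) < 1e-12:
        return None
    h = (n * sgs - sg * ss) / det      # fitted height
    ho = (ss - h * sg) / n             # fitted offset
    if h <= 0.0:
        return None
    sse = sum((s - ho - h * g) ** 2 for g, s in pairs)
    return sse / (h * h * (m1 + m2)), ho, h
```

For example, with a hypothetical truncated distribution `p_iso = [0.55, 0.30, 0.11, 0.03, 0.01]`, evaluating `fit_error` on a grid of candidate masses with `sigma = 0.2` and keeping the local minima that satisfy conditions (3) corresponds to the fast first search; the refinement over both $m$ and $\sigma$ is not shown.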

2.3 Calibration

The time of flight measured by a MALDI-TOF mass spectrometer can be affected by a significant error. After converting the time of flight into the peptide mass [6], this can yield an error of up to 1 Dalton. However, most of that error can be corrected afterwards by a linear transformation:

$$m_{calib} = a \cdot m + b. \qquad (4)$$

The coefficients $a$ and $b$ can be determined in three different ways. 1) They are defined externally. 2) They are calculated using internal standards, i.e. peptides with known masses that appear in the spectrum. This method works very well if internal standards are present and if they are visible in the spectrum. It then reduces the error to values smaller than 0.05 Dalton. 3) They are calculated with a maximum likelihood method. This method is based on the fact that the mass distribution of peptides is not at all uniform. Firstly, the distribution is peaked at certain masses separated by 1 Dalton, and secondly it falls off with higher masses (Fig. 4). For a detailed discussion see [22]. Let $P(m)\Delta m$ be the probability of finding a mass in $[m, m + \Delta m]$. For a set of peaks with masses $m_i$, $a$ and $b$ are chosen to maximize the total probability $\sum_i P(a \cdot m_i + b)\Delta m$. This method is independent of internal standards, but it only works for initial errors that are smaller than 0.5 Dalton, making the error less than 0.2 Dalton in most cases.
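As an illustration of the second option, the coefficients of equation (4) can be estimated from internal standards by ordinary least squares. The text does not say how $a$ and $b$ are computed from the standards, so the least-squares choice, the function names and the example masses below are assumptions for this sketch.

```python
def fit_calibration(measured, reference):
    """Least-squares estimate of a and b in m_calib = a * m + b.

    measured:  masses of the internal standards as detected in the spectrum
    reference: their known masses, in the same order
    """
    n = len(measured)
    if n < 2 or n != len(reference):
        raise ValueError("need at least two matched internal standards")
    mean_x = sum(measured) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    if sxx == 0.0:
        raise ValueError("internal standards must have distinct masses")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(measured, reference))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def calibrate(masses, a, b):
    """Apply equation (4) to every detected peak mass."""
    return [a * m + b for m in masses]
```

With two standards the line passes exactly through both points; with more, the fit averages out the individual measurement errors. A call such as `a, b = fit_calibration([1000.35, 2000.05], [1000.5, 2000.4])` uses invented masses purely for demonstration.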

Figure 4: a) Distribution of theoretical peptide masses obtained by a virtual digestion of all proteins in SWISS-PROT with trypsin (without missed cleavages or modifications). Note how the distribution falls off for higher masses. b) Detail of a). The distribution is peaked at given masses. Both panels plot counts against mass in Dalton.

2.4 Identification by peptide mass fingerprinting

2.4.1 Problems

Identification by peptide mass fingerprinting uses a set of experimental peptide masses obtained from the mass spectrum after peak detection, as well as information about the species, the isoelectric point or the molecular mass of the searched protein. These experimental masses are compared to a database of peptide masses, i.e. a database of in silico digested proteins. The identification
algorithm searches for the protein with the best match between its theoretical peptide masses and the experimental masses. Other attributes of the searched protein are also taken into consideration and matched to their corresponding values in the database. This method involves various problems that influence the quality of the identification.

First of all, we need to know which database to use, and how it has to be parsed. There are two approaches to using a database: either we search a database of protein sequences which is parsed linearly, each sequence being virtually digested to progressively determine the peptide masses, or we build an index of all possible peptide masses (sorted in ascending order) by an off-line digestion of a protein sequence database. In both cases we consider possible modifications and missed cleavages. The first method has the advantage of using less disk space, because everything is calculated on-line, and thus does not need to be stored. By changing the digestion rules, it also easily accommodates different enzymes for the digestion. Nevertheless, it could require a longer parsing time due to the fact that digestion operations and, especially, all the combinations of modifications that could occur on peptides have to be computed on-line. The second method has the advantage of retaining all possible peptide masses and therefore avoiding the combinatorial treatment of modifications during the search. Its drawbacks are the considerable additional space needed to store the index, as well as the time necessary to update it.

A second problem arises from the large number of parameters that influence the identification process. Indeed, as we have already seen, modifications and missed cleavages can occur on a protein and modify its peptide masses. If we allow for each theoretical peptide to carry no, one or several modifications and for the enzyme to miss 0, 1 or 2 cleavage sites, this strongly increases the number of theoretical peptide masses the program must consider. Also, identification
Also, identi�cationalgorithms use various thresholds that can appreciably modify the search results. Examples arethe molecular weight range, the pI range, the minimum number of matches, the allowed di�erencebetween experimental and theoretical masses (mass tolerance), etc. For these reasons, the usershould usually have an a priori idea of the experimental context, so that an optimal choice of theparameter values will facilitate the interpretation of the results.The third problem results from the way the resulting proteins are ranked by the identi�cationalgorithm. Depending on the parameters that have been selected for the search, the number ofproteins in the database that match the experimental data can be very large. The program musttherefore associate a score with each candidate protein, and thus allows to quantitate the con�dencein its match.2.4.2 ParametersIn order to choose an e�cient method to handle the above-mentioned problems, the way parametersare used has been formalized. This has implications on the choice of the database structure and onthe calculation of the score associated to each candidate protein. Parameters can be characterizedin two ways. The �rst one concerns their e�ects on the quality and e�ciency of the identi�cation.The second one is linked to the level in the identi�cation process at which a parameter intervenes.Regarding the �rst way, one considers the fact that the parameters have two opposite e�ectsduring the search: an \exploratory" e�ect and a \�ltering" e�ect.The exploratory e�ect allows to increase the size of the search space, that is to increase thenumber of candidate proteins. Indeed, the �rst di�culty of the identi�cation is to be sure toinclude the correct protein in the list of candidate proteins. Therefore the tolerance in the set ofconsidered proteins and masses must be high enough to �nd the right protein. 
Parameters that areinvolved in this class of e�ects are: the type and number of modi�cations applied to proteins in thedatabase, the maximum number of missed cleavages, the maximum distance between experimental

and theoretical masses, the minimum number of matched peptides necessary for a protein to be selected, and the number of peaks returned by the peak detection program.

The second difficulty of the identification is to minimize the number of candidate proteins, in order to avoid losing the searched protein in the list of results. Therefore, it is important to efficiently filter the results and eliminate the least likely proteins from the list of candidates. Parameters with such a filtering effect include the species (to reduce the number of proteins to be considered), the molecular mass and the isoelectric point (to eliminate proteins whose values are too far from the experimental ones). Moreover, some of the parameters mentioned above for their exploratory effects, like the maximum distance of masses, the minimum number of matched peptides or the number of detected peaks, also have some filtering effects, depending on the thresholds.

As we can see, the main difficulty consists in finding a compromise between these two aspects. On one hand, one wants to be sure to consider enough candidate proteins, therefore one has to increase the exploratory effect. On the other hand, one seeks to clearly identify the right protein and therefore has to filter the results. Depending on their exploratory or filtering nature, parameters may have a notable effect on the processing time needed for the identification. The more exploratory effects are used, the longer the search time will be. The sooner the filtering effects are applied, the shorter it will be. The quality and efficiency of the identification will thus highly depend on the choices of the parameter values.

The second way of characterizing parameters is based on the levels at which parameters participate in the identification process. In the case of two-dimensional electrophoresis (2-DE), three levels can be considered. The first one, the "mass level", corresponds to the choice of masses used to match a protein.
At this level, we want to characterize the degree of match between a mass found in the spectrum and the mass of a peptide of the searched protein. The next level, the "protein level", consists in the identification of a protein at a given position in the 2-DE gel. Information from the mass level is coupled with information about the whole protein, in order to determine the best candidate protein. Finally, at the "contextual level", information about the two-dimensional environment (context) of the selected proteins from level 2 is taken into account to refine the identification at each position in the gel.

At the mass level, the first goal is to determine the quality of a peak, that is to determine when a peak may be considered to be a "true" peak. For that purpose, parameters such as peak intensity, peak width or the peak's fit with a theoretical isotopic profile (see section 2.2) can be used. A level of confidence is also defined for the match of an experimental mass with a theoretical mass in the protein database. This is achieved with the help of parameters such as the number and type of modifications, the number of missed cleavages necessary for the match, or the hydrophobicity value (GRAVY) [24] of the corresponding peptide.
The latter makes it possible to estimate the probability of finding the peptide in the mass spectrum (the hydrophobicity value is important for the ability of a peptide to fly in the mass spectrometer).

At the second level, the protein level, we search in the set of candidate proteins for the protein showing the best correspondence with all information available from the gel and the spectrum. Values obtained at the mass level, as well as parameters describing the whole protein, can be used. Such parameters are the molecular mass and isoelectric point, but also the percentage of the protein sequence that is covered by peptides identified at level 1, or the standard deviation of the distance between theoretical and experimental peptide masses.

The contextual level adjusts the identifications obtained from the previous levels by taking into account the environment. For each position in the 2-DE gel where identification is attempted, the points in the neighborhood are considered. The distribution of the masses used for this identification, the distribution of the identified proteins, as well as of the parameters used in the previous steps are considered [23]. This method makes it possible to validate or invalidate certain parameters,
thus altering the results of the previous levels. In this way, one can imagine an iterative methodthat gradually re�nes the identi�cation by successive application of the three levels.2.4.3 The algorithmInitial choicesAs we have seen, the choice of parameters as well as the point in time where they are used isdecisive for the e�ciency of the search. The choice of parameters has to take into account thecompromise between the sensitivity and selectivity of the search. Moreover, the calculation ofan identi�cation score has to take into account the nature of parameters and the level at whichthey intervene. A preliminary study showed the importance of these parameters (see 3.2.1). Wetherefore developed a new identi�cation tool based on the role and the relative importance of thevarious parameters, in order to determine a score allowing the best possible discrimination betweenthe searched protein and the other candidate proteins. The algorithm limits the parameters withexploratory e�ects, while preserving enough sensitivity to be able to �nd most of the proteins.In that way, by strongly limiting the number of possible combinations arising from modi�cationand missed cleavage parameters, one can obtain a fast and highly discriminant search algorithm,which does not produce too many candidates. The speed of the algorithm is also essential whenit comes to automation of the process for large-scale identi�cation projects. Only one missedcleavage is allowed, as well as the following modi�cations: cysteine carboxymethylation, acrylamideadducts to cysteines and oxydized methionines. For these modi�cations, we permit only 0, 1 orall corresponding amino acids to be modi�ed, in order to avoid too many combinations. Thus,the database can be parsed linearly and digested on-line, which avoids the use of a voluminousmass index. 
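To illustrate how restricting each modification to 0, 1 or all of its target residues limits the combinatorics, the following minimal Python sketch enumerates the candidate masses of a single peptide. The function names, the data layout, the mass shifts (commonly quoted monoisotopic values) and the rule that the two cysteine modifications exclude each other on the same residue are assumptions made for this example; they are not taken from the program described here.

```python
from itertools import product

# Monoisotopic mass shifts in dalton (assumed standard values, not from the paper).
MODIFICATIONS = {
    "carboxymethyl_C": ("C", 58.00548),
    "acrylamide_C": ("C", 71.03711),
    "oxidation_M": ("M", 15.99491),
}


def allowed_counts(n_sites):
    """Permitted numbers of modified residues for one modification: 0, 1 or all."""
    return sorted({0, min(1, n_sites), n_sites})


def candidate_masses(sequence, unmodified_mass):
    """Enumerate (mass, counts) pairs for one peptide under the 0/1/all rule.

    Assumption: a residue carries at most one modification, so the two
    cysteine modifications together may not exceed the number of cysteines.
    """
    names = list(MODIFICATIONS)
    choices = [allowed_counts(sequence.count(MODIFICATIONS[n][0])) for n in names]
    results = []
    for counts in product(*choices):
        used = {}
        for name, k in zip(names, counts):
            residue = MODIFICATIONS[name][0]
            used[residue] = used.get(residue, 0) + k
        if any(total > sequence.count(res) for res, total in used.items()):
            continue  # more modifications requested than residues available
        shift = sum(k * MODIFICATIONS[n][1] for n, k in zip(names, counts))
        results.append((unmodified_mass + shift, dict(zip(names, counts))))
    return results
```

Under these assumptions the number of variants per modification is at most three, whatever the number of modifiable residues, which is what keeps an on-line digestion of the database tractable.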
To improve efficiency, the database (SWISS-PROT and TrEMBL in FASTA format) has been split up into about 40 different sections, each of which contains the sequences of a specific species or taxonomic category. A species tree was built so that only the part of the database corresponding to the user-specified organism or range of organisms is parsed. Finally, we consider the whole set of parameters with filtering effects, in the hope of modulating their usage and thus avoiding the effects of fixed thresholds, which too radically eliminate interesting candidate proteins.

Definition of the score

The main difficulties in the definition of a score calculation are to determine the most important parameters, their relative weights and the way to integrate the whole set of parameters into the score calculation. For this reason, we use the parameter levels defined above to determine a score calculation using their respective properties.

Parameters of level 1, the mass level, serve to calculate a level 1 score associated with each matching peptide. For a given protein, the contribution of the parameters of level 1 is the sum of the level 1 scores of its peptides. It can be seen as an extension of the notion of number of matches used by most of the existing identification tools, which count the number of experimental masses matching theoretical peptide masses of the candidate proteins. The more identified masses a protein has in the mass spectrum, the higher the confidence in its identification. While tools such as PeptIdent and PeptideSearch assign a weight of either 0 or 1 to each peptide mass, depending on whether or not it is a match, our idea is that the weight associated with a peptide mass can be modified according to parameters of level 1. This gives an indication of the importance of a mass in the score calculation.
We use four parameters at this level: the number of chemical modifications, the number of missed cleavages, the intensity of the corresponding peak in the mass spectrum, and the hydrophobicity coefficient. Then we calculate the first part of our score ($S_1$) by:

\[ S_1 = \sum_{j=0}^{N} \mathrm{score}_1(j), \]

where

\[ \mathrm{score}_1(a) = (\mathrm{coef}_m)^{n_m(a)} \cdot (\mathrm{coef}_c)^{n_c(a)} \cdot \mathrm{coef}_i(a) \cdot \mathrm{coef}_h(a), \]

where $N$ is the number of matched peptides, $\mathrm{score}_1(a)$ the level 1 score associated with peptide $a$, $\mathrm{coef}_m$ the modification coefficient, $n_m(a)$ the number of modifications in peptide $a$, $\mathrm{coef}_c$ the missed cleavage coefficient, $n_c(a)$ the number of missed cleavages in peptide $a$, $\mathrm{coef}_i(a)$ the peak intensity coefficient of peptide $a$, and $\mathrm{coef}_h(a)$ the hydrophobicity coefficient of peptide $a$. In this expression, the modification and missed cleavage coefficients are fixed for all peptides and all proteins. However, their effect grows as a power of the number of modifications and missed cleavages that are present in the peptide. $\mathrm{coef}_h(a)$ depends on the hydrophobicity of peptide $a$ (the weaker the hydrophobicity, the higher $\mathrm{coef}_h(a)$), while $\mathrm{coef}_i(a)$ is proportional to the peak intensity of peptide $a$ (the higher the peak intensity, the higher $\mathrm{coef}_i(a)$).

Parameters of level 2 are used to compute coefficients that are then applied to the previously defined score. Indeed, at level 2, the parameters concern the whole protein, so they have to directly modify the value of the score associated with the protein. Four parameters are used at level 2: the molecular mass of the protein (Mr), its isoelectric point (pI), a coverage coefficient (the percentage of the protein sequence covered by the matched peptides) and a standard deviation of the distances between experimental and theoretical masses. The score of level 2 ($S_2$) is calculated as:

\[ S_2 = \frac{1}{\mathrm{coef}_e} \cdot \mathrm{coef}_w \cdot \mathrm{coef}_p \cdot \mathrm{coef}_r, \]

where $\mathrm{coef}_e$ is the standard deviation coefficient, $\mathrm{coef}_w$ the molecular weight coefficient, $\mathrm{coef}_p$ the isoelectric point coefficient and $\mathrm{coef}_r$ the coverage coefficient. The criterion for considering the mass distance between experimental and theoretical masses for all matched masses is based on the fact that the more constant this distance is for all matched masses, the lower the likelihood that the matches happened randomly.
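The level 1 and level 2 scores defined above can be sketched in a few lines of Python. This is only an illustration of the two formulas: the class and function names are hypothetical, and the coefficient values used in the usage note are invented for the example rather than learned values.

```python
from dataclasses import dataclass


@dataclass
class MatchedPeptide:
    n_modifications: int         # nm(a)
    n_missed_cleavages: int      # nc(a)
    coef_intensity: float        # coefi(a)
    coef_hydrophobicity: float   # coefh(a)


def score1(peptide, coef_m, coef_c):
    """Level 1 score of one matched peptide."""
    return (
        coef_m ** peptide.n_modifications
        * coef_c ** peptide.n_missed_cleavages
        * peptide.coef_intensity
        * peptide.coef_hydrophobicity
    )


def level1_score(peptides, coef_m, coef_c):
    """S1: sum of the level 1 scores over all matched peptides."""
    return sum(score1(p, coef_m, coef_c) for p in peptides)


def level2_score(coef_e, coef_w, coef_p, coef_r):
    """S2: protein-level coefficients, divided by the standard deviation coefficient."""
    return (1.0 / coef_e) * coef_w * coef_p * coef_r
```

For instance, with coef_m = 0.5 and coef_c = 0.1, an unmodified, fully cleaved peptide with unit intensity and hydrophobicity coefficients contributes 1 to S1, whereas a peptide with one modification and one missed cleavage contributes only 0.05 times the product of its intensity and hydrophobicity coefficients.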
The notion of a constant mass distance was refined to take into account the calibration error of the measuring device. Thus, we perform a robust, iterative linear regression [19] on all matched masses, and eliminate the masses that are too far from the regression line (which are more likely to be false matches). We then calculate the standard deviation of the matched masses around this line. This regression is iterative, as it is performed in several steps, each step eliminating the masses farthest from the regression line, the line then being recalculated based on the new set of masses. The iteration is stopped when no mass has been eliminated in the previous step, or when a given minimum number of masses is reached. The standard deviation calculated at this last step indicates how well the mass alignment corresponds to the supposed spectrometer error. Moreover, one can expect the linear regression to compensate for some calibration errors occurring during the peak detection, thus stabilizing the overall algorithm.

Mr and pI coefficients are non-linear. We define several thresholds for the distance between experimental and theoretical values of Mr and pI, and then associate a coefficient with each of these thresholds. The more the theoretical values move away from the experimental values, the weaker the coefficient. The coverage coefficient is proportional to the percentage of the sequence that is covered, therefore the higher the percentage, the higher the coefficient.

Finally, the total score associated with a protein is given by the expression:

\[ \mathrm{score} = (S_1)^{\alpha} \cdot S_2, \]

where $\alpha$ is a weight for the importance of parameters of level 1 against those of level 2. Parameters of level 3 have not yet been taken into account, but they will be used within the scope of the "molecular scanner" [23].

The algorithm

The algorithm (1) used for the identification can be summarized as follows in a pseudo-programming language.

    algorithm 1 identification
    enter_user_data()
    result_protein_list = ∅
    for all species ∈ species_list do
        for all databases ∈ databases_list do
            for all protein ∈ protein_list do
                {digestion of the protein and creation of the list of matched peptides}
                build_matched_peptide_list(peptide1_list)
                if number_match > match_threshold then
                    {we take out from the peptide list those with a mass too far from the regression line}
                    regression(peptide1_list, peptide2_list)
                    score_calculation(protein, peptide2_list, score)
                    add_result_protein_list(protein, score)
                end if
            end for
        end for
    end for
    sort_result_protein_list(result_protein_list)

2.5 The learning

The score calculation and the peak detection that we use involve many coefficients (some also requiring several thresholds) that are associated with the various parameters. These coefficients determine the relative importance of each parameter in the score calculation, in order to discriminate as well as possible the right protein from the other candidate proteins. We use a learning algorithm to determine the coefficients and threshold values that allow for the best discrimination. For this reason, the peak detection and identification parts of the algorithm have been unified to adjust all the parameters involved in the whole process, from spectrum analysis to identification. A genetic algorithm [25] has been applied to a training set of already identified proteins.
This algorithm searches for the coefficient values that allow the identification algorithm to identify the right protein, with its score being as distinct as possible from the scores of the following proteins in the ranked list of candidate proteins.

2.5.1 The genetic algorithm

For the learning phase, 36 variables have been defined, representing all the parameters and thresholds needed to calculate the score. Among these 36 parameters, 33 have real values and only 3 have integer values. Therefore the parameters are coded as a vector (chromosome) of 36 real values (genes). We use a non-linear mutation operator [26] for genetic algorithms working with real values. This operator decreases the mutation effect over the generations, favoring the convergence of the values associated with the genes. We use a classical crossover operator and a readjusting fitness function operator [25], thus avoiding too fast a convergence of the algorithm. We have developed an extension to the classical genetic algorithm that uses two populations with different convergence levels, which improves the quality of the results. The population with the high convergence level contains 26 chromosomes and the one with the weak convergence level contains 44 chromosomes, each of them representing a particular set of parameters. We define a fitness function whose value characterizes how well our scoring function can discriminate the right protein. For the parameter values of each chromosome, we apply the peak detection and identification algorithms to a subset of spectra from the training set. The results of these algorithms are then used to calculate the fitness value associated with each chromosome as follows:

\[ \mathrm{value} = \begin{cases} \dfrac{\mathrm{score}_0}{\mathrm{score}_0 + \mathrm{score}_1} & \text{if } R_{prot} = \mathrm{prot}_0, \\[1ex] 0.5 - (\mathrm{position}(R_{prot}) \cdot 0.05) & \text{else,} \end{cases} \]

where $\mathrm{score}_i$ is the score of the $i$th protein from the list of results, $R_{prot}$ the name of the searched protein, $\mathrm{prot}_i$ the name of the $i$th protein from the list of results, and $\mathrm{position}(R_{prot})$ the position of the right protein in the list. The total fitness of the chromosome is the average of these values for the subset of spectra.

3 Results and discussion

3.1 Peak detection

Fig. 5 shows a region of the SYP_ECOLI spectrum with peaks and their fits. Only a small fraction of the peaks could be interpreted as peptides of SYP_ECOLI; the other peaks may be due to impurities, protein fragments or modifications. The figure shows that small peaks may be important for identifying a protein.

It is not a priori clear whether the simple classifier given by equation (3) is sufficient to separate 'false' peaks from 'true' ones.
Therefore we performed a peak detection for the ten spectra used for testing the identification algorithm (section 3.3) with threshold values of emax = 1 and hmin = 1, which were chosen to be only slightly restrictive. We plotted lg(e) versus lg(hn) (Fig. 6). The plot shows a strong overlap between the 'true' and the 'false' peaks, and the classifier (3) is not able to separate all of them. But the values of emax and hmin given by the learning algorithm (section 3.2) indicate that it is more important for the identification to consider all 'true' peaks, even if some 'false' ones mix in. Another result is the strong correlation between e and hn, i.e. the higher the peaks the better the fit. This is mainly due to the fact that we normalized the error function (2) with respect to the height.

3.2 The learning

3.2.1 Preliminary study

In order to get an idea of their influence, a study has been carried out on the effect of the main parameters upon the quality and speed of the identification. Studies have been performed by others, but without the use of experimentation [27]. For our study, the PeptIdent tool was employed to identify a set of 20 known proteins, each time varying the values of the available parameters. A first result showed the dominant importance of the filtering parameters, especially the choice of a specific species and, to a lesser degree, the information about molecular weight and isoelectric point. Without these parameters, the correct protein was often lost among a very large set of candidate proteins.

[Figure 5 plot: counts (0 to 6000) versus m (M/Z) (1180 to 1210), with labelled peaks at 1179.561, 1185.574, 1193.584, 1201.588 and 1207.533]

Figure 5: Detected peaks in the SYP_ECOLI spectrum (dots) and their fits (solid line). The values of the error function e(m) were: e(1179.561) = 0.37 × 10^-3, e(1185.574) = 0.45 × 10^-3, e(1193.584) = 2.63 × 10^-3, e(1201.588) = 1.68 × 10^-3 and e(1207.533) = 2.53 × 10^-3. Only the peaks at m = 1185.574 and m = 1207.533 match a peptide of SYP_ECOLI when only chemical modifications and one missed cleavage are considered in PeptIdent.

The analysis also highlighted the strong effect of the modification and missed cleavage parameters upon the number of generated candidates. Indeed, PeptIdent can take into account annotations from SWISS-PROT entries and chemical modifications, which cause a combinatorial explosion in the number of different masses that are possible for a single peptide. Therefore, the quality of the results often deteriorates when one allows the whole set of modifications or, even worse, one or two missed cleavages. However, the analysis also showed that certain proteins (when very few peptides from the protein are found in the spectrum and when they are modified or incompletely digested) cannot be found without the use of at least one of these parameters.

3.2.2 Genetic algorithm

We have selected a set of 91 proteins with known identification (identified by at least two methods, including peptide mass fingerprinting, microsequencing, gel matching and amino acid composition analysis) as training set. We have performed several learning phases, gradually increasing the number of parameters and the number of proteins in the training set, and varying certain parameters of the genetic algorithm that influence its convergence level. Each application of the peak detection and identification algorithm takes about 1 minute (on an Ultra Sparc Station 5), so it is not possible to test the whole set of 91 spectra for each chromosome. Instead, we randomly choose 20 spectra for each chromosome and define the fitness to be the average of their scores.
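The fitness evaluation of section 2.5.1, estimated on a random subset of spectra as just described, can be sketched as follows. This is a minimal illustration under stated assumptions: `identify` is a hypothetical callable standing for the combined peak detection and identification run, positions are counted from 0 (so that the first protein is prot0), scores are assumed positive, and the handling of a right protein that is absent from the result list is an invented convention, since the text does not specify it.

```python
import random


def spectrum_fitness(ranked, right_protein):
    """Fitness value for one spectrum.

    ranked: list of (protein_name, score) pairs sorted by decreasing score.
    """
    names = [name for name, _ in ranked]
    if names and names[0] == right_protein:
        score0 = ranked[0][1]
        score1 = ranked[1][1] if len(ranked) > 1 else 0.0
        return score0 / (score0 + score1)
    # Assumption: an absent protein is treated as ranked just after the list.
    position = names.index(right_protein) if right_protein in names else len(names)
    return 0.5 - position * 0.05


def chromosome_fitness(identify, chromosome, training_set, subset_size=20, rng=random):
    """Average fitness of one chromosome over a random subset of the training set.

    training_set: list of (spectrum, right_protein) pairs.
    identify(chromosome, spectrum) must return a ranked result list.
    """
    subset = rng.sample(training_set, min(subset_size, len(training_set)))
    values = [
        spectrum_fitness(identify(chromosome, spectrum), right)
        for spectrum, right in subset
    ]
    return sum(values) / len(values)
```

With this definition a correct first-rank identification always yields a value of at least 0.5, and the value approaches 1 as the gap between the first and second scores grows, whereas a wrong first-rank identification yields less than 0.5.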
We can estimate the execution time of our learning algorithm for 100 generations: 100 × 70 × 20 = 140,000 minutes (about 100 days). Due to the time needed for a complete execution of the algorithm, we currently only have partial results. The version of our learning algorithm that uses the whole set of 36 parameters presented in this article and the 91 spectra of the data set was still running when the article was submitted. However, we present here the results obtained by the previous execution, using 29 parameters and 58 spectra during 24 generations [Note: the full results will be included in the final paper].

[Figure 6 plot: three panels a, b and c, each showing lg(e) (−4 to 0) versus lg(h) (0 to 2.5)]

Figure 6: lg(e) versus lg(hn) for the peaks of the ten test spectra with emax = 1 and hmin = 1. a) lg(e) versus lg(hn) for the peaks that were considered as 'false' or at least uncertain after a visual examination of the spectra. b) lg(e) versus lg(hn) for the peaks that were considered as 'true'. c) lg(e) versus lg(hn) for the peaks that matched a peptide mass of the corresponding protein. The solid lines indicate the values of emax = 0.5 and hmin = 2.2 obtained by the learning algorithm.

The parameters of this execution are: the two parameters of peak detection, emax and hmin; the parameter calibFact, which influences the calibration weight in the peak detection; minMatch, which gives the minimum number of matches necessary to consider a protein as a candidate; deltaMass, the maximum allowed tolerance of masses; the two coefficients coefReg1 and coefReg2, which are the thresholds for eliminating masses with linear regression; the coefficient coefMiss applied to a missed cleavage; the coefficient coefModif applied to a modification; the four coefficients MWcoef1 to MWcoef4 applied to the deviation of molecular masses and associated with the four threshold parameters MWthres1 to MWthres4 for the deviation of molecular masses; the four coefficients PIcoef1 to PIcoef4 applied to the deviation of isoelectric points and associated with the five threshold parameters PIthres1 to PIthres5 for the deviation of isoelectric points; the two parameters nbMatchThres1 and nbMatchThres2, which determine the threshold at which the iteration on the linear regression is stopped; and power ($\alpha$), the weight applied to the parameters of level 1 against those of level 2.

Figure 7 shows the results obtained with these
parameters. The first graph, fitness, shows the algorithm convergence with three curves. The lowest one corresponds to the average fitness of the population with high convergence level, and the following one to that of the population with weak convergence level. We can see a very good convergence of the population with high convergence level, with a maximum average fitness of 0.8123 for 26 chromosomes, corresponding to 520 identifications. The uppermost curve gives the value of the best chromosome for each generation, with a maximum at 0.9315 (20 identifications). These chromosomes are used to determine the best parameters for the identification algorithm at the end of the learning step. This is why we present the evolution of the parameters of the best chromosomes in the following graphs. However, one should note that these results depend on the data used for the learning. A lack of variety in the data for a given parameter can bias the results obtained. In the future it will be important to repeat the learning with a larger set of data and as much diversity as possible.

The parameters that present the clearest results are hmin, minMatch and coefMiss. The high convergence of hmin shows that the compromise between the noise height and the minimal height of peaks is essential for peak detection. The results for minMatch show very clearly that the exploratory effects of this parameter are more important than its filtering effects, thus avoiding the loss of small proteins among the candidate proteins. The very weak values of the coefficient associated with the missed cleavage show that the combinatorial growth due to the use of missed cleavages implies such a large increase of false matches that the weight associated with peptides with missed cleavages must be drastically reduced. This corresponds to cases with a very good digestion. We can also deduce that for an algorithm that does not incorporate a penalizing factor for missed cleavages, it is preferable, in case of good digestion, not to allow for missed cleavages at all.

Some other results are also rather clear, such as the high value of deltaMass, which gives an important exploratory effect at level 1 in the matching of peptides; this effect can be compensated for by the filtering effect of the linear regression, used only at level 2. The high values of parameter coefReg1 show that it is preferable to eliminate masses only when they are far enough from the line of the linear regression (level 1). In any case their weight is lowered by the value of the standard deviation (level 2).
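The iterative regression of section 2.4.3, whose elimination threshold is discussed above, can be sketched as follows. The sketch makes several assumptions that the text does not confirm: it uses an ordinary least-squares line instead of the robust fit of [19], it regresses the mass error (experimental minus theoretical) against the theoretical mass, it removes a single mass per step, and `max_residual` and `min_masses` are hypothetical stand-ins for the roles played by coefReg1, coefReg2, nbMatchThres1 and nbMatchThres2. At least two matched masses are assumed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def iterative_regression(theoretical, experimental, max_residual, min_masses):
    """Return (kept (theoretical, experimental) pairs, standard deviation).

    The standard deviation is that of the kept mass errors around the final line.
    """
    pairs = list(zip(theoretical, experimental))
    while True:
        xs = [t for t, _ in pairs]
        errors = [e - t for t, e in pairs]  # mass error of each match
        slope, intercept = fit_line(xs, errors)
        residuals = [abs(err - (slope * x + intercept)) for x, err in zip(xs, errors)]
        worst = max(range(len(pairs)), key=residuals.__getitem__)
        # Stop when nothing needs to be eliminated or the minimum size is reached.
        if residuals[worst] <= max_residual or len(pairs) <= min_masses:
            break
        del pairs[worst]  # eliminate the farthest mass, then refit
    variance = sum(r * r for r in residuals) / len(residuals)
    return pairs, variance ** 0.5
```

A large elimination threshold, as suggested by the learned values of coefReg1, keeps most masses in the fit; distant matches then remain in the peptide list but increase the standard deviation and therefore lower the level 2 score.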
The values of parameter power suggest that the weight given to the parameters of level 1 compared to those of level 2 must be higher than what was allowed in this experiment. A larger variation interval has therefore been permitted for this parameter in the new experiment currently under way. The high variation of the values of the calibFact parameter confirms the limited role of the calibration, due to the use of linear regression.

Due to the small number of generations calculated, the variability of the other parameters cannot be clearly explained yet. They may not have converged at the time of writing, but one can probably say that they are less discriminant than those presented before. A more global conclusion is that the division of the score calculation into levels, which makes it possible to consider parameters at various steps of the search process, is very important to resolve conflicts between exploratory and filtering effects. The exploratory effects can be used efficiently (if they are not too costly, as is the case for missed cleavages), provided that strong enough filtering effects are present later to compensate for them. Thus, we obtain a search algorithm that studies a maximum of candidate proteins, while preserving sufficient discrimination power to bring out the right protein.

3.3 Comparison

We have developed an identification tool, PeptIdent2, based on the method presented in this paper. In order to validate the method, we have undertaken a comparative study of the quality of protein identification obtained by several identification tools. We compare the results of PeptIdent2 with those of PeptIdent, Mowse, ProFound, PeptideSearch and MS-Fit (see section 1.2).
For this, we take a set of ten mass spectra of proteins whose identifications have been confirmed by microsequencing. For each protein, we show in Tables 1 and 2 the identification result for each of the algorithms. For each identification we also give the score value (in parentheses) of the first candidate protein followed by either the score value of the second candidate protein (if the first one is the right protein) or otherwise, the rank and, in parentheses, the score value of the right protein. The right protein is always displayed in bold type. The notation Xex means that the score of the corresponding protein is equal to that of X other proteins, the algorithm not being able to give a clear discrimination. Finally, we note '-' if the right protein was not found amongst the first twenty candidate proteins.

For all these experiments, the parameters used in all identification programs were identical, whenever these parameters were available in the respective tool. The selected species was mouse, the allowed Mr variability was ±50%, the allowed pI variability was ±1, the minimum number of matched masses was 3, the maximal tolerance for masses was 0.3 dalton, at most one missed cleavage was allowed and the modifications taken into account were cysteine carboxymethylation and oxidized methionines. Table 1 gives "raw" results, that is, without user interpretation. In this table the databases used were SWISS-PROT and TrEMBL for PeptIdent and PeptIdent2, SWISS-PROT for MS-Fit, OWL for Mowse, nrdb for PeptideSearch and NCBInr for ProFound. Table 2 gives the results after a first analysis by an expert user in our laboratory, in particular to remove the proteins from species that did not correspond to the search, as Mowse, ProFound and PeptideSearch do not allow the search to be narrowed down by species. The TrEMBL database was also removed for the PeptIdent and PeptIdent2 tools, to allow a better comparison with MS-Fit, which cannot use TrEMBL.

The first thing we notice is the good identification obtained by PeptIdent2 in both tables. In the second table, the right protein was identified in first place in 9 out of 10 cases, and with a large score discrimination (at least fivefold) in 6 out of 10 cases. The only protein that was not correctly identified was P10639, which ranked second in the list of results, with a score quite close to that of the first protein.
No other identification program correctly identified this protein, except for PeptIdent, which put it in first place together with six other proteins with identical scores. PeptIdent globally allowed a good identification when the TrEMBL database was not used, but with a much less clear discrimination than PeptIdent2, and many proteins were attributed identical scores. The other programs obtained rather variable results, the best being ProFound and MS-Fit. One can however notice that programs that do not allow the selection of a species (Mowse, ProFound and PeptideSearch) give result lists that are much larger, thus requiring a much longer manual analysis to select the right protein in the list. Moreover, in this case, the risk is higher that the right protein does not appear at all in the list of results. One can also remark that programs that use only the number of matched peptides as their score (PeptIdent and PeptideSearch) have a much weaker discrimination power than the other ones and more often find proteins with identical scores, making the interpretation of the results by the user more difficult.

We are now following this comparative study with a second one on a larger set of proteins and with various species in order to obtain a better validation of the comparison. To preserve a maximum reliability in the comparison results, we plan to use only experiments in which the MS identification has at least been confirmed by microsequencing.

Table 1: Raw comparison of identification tools
(cell format: rank(score)|rank(score); "Xex" after a score marks a tie with X other proteins; "-" means not found)

Protein PeptIdent2              PeptIdent              Mowse                    ProFound                  PeptideSearch   MS-Fit
P56480  1(335.5)|2(9.38)        1(15)|2(7)             1(3.59e+10)|8(1.42e+7)   1(2.9e-1)|5(8.1e-4)       1(13)|3(12)     1(2.39e+5)|2(126)
P02088  1(205.2)|2(23.9)        1(22)|9(9, 4ex)        1(1.5e+6)|2(3.99e+5)     1(5.0e-1, 2ex)|3(8.1e-4)  1(7, 2ex)|3(6)  1(4.83e+6)|2(1.62e+3)
P01942  1(9.85)|2(2.76)         -                      -                        1(3.4e-1)|3(1.7e-1)       -               1(65.3)|3(49.8)
P17742  1(9.81)|2(0.29)         1(5)|5(4, 3ex)         -                        -                         1(4, 4ex)|5(3)  1(189)|2(19.9)
P43024  1(16.21)|2(12.72)       1(18)|18(6, 3ex)       1(3.67e+5)|8(2.88e+4)    1(3.4e-1)|14(3.3e-4)      1(5, 8ex)|9(4)  -
P10639  1(2.31)|2(1.72)         -                      -                        -                         -               -
P12787  1(41.54)|2(10.61)       -                      -                        1(4.9e-1, 2ex)|3(4.3e-3)  1(8)|3(7, 2ex)  -
P27773  1(209.69)|2(11.82)      1(12, 2ex)|2(11)       -                        1(3.2e-1)|4(1.6e-1)       1(12)|2(11)     1(7.15e+4)|2(394)
P27773  1(557.6)|2(121.2)       1(27)|17(15, 4ex)      -                        1(1.0)|2(1.9e-5)          -               1(5.81e+4)|2(2.96e+4)
P38647  1(737.0)|2(84.62)       1(19)|5(13, 2ex)       1(1.18e+12)|4(1.58e+11)  1(2.1e-1, 2ex)|3(1.9e-1)  -               1(1.99e+6)|2(8.23e+3)

Table 2: Comparison of identification tools after user analysis
(same cell format as Table 1)

Protein PeptIdent2              PeptIdent              Mowse                    ProFound                  PeptideSearch   MS-Fit
P56480  1(335.5)|2(9.38)        1(15)|2(6)             1(1.42e+7)|2(9.2e+4)     1(8.1e-4)|2(1.1e-21)      1(12)|2(5)      1(2.39e+5)|2(126)
P02088  1(205.2)|2(23.9)        1(10)|2(5)             1(1.5e+6)|2(1.81e+5)     1(5.0e-1, 2ex)|3(8.1e-4)  1(7, 2ex)|3(5)  1(4.83e+6)|2(1.62e+3)
P01942  1(9.85)|2(1.11)         1(5)|2(4, 4ex)         -                        1(3.4e-1)|2(1.7e-1)       -               1(65.3)|3(49.8)
P17742  1(9.81)|2(0.19)         1(4)|2(3)              -                        -                         1(4, 4ex)|5(3)  1(189)|2(19.9)
P43024  1(16.21)|2(12.72)       1(6)|2(4)              1(4.45e+4)|5(2.88e+4)    1(1.1e-1)|12(3.3e-4)      1(5, 6ex)|9(4)  -
P10639  1(2.31)|2(1.72)         1(3, 7ex)              -                        -                         -               -
P12787  1(41.54)|2(10.61)       1(8)|2(7)              -                        1(4.9e-1, 2ex)|3(4.3e-3)  1(7)|2(5)       -
P27773  1(209.69)|2(11.0, 2ex)  1(12)|2(9, 2ex)        -                        1(2.3e-1)|2(1.6e-1)       1(12)|2(11)     1(7.15e+4)|2(394)
P27773  1(557.6)|2(121.2)       1(14, 3ex)|2(13, 2ex)  -                        1(1.0)|2(1.9e-5)          -               1(5.81e+4)|2(2.96e+4)
P38647  1(737.0)|2(23.02)       1(15)|2(10, 5ex)       1(1.58e+11)              1(2.1e-1, 2ex)|3(9.9e-2)  -               1(1.99e+6)|2(8.23e+3)

4 Conclusion

Protein identification and characterization is one of the most essential tasks performed in proteome research. The currently most widely used identification method compares the masses obtained from an MS spectrum of an enzymatically digested protein with the theoretical masses of proteins contained in an in silico digested protein sequence database. The precise determination of the peptide masses in the spectra and a highly discriminating mass comparison algorithm are therefore the keys to the accurate identification of proteins. We have developed a new tool to identify proteins from their peptide mass fingerprints. It comprises a fast and precise peak detection algorithm, as well as a new mass comparison and identification program, which is based on an advanced scoring method, both procedures being validated by an automatic learning algorithm. The analysis of the thresholds associated with the peak detection has revealed that it is preferable to be only slightly selective in the choice of peaks in the mass spectrum, in order to avoid the loss of apparently fictitious peaks that might eventually prove useful, provided the identification algorithm is able to discriminate "false" peaks from real ones.
Our identification algorithm has proven to be robust enough in this respect. Also, the learning procedure has confirmed the advantage of a scoring scheme based on the balance between exploratory effects and filtering effects of parameters, and on their classification into several levels. An evaluation of our new identification tool was carried out with a test set of known proteins, and showed that it yields a significant gain in the discrimination of the correct protein, in comparison to other identification algorithms.

This work is now being extended by the development of a new version of the learning algorithm that will be able to classify the proteins in the learning set simultaneously with the calculation of the parameter weights. This will determine several subsets of the parameter space, thus allowing for an optimal discrimination of the scores. The goal is to determine several sets of parameters that will optimally discriminate the scores, no longer for all proteins, but rather for one subset of proteins that correspond to a specific value of one of the experimental parameters (species, Mr, pI, ...). Our score calculation will also be extended to the contextual level within the frame of the development of our molecular scanner. In addition, a new intermediate level, the "correlation level", will be introduced between the protein and the contextual level. It will consider information from several experiments carried out under different experimental conditions, producing several fingerprints of the same sample. The correlation of these data will then validate the information obtained from the preceding levels. We thus expect to further improve the efficiency of our protein identification method.

Acknowledgement

This work was supported by the Swiss National Fund for Scientific Research (grant 31-52974.97) and the Helmut Horten Foundation. The authors would like to thank Dr. Eva Jung for useful discussions and Luisa Tonella, Gerald Rosselat, Salvo Paesano and Abderrahim Karmime for preparing the samples and testing the software.

References

[1] M. R. Wilkins, J.-C. Sanchez, A. A. Gooley, R. D. Appel, I. Humphery-Smith, D. F. Hochstrasser, and K. L. Williams. Progress with proteome projects: why all proteins expressed by a genome should be identified and how to do it. Biotechnol. Genet. Eng. Rev., 13:19-50, 1995.
[2] P. Edman and G. Begg. A protein sequenator. Eur. J. Biochem., 1:80-91, 1967.
[3] M. Karas, U. Bahr, and U. Giessmann. Matrix assisted laser desorption ionization mass spectrometry. Mass Spectrom. Rev., 10:335-357, 1991.
[4] D. C. Muddiman, A. I. Gusev, and D. M. Hercules. Application of secondary ion and matrix-assisted laser desorption-ionization time-of-flight mass spectrometry for quantitative analysis of biological molecules. Mass Spectrom. Rev., 14:383-429, 1995.
[5] J. B. Fenn, M. Mann, C. K. Meng, S. F. Wong, and C. M. Whitehouse. Electrospray ionization - principle and practice. Mass Spectrom. Rev., 9:37-70, 1990.
[6] R. Cotter. Time-of-Flight Mass Spectrometry. ACS Professional Reference Books, 1997.
[7] C. Koster, M. S. Kahr, J. A. Castero, and C. L. Wilkins. Mass Spectrom. Rev., 11:495, 1992b.
[8] R. G. Cooks, S. H. Hoke, K. L. Morand, and S. A. Lammert. Int. J. Mass Spectrom. Ion. Proc., 118:1-36, 1992.
[9] M. R. Wilkins, K. L. Williams, R. D. Appel, and D. F. Hochstrasser. Proteome Research: New Frontiers in Functional Genomics. Springer, 1997.
[10] M. R. Wilkins and A. A. Gooley. Proteome Research: New Frontiers in Functional Genomics, chapter 3, pages 35-61. Springer, 1997.
[11] A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res., 27:49-54, 1999.
[12] D. J. C. Pappin, P. Hojrup, and A. J. Bleasby. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol., 3:327-332, 1993.
[13] W. Zhang and B. T. Chait. Protein identification by database searching: A bayesian algorithm. In The 43rd ASMS Conference on Mass Spectrometry and Allied Topics, Atlanta, Georgia, 1995.
[14] G. H. Gonnet. A tutorial introduction to computational biochemistry using darwin. Technical report, E. T. H., Zürich, Switzerland, November 1992.
[15] U. K. Laemmli. Nature, 227:680-685, 1970.
[16] D. F. Hochstrasser, M. G. Harrington, A. C. Hochstrasser, and M. J. Miller. Anal. Biochem., 173:424-435, 1988.
[17] R. J. Klimowski, R. Venkataraghavan, F. W. McLafferty, and E. B. Delany. A small on-line computer system for high-resolution mass spectrometers. Organ. Mass Spectr., 4:17-39, 1970.
[18] Y. N. Sukharev and Y. S. Nekrasov. The reduction of raw mass spectra to monoisotopic form. Organ. Mass Spectr., 11:1232-1238, 1976.
[19] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge University Press, 1995.
[20] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[21] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Machine Intell., PAMI-8:679-697, 1986.
[22] S. Gay, P.-A. Binz, D. F. Hochstrasser, and R. D. Appel. Modeling peptide mass fingerprinting data using the atomic composition of peptides. Same issue, 0-0-0, 1999.
[23] P. A. Binz, M. Müller, D. Walther, W. V. Bienvenut, R. Gras, C. Hoogland, G. Bouchet, E. Gasteiger, R. Fabbretti, S. Gay, P. Palagi, M. R. Wilkins, V. Rouge, L. Tonella, S. Paesano, G. Rossellat, A. Karmime, A. Bairoch, J. C. Sanchez, R. Appel, and D. F. Hochstrasser. A molecular scanner to highly automate proteomic research and to display proteome images. Anal. Chem., in press.
[24] J. Kyte and R. F. Doolittle. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157(1):105-132, 1982.
[25] D. E. Goldberg. Genetic Algorithm in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[26] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, 1996.
[27] M. Mann. Microcharacterisation of Proteins, chapter VI.2. Wiley-VCH, 1994.

[Figure 7 plot panels, one per quantity, each plotted over generations 5 to 20: fitness, emax, hmin, CalibFact, minMatch, deltaMass, coefReg1, coefReg2, coefmiss, coefmodif, MWcoef1, MWcoef2, MWcoef3, MWcoef4, MWthres1, MWthres2, MWthres3, MWthres4, PIcoef1, PIcoef2, PIcoef3, PIcoef4, PIthres1, PIthres2, PIthres3, PIthres4, PIthres5, nbMatchThres1, nbMatchThres2, power.]

Figure 7: Learning of parameters. The x axes correspond to the generations and the y axes to the values of the parameters.
