NITPICK: peak identification for mass spectrometry data

BMC Bioinformatics

Methodology articleNITPICK: peak identification for mass spectrometry dataBernhard Y Renard†1,2, Marc Kirchner†1,2, Hanno Steen3, Judith AJ Steen4

and Fred A Hamprecht*1,2

Address: 1Interdisciplinary Center for Scientific Computing, University of Heidelberg, Heidelberg, Germany, 2Department of Pathology,Children’s Hospital Boston, Boston, MA, USA, 3Department of Pathology, Harvard Medical School and Children’s Hospital Boston, Boston, MA,USA and 4Department of Neurobiology, Harvard Medical School and Department of Neurology, Children’s Hospital Boston, Boston, MA, USA

E-mail: Bernhard Y Renard - [email protected]; Marc Kirchner - [email protected];Hanno Steen - [email protected]; Judith AJ Steen - [email protected];Fred A Hamprecht* - [email protected]*Corresponding author †Equal contributors

Published: 28 August 2008 Received: 8 April 2008

BMC Bioinformatics 2008, 9:355 doi: 10.1186/1471-2105-9-355 Accepted: 28 August 2008

This article is available from: http://www.biomedcentral.com/1471-2105/9/355

© 2008 Renard et al; licensee BioMed Central Ltd.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: The reliable extraction of features from mass spectra is a fundamental step in theautomated analysis of proteomic mass spectrometry (MS) experiments.

Results: This contribution proposes a sparse template regression approach to peak picking calledNITPICK. NITPICK is a Non-greedy, Iterative Template-based peak PICKer that deconvolvescomplex overlapping isotope distributions in multicomponent mass spectra. NITPICK is based onfractional averagine, a novel extension to Senko's well-known averagine model, and on a modifiedversion of sparse, non-negative least angle regression, for which a suitable, statistically motivatedearly stopping criterion has been derived. The strength of NITPICK is the deconvolution ofoverlapping mixture mass spectra.

Conclusion: Extensive comparative evaluation has been carried out and results are provided forsimulated and real-world data sets. NITPICK outperforms pepex, to date the only alternate,publicly available, non-greedy feature extraction routine. NITPICK is available as software packagefor the R programming language and can be downloaded from http://hci.iwr.uni-heidelberg.de/mip/proteomics/.

BackgroundThe reliable extraction of proteomic features fromcomplex biological mixtures is of utmost interest forunraveling the intricate biomolecular interplay at theheart of many systems biology research questions. In thiscontext, mass spectrometry (MS) has become a keytechnology which provides peptide and protein identi-fication, modification characterization and quantifica-tion capabilities. In contrast to gene expressionmicroarray technologies, MS analysis yields a directview on the whole set of proteins (the proteome) present

in the system under investigation and can thus con-tribute to a richer picture of protein interaction, real-timedynamics and their regulation [1]. MS contributes toclinical research and the diagnosis process [2], it is usedto detect, grade and characterize cancer diseases [3], itserves as a general purpose tool for microorganismcharacterization [4, 5] and provides sensitive and specificmeans for pharmaceutical quality control.

MS experiments typically contain tens to thousands ofspectra, each of which holds intensity information for

of 16(page number not for citation purposes)

BioMed Central

Open Access

http://creativecommons.org/licenses/by/2.0

http://hci.iwr.uni-heidelberg.de/mip/proteomics/


http://www.biomedcentral.com/

http://www.biomedcentral.com/info/about/charter/

tens to hundreds of thousands of mass channels. Thesedata stem from a set of different mass analysistechnologies, combining chemical separation procedures(chromatography), ionization methods (electrosprayionization, matrix-assisted laser desorption/ionization)and mass analyzers (time-of-flight, quadrupole, ioncyclotron motion). Despite physicochemical preproces-sing and the availability of high mass resolutioninstruments, spectra which stem from complex biochem-ical mixtures (e.g. cell lysate, blood or serum) frequentlyexhibit overlapping isotope distributions of independentmolecular species. Moreover, in many quantitative MSapproaches, these mixtures are present by design andtheir manual unmixing, quantification and interpreta-tion is tedious or unfeasible.

As a consequence, the automated analysis and inter-pretation of multicomponent mass spectra is highlydesirable. An (incomplete) set of challenges for MSfeature extraction includes the sheer data set sizes,mixtures of isotope patterns, the presence of multiplecharge states, chemical and detector noise, species-dependent ionization efficiencies, chemical reproduci-bility and deviations from detector linearity. Among allrequirements that derive from these challenges, it isimportant to emphasize the crucial nature of the featureextraction step: as all subsequent analysis steps rely onthe set of extracted features, meaningful biologicalconclusions are highly dependent on the adequacy andreliability of the feature extraction method.

Apart from few special alternate approaches [6, 7], allautomated methods for feature extraction from isotope-resolved mass spectra compare the observed (experi-mental) spectral pattern to a set of precalculatedtheoretical isotope patterns. The calculation of isotopepatterns is based on the estimation of average stoichio-metries for a particular molecular mass (averagine [8] andrelated methods [9]) or on relative isotope abundanceestimation [10] or on protein database-driven meanisotope distribution calculation [11]. The computationof isotope patterns is based on efficient implementations[12-14] of Yergey's original polynomial method [15, 16].

Comparison of theoretical and experimental isotopedistributions is typically accomplished based on sub-tractive fitting and peak selection algorithms, attemptingto sequentially detect the dominant components in amixture spectrum. These subset selection methodsattempt to determine a small set of basis functionscapable of approximating the observed signal well.Facing the infeasibility of an exhaustive search over allpossible subsets of explanatory basis functions, theyapply greedy search strategies. Here, "greediness" refersto the fact that these approaches consistently

overestimate individual feature contributions and areincapable of excluding a basis function once it has beenincluded in the active set. Hence, although providingsparseness, they are not globally optimal. In the contextof mixture modeling of mass spectra, these approachesamount to sequential isotope distribution templatematching procedures [6, 8-11, 17-22]. Fitting is carriedout via c2 distances [8, 20], least squares [9-11, 17,21-23], weighted least squares [19], or cross-correlation[18, 24]. The automatic determination of the charge stateassociated with an isotope pattern present in anexperimental spectrum is based on cross-correlation[19, 25] or on dot products in Fourier space [25, 26],exploiting the shift theorem of the Fourier transform.There are only few [27] non-greedy feature selectionalgorithms and mixture model approaches for MS data[28-31]. Among these, Matching[28] and Roussis'method [29] rely on manual preselection of contributioncandidates. Sparse non-greedy procedures include pepex[30] and Du's method [31]. The pepex approach issuitable for single charge data and is based on a non-negative sparse regression scheme, with an approximateL0-norm constraint. Du and Angeletti [31] perform datareduction prior to feature extraction and apply asparseness-promoting variable selection scheme [32].With the exception of Du's [31] and Kaur's [19]methods, none of the mentioned mixture modelapproaches provide support for the detection of a sparseset of a priori unknown peptide peaks under an arbitraryset of charge states. Du's method [31] and NITPICKovercome Kaur's greedy iterative weighted least squaresfitting approach. In contrast to [31], NITPICK does notrely on a heuristic parameterization and is instead basedon statistical model selection, making use of analgorithmically more efficient non-greedy sequentialfeature selection procedure with a statistically motivatedtermination criterion. NITPICK was designed to supportthe calculation of accurate monoisotopic peak lists fromraw mass spectra and was specifically tailored to caseswhere the raw spectra stem from unknown, possiblyoverlapping experimental isotope patterns of multiplecharge states.

The methods section details the mixture modelingapproach, fractional averagine for improved stoichiome-try estimation and data fitting, and our main contribu-tion, a computationally efficient method for improvednon-negative feature selection and the correspondingstatistical complexity estimation approach in conjunc-tion with the derivation of a lower bound for earlytermination. Comparative results on simulated and real-world data sets are given in the results and consequentlydiscussed. Eventually, we conclude and offer perspec-tives. Derivations of the formulas used in the mainarticle are available in the appendix.

BMC Bioinformatics 2008, 9:355 http://www.biomedcentral.com/1471-2105/9/355


MethodsThe NITPICK algorithm (cf. figure 1) models anobserved mixture spectrum as a linear combination oftheoretical isotope distribution patterns. Statistically,finding a sensible parameterization of this mixturemodel amounts to a constrained regression problem inwhich we seek to minimize the raw signal reconstructionerror in a least-squares sense while adhering to a set ofadditional constraints. Such an approach requires reli-able underlying isotope patterns, and we propose animprovement for the well-known averagine model toachieve this goal. We subsequently introduce NITPICK'siterative feature selection procedure, which employs anovel, non-greedy isotope distribution selection methodand is based on a statistically motivated terminationcriterion, attempting to eliminate premature or lateiteration termination.

Mixture modelWe assume that observed spectra are available in adiscrete (not necessarily equispaced) mass binning

scheme defined by a mass vector m= (m1, m2, ..., mN)T

and represent a raw multicomponent mass spectrum bya vector s of size N × 1, where si corresponds to theabundance observed in the ith mass bin mi. In practicalapplications, the vector s may also result from preproces-sing steps such as relevant region detection [19] and maythus represent only a part of a complete raw spectrum.The basic assumption behind the mixture modelapproach is that s be a linear combination of massspectrum abundances of K pure components ji,

s c= ==∑ ci i

k

K

f FF1

. (1)

Each of the concentration coefficients ci, i = 1, ..., K isassociated with a column ji of the N × K model matrixF. We regard these columns as basis functions and theirelements jji correspond to the mass spectrum abundanceexpected in the jth mass bin mj of the ith purecomponent ji.

For the estimation of the concentration vector c, the modelmatrix F has to be available, and in general this is not thecase. One hence resorts to approximating the basisfunctions by a large set of theoretical isotope distributions(i.e. isotope abundance patterns) densely spread over theprespecified mass/charge binning scheme. Effectively, thisrecasts the original peak picking task into the framework ofa feature (i.e. basis function) selection problem.

Model matrix calculationGiven an elemental stoichiometry, the correspondingtheoretical isotope distribution is well-defined and caneasily be calculated [12-15]. Hence, if a prespecified setof stoichiometries of potential pure components isavailable, the calculation of the respective set oftheoretical isotope distributions (including chemicalmodifications and multiple charge states) is straightfor-ward. These isotope distributions are subsequentlyconvolved with instrument-specific, possibly mass-dependent peak shape functions, yielding the basisfunctions ji.

Fractional averagineIn many practical applications prior knowledge aboutpotential components is not at hand. Thus, one needs toresort to expected average stoichiometry estimates as abest-effort approximation. In this case, the quality of thefeature selection procedure is highly dependent on thequality of the stoichiometry model. We thereforeextended the widely used averagine approach [8] toamend its discrete and discontinuous nature, gainingmodels without mass errors and improved true isotopedistribution reconstruction properties. Fractional

Figure 1NITPICK workflow overview. Raw spectrumpreprocessing, relevant region detection, region-wise peakpicking, merging of detected peaks and peak listpostprocessing. At the heart of the method lies an iterativefeature selection procedure controlled by a statisticaltermination criterion, as illustrated by the large box in thecenter. As a tightly interconnected prerequisite to the mainworkflow, the column on the left depicts the steps requiredfor the calculation of the regression model matrix. Numbersin parentheses give the manuscript sections in which thespecific steps are detailed.



averagine provides a real-valued element stoichiometryr = (r1,...,r5)T according to the mapping f: R Æ R5

between a mass value and the number of element atomsin an averagine (H7.75833C4.9384N1.35777O1.4773S0.0417)molecule. The calculation of the theoretical isotopedistribution of r is based on the observation that isotopeabundances follow a multinomial distribution [33], andthat fractional numbers of trials in a multinomial can bemodeled as linear interpolation between the probabilityfunctions of the multinomials parameterized with thesurrounding integers (see appendix A). For computa-tional ease, calculations are carried out in the realm ofthe corresponding moment generating function (MGF)[34] of the multinomial probability mass function. Forthe ith stoichiometry element, the MGF given ri can befactorized according to

M t t

p e p e p

x k i

tk

tk

ki i i

1 1

1 11 1

, , |… −

−[ ]+ −⎢⎣ ⎥⎦( )

( )= + + +⎡⎣

⎤⎦

=

−

rr r r

MM t t M t tx k i x k i i1 21 1 1 1, , | , , |… …− −⎢⎣ ⎥⎦( ) − ⎢⎣ ⎥⎦( )( )r r r

(2)

where pl is the probability of occurrence of the lthisotope pll

k ==∑ 11

, x = (x1, ..., xk)T denotes the number

of times a particular isotope is chosen xl il

k ==∑ r1

andt = (t1, ..., tk)

T is the corresponding variable of the MGF.By rearrangement of the MGFs of all elements, it ispossible to separate integer and real-valued contribu-tions, yielding the common averagine model rr = (Îr1˚,Îr2˚, ..., Îr5˚)T for the integers and the fractionalaveragine correction rr = (r1 - Îr1˚, r2 - Îr2˚, ..., r5 -Îr5˚)T for the remaining fractional masses. The theore-tical isotope distribution for r i is given by the linearcombination of a peak of intensity one at mass zero andthe theoretical isotope distribution of the ith averagineelement, weighted by 1 - r i and r i , respectively.Thus, efficient calculation of the theoretical isotopedistribution of the stoichiometry rr is carried out basedon the Mercury7 algorithm [14], and the theoreticalisotope distribution for the fractional stoichiometry r issubsequently obtained with five additional convolutionsteps.

Basis function selectionGiven the set of basis functions F = [j1j2 ... jk], basisfunction selection and subsequent determination of thecontribution coefficients ci provides a solution to eq. (1).Thus, as the modeling parameters and, in particular, themonoisotopic masses for all basis function are known,one can determine which isotope distributions arepresent and in what abundance (assuming ∑kjki = 1).

In practice, basis functions are calculated for eachpossible monoisotopic mass and each expected chargestate, yielding model matrices F with KN (in the case ofone basis function per mass/charge bin and charge, wehave K = nZN, where nZ corresponds to the number ofcharge states observable in the experiment; hence, fornZ > 1, there exists an infinite number of solutions foreq. (1)). This is a problem intrinsic to the proposedmixture modeling approach and has been observedpreviously [23, 28, 30]. The least absolute shrinkage andselection operator (LASSO) [32] enjoys favorable proper-ties of regularization and subset selection. Because theLASSO is capable of shrinking coefficients to exactlyzero, it offers a non-greedy way to gain sparse models.The LASSO solution c for equation (1) is given by

ˆ arg min{|| || }

. . | | ,

c s cc

= −

≤=∑

FF 2

1

s t c ti

i

K (3)

where t ≥ 0 is a user-defined tuning parameter [31, 32].Mass spectra intensities si, basis function values jji, andbasis function contributions ck are strictly non-negative,thus adding a non-negativity constraint to the solutionspace of c , yielding

ˆ arg min{|| || }

. . | | , .

c s cc

= −

≤ ≥=∑

FF 2

1

0s t c t ci i

i

K (4)

For fixed t, this is a quadratic programming problem withlinear inequality constraints which can be solved by anactive set algorithm, sequentially introducing the inequal-ity constraints and seeking a feasible solution satisfyingthe Kuhn-Tucker conditions [32, 35, 36]. Equation (4)corresponds to ˆ( ) arg min {|| || | |}c s ccl l= − + =∑FF 2

1cii

K

with ci ≥ 0 where the parameter t is related to theLagrangian multiplier l which determines the number offree parameters df(l) in the linear model [32, 36-38].

Common procedures for the optimal selection of l ordf(l) are based on the minimization of the predictionerror. This involves estimation of training optimism viaCp-statistics, the Akaike Information Criterion (AIC), orthe Bayesian Information Criterion (BIC) [37]. Alterna-tively, direct estimation of prediction error can be carriedout via cross-validation or generalized cross-validation(GCV) [37]. All these methods require the LASSOtrace c (ll), where ll Œ L and = { ,..., }| |l l1 definesthe set of LASSO regularization parameters for which theprediction error is calculated. In general, the calculationof the LASSO trace is computationally intensive and it is



not clear how the elements of L should be selected [36].Least angle regression (LARS) [39] is an algorithmicallydifferent approach to variable selection which can bemodified such that the LARS algorithm implements thenon-negative LASSO from equation (4). The LASSO-modified LARS is a constructive active set procedure whichconstructs the LASSO regularization path in a stepwisemanner. Denote by (l) the set of indices i Œ {1. ..., K} ofthose ji which are in the active set for a particular choice ofl. Starting from l = ∞ and letting l Æ 0, the algorithmcomputes non-negative LASSO solutions for all l for whichthe active set changes, thus implicitly defining L. TheLASSO-modified LARS guarantees (lj) ≠ (lj+1), but itallows for the deletion of previously selected basisfunctions, and hence | (lj) | need not increase mono-tonically for increasing j. Basis functions can be required toenter the active set in their predefined directions [39] whichallows the implementation of a non-negativity constraint.Necessary matrix inversions are constrained to | (l) | ×| (l)|-sized scatter matrices FF FF ( ) ( )l l

T and can beimplemented as iterative updates, thus the procedure iscomputationally efficient.

Complexity estimationIt is desirable to terminate active set updates as soon as thebasis functions in the active set are able to explain theobserved data sufficiently well, i.e. until the increase inexplanatory power does not justify the increase in modelcomplexity anymore. We now describe amodification to thenon-negative LASSO-modified LARS, which enables us tosequentially build a BIC trace along the LASSO regulariza-tion path and to identify minima along this trace. Upontermination, the proposed procedure returns the estimatec and the set = >{ | }i

ic 0 of active basis functions.

BIC measureThe LARSCp-type risk reestimation formula [39] for optimalselection of l does not hold under the non-negative LASSOmodification. Instead, we recalculate a BIC measure

BIC

MSE

( ) || || ( ) log ,( ) ( )

. ( )

ls

ll l

l

= − +12

2s cFF q

N

df N (5)

in each LARS iteration [40]. For the calculation of theunbiased training error MSE(l) in eq. (5) we require anadditional non-negative least squares fit

ˆ arg min || ||

. . ˆ .

( ) ( ) ( )

( )

( )

c s cc

l l l

l

l

q

q

ic

= −

( ) ≥

FF 2

0s t(6)

The noise variance s2 in eq. (5) is estimated as the meanresidual sum of squares of a low-bias non-negative leastsquares estimate [37].

Estimation of df(l)The calculation of BIC(l) in eq.(5) requires an estimatefor the degrees of freedom df(l), which can be obtainedvia the generalized degrees of freedom (GDF) [38]. TheGDF of an NN-LASSO-modified LARS model based onan active set (l) are given by

GDF( ) .( ) ( )ls

l l= 12s cT qFF (7)

Because the coefficients ˆ ( )c lq

i( ) > 0 are non-negative,

the estimate ˆ ( )c lq solves

ˆ arg min || ||( )ˆ

( ) ( )( )

c s cc

l l ll

lq q q

ii

K

qc= − + ( )⎧

⎨⎪

⎩⎪

⎫⎬

=∑FF 2

1

⎪⎪

⎭⎪(8)

which is differentiable with respect to ˆ ( )c lq

i( ) . Setting

the derivative to zero, we obtain

ˆ ( ) ( ).( ) ( ) ( ) ( ) ( )c s l l l l llq T T= −−FF FF FF1 12

1 (9)

Hence, given an active set (l), the generalized degreesof freedom from eq. (7) can be written as

GDF( )

( )

( ) (

( ) ( )

( ) ( ) ( ) ( )

l

s

s

l l

l l l l

=

= −

s c

s

T q

T T

12

12

1

FF

FF FF FF FF

TT s − 1

2l l1( ))

(10)

which is monotonously increasing for decreasing l (seeappendix B for a proof).

Optimal terminationThe minimal possible training error of the model isattained when all variables are in the active set, in whichcase the respective coefficients c q are given by c q = argminc ||s - Fc||2 subject to ci ≥ 0, and the correspondingerror is MSE = −1 2

Nq|| ||s cFF . Thus, a lower bound for

BIC(l) is given by

BIC MSE GDFminN

N( ) ( ) logls

l= +2 (11)

(see appendix C for a proof). In general, BIC(l) will haveseveral minima for increasing values of GDF(l), hencewe track the minimum BIC(lmin) through the NN-LASSO-modified LARS cycles and accept lmin as aminimizer as soon as the lower bound BICmin(l) of asubsequent LARS step exceeds the current best estimateBIC(lmin), i.e. BIC(lmin) < BICmin(l) (see figure 2).



Regression on selected modelsThe sum constraint in equation (4) is ultimatelyresponsible for the sparseness property of the LASSO.Its regularizing effect is similar to the one of theregularization term found in ridge regression, especiallywith respect to the fact that all LASSO estimates c i , i =1, ..., K are subject to shrinkage [32, 37] and representbiased versions of the least squares estimates. Given anactive set , the shrinkage bias on the c i can effectivelybe removed by introducing a subsequent non-negativeleast squares regression step after the basis functionshave been selected by the LASSO procedure [32]. Thisalso holds true for the NN-LASSO-modified LARSprocedure, and the corresponding unbiased quanti-fication estimate c

q is given by equation (6) with (l) = .

PostprocessingThe estimate c is subject to modeling errors and theseshortcomings lead to suboptimal NN-LASSO-modifiedLARS estimates and active sets. In particular, theestimation depends on the match between the observedand theoretical peak shape function. Especially in highmass resolution experiments, one can frequently observe

spurious peak detections in bins directly adjacent tomonoisotopic mass bins of true peaks [30]. A possibleremedy is a local maximum detection implemented as apostprocessing filter �(·) applied to the active basisfunction index set :

′=

= ∈ = ∈

j

n

( | )

{ |( ) max{( ) | ( )}}

G

j c c l jqj

ql G

(12)

where n G k jj k b b G( ) { || | }= ∈ − ≤ − 12 defines an m/z-

neighborhood of size G around each peak and bj is themass/charge bin index of the monoisotopic mass m0 ofthe jth theoretical isotope distribution jj. If ≠ ′ ,ˆ ( )c lq is reestimated using eq. (6) with ( )l = ′

ResultsStoichiometry modelsThe fractional averagine stoichiometry model wascompared against the classical averagine model basedon the analysis of their respective approximation errorsusing simulated theoretical peptide isotope distribu-tions.

Data SetAll UniProt (version 51.4.) [41] human proteins weresubjected to in silico tryptic digestion. For each of the Rdigestion product peptides r r R, { ,...,∈ 1 } , exact ele-ment stoichiometries rrr

x and exact theoretical isotopedistributions dr

x were calculated. Peptides with mono-isotopic masses above m/z 5000 were discarded.

Comparison of deviationsClassical and fractional averagine were used to estimateapproximate element stoichiometries rrr and rrr ,respectively, for all peptides r in the data set. Basedon rrr and rrr , the corresponding theoretical isotopedistribution intensity vectors dr

and dr were calculated.Figure 3 shows the cumulative distribution of thesquared differences between the classical averagine andthe true theoretical isotope distribution intensity vectors(|| ||d dr r

x− 22 , dashed black), and fractional averagine

and the true theoretical isotope distribution intensityvectors (|| ||d dr r

x− 22 , solid red).

Peak pickingFor peak picking/feature extraction performance evalua-tion, we determine representative peak picking statistics:we calculate accuracy, sensitivity, specificity, and positiveand negative predictive values on simulation data.Further, and in contrast to previous contributions, weexplicitly perform manual validation on a real-worlddata set.

2 4 6 8 10 12 140

100

200

300

400

500

600

700

800

LARS step/number of mixture components

BIC(λ)GDF(λ)MSE(λ), scaledBICmin(λ)

Figure 2Efficient automated determination of the number ofcomponents in an area with overlapping peaks usingthe BICmin(l) termination criterion. The mean squarederror MSE (scaled, dotted) decreases monotonically over theLARS steps and the generalized degrees of freedom GDF(l)(dashed) increase monotonically. The resulting BIC(l)measure (solid) exhibits a minimum BIC(l9) in the 9th LARSstep and l9 is accepted as a minimizer because the lowerbound BICmin(l10) exceeds BIC(l9) in the 10th LARS step.



Data setsSimulation data setsFor the simulation, all UniProt (version 51.4.) [41]human protein sequences were subjected to in silicotryptic digestion. Simulation sets were generated byrandom drawing of digestion product peptides andintensities. To ensure a fair comparison with the pepexprocedure (which was selected for benchmarking as theonly publicly available procedure implementing non-greedy feature extraction) which is limited to singlycharged data sets, all simulated peptide were endowedwith a single charge. Mercury7 [14] was used for thecalculation of the respective theoretical isotopic distribu-tions. After convolution with an m/z-dependent Gaus-sian aperture function [42], intensity-weighted linearcombinations of peptide spectra were calculated and aPoisson noise model (see appendix D) was applied toobtain spectra of different signal to noise (SNR) ratios.Simulations were performed in the densely populatedm/z 500–700 range (see Additional file 1 for the data sets).

Real-world data setExperiments on real-world data were performed usingBovine Serum Albumin (BSA) LC/(ESI-)MS calibrationdata. The data set was acquired on a QSTAR XL massspectrometer (Applied Biosystems/MDS Sciex) equippedwith microsale capillary HPLC system (Famos Autosampler,LCpackings, Agilent 1100HPLCpump). Amixture spectrumwithmany overlapping peaks was obtained by integration ofthe LC/MS data set over the retention time domain (seeAdditional files 2 and 3). Peak identification was carried outin the m/z 500–700 range and peak shape functions weremodeled according to mass-dependent Gaussian distribu-tions with standard deviations s(m/z) = 0.005 m/z [42].

Performance estimationWe characterize peak picking performance based on a setof measures from statistical test theory, all of whichdepend on the availability of the numbers of truepositives (TP), true negatives (TN), false positives (FP)and false negatives (FN).

Ground truth is based on knowledge of the complete setof peptide signals present in a mass spectrum. Forsimulated data sets, this information is available. In real-world experiments, the definition of ground truth iscomplicated by sample complexity, stochastic samplemodification, non-peptidic components and limiteddynamic range. As a consequence, TNs, FNs and theoverall number of true peaks are not available for real-world data, limiting the available statistical measures topositive predictive values and the ratio of true positives(sensitivity ratios).

Nevertheless, we can determine the number of TPs andFPs in both cases: we check whether a detected peakreally exists and if it has been assigned its correctmonoisotopic mass m0 and charge z. If so, it is countedas true positive (TP) or, otherwise, as false positive (FP).

Simulation dataAs the complete set of simulated peaks is known, theremaining set of undetected peaks can be determined andits members are counted as false negatives (FN). With thetrue number of positives and negatives available thecalculation of the number of true negatives (TN) isstraightforward, thus enabling the use of related statisticaltest error measures for performance characterization:

• accuracy (ACC) measures the rate of correct peak vs. nopeak decisions, i.e. ACC TP TN

TN FP TP FN= ++ + +

• the negative predictive value (NPV) gives the rate atwhich there is no peak at positions where the procedurewas unable to find a peak, NPV TN

TN FN= +

• the positive predictive value (PPV) measures the rate ofcorrect peak detections among all peaks detected by theprocedure, PPV TP

TP FP= +

• sensitivity (SE) measures the method's ability to detecta peak if it exists, SE TP

TP FN= +

• specificity (SP) measures the method's ability tocorrectly identify the absence of peaks in the spectrum,SP TN

TN FP= +

All measures have been computed with and without theapplication of postprocessing.

0.000 0.001 0.002 0.003 0.004 0.005

0.0

0.2

0.4

0.6

0.8

1.0

squared distance from real distribution

cum

ulat

ive

freq

uenc

y

Senko's AveragineFractional Averagine

Figure 3Comparison of the impact of averagine and fractionalaveragine stoichiometry estimation errors on theestimation of theoretical isotope distributions. Thecumulative histograms of least squares deviations from the truetheoretical isotope distribution illustrate the superior overallperformance of fractional averagine (solid line) compared toSenko's classical averagine (dashed line): fractional averaginecauses a 17% decrease in mean squared error magnitude.



Real-world dataResorting to LC/MS data and creating a semi-artificialdata set by integration over the retention time domainwas motivated by the fact that this approach yields a dataset accessible to human manual validation. With LCresolution power available to the human expert (andresorting to comparatively simple mixtures), all peaksdetected in the integrated mixture can still be manuallyverified. Exemplary peak picking results are illustratedbelow.

Comparative resultsPepexWe chose to compare NITPICK to a conceptually similar,model-based approach called pepex [30]. In contrast tomodel-free approaches and in accordance with NITPICK,pepex models observed spectra based on a linear mixturemodel, which is augmented by a complexity constraint.It uses the averagine model to describe unknown featuresand is capable of terminating its feature selection routineafter a sufficient number of basis functions has beenselected. However, as the publicly available implementa-tion of the pepex approach is limited to charge state z = 1data sets, NITPICK comparison against pepex waslimited to the simulated data set.

For the analysis, the pepex algorithm was tailored to theproblem at hand: its parameters were heavily optimizedto maximize peak picking performance on the simula-tion data set. As a consequence, the reported resultsunderestimate the pepex generalization error and over-estimate its performance (see Additional file 4). ForNITPICK, no specific parameter optimization was carriedout, postprocessing was kept to a minimum (G = 3), andthe reported results are representative (see Additionalfile 5).

MarkerViewWe also compared NITPICK's ability to extract peakinformation from a retention time integrated mixturespectrum against the proprietary MarkerView application(Applied Biosystems/MDS Sciex, Concord, Canada)version 1.2, which includes an LC/MS peak pickingalgorithm. In contrast to NITPICK, MarkerView wasprovided with the original LC/MS data set and thus hadretention time information available. Peak picking wascarried out in the m/z 400–1400 range and detectedpeaks were manually validated (see Additional files 6and 7).

DiscussionStoichiometry modelsIn comparison (see figure 3, classical averagine in dashedblack, fractional averagine in solid red), Senko's classical

averagine [25] features a larger number of very smalldeviances from the truth than fractional averagine. Thisis caused by the rounding to integers property of theclassical approach, yielding exact models more often. Atthe same time, the deviance distribution of the fractionalaveragine model has a significantly lighter tail, i.e. themodel generates significantly less stoichiometries whosetheoretical isotope distributions have large deviations.The cumulative distribution based on the fractionalaveragine model approaches 1 more quickly, and its useyields an overall decrease in theoretical isotope distribu-tion deviations. This finding is supported by thecorresponding one-sided non-parametric Mann-Whitneytest (p < 2.6 × 10-11). Because the overall impact on thepeak picking performance depends on the squared meanerror magnitude (7.6 × 10-4 for classical averagine, 6.3 ×10-4 for fractional averagine, corresponding to a 17%decrease for fractional averagine), fractional averagineclearly is the preferable model.

Peak pickingSimulation data setFigure 4 shows the results for the peak detectionperformance analysis. As expected, ACC, NPV and PPVimprove with increasing SNR. Postprocessing causes adecrease in NPV and an increase in PPV for all SNR levelsas the removal of spurious peaks decreases FP but also,erroneously, increases FN. The ACC plot (top left)illustrates the fact that NITPICK is successful at simulta-neously maximizing PPV and NPV. Postprocessing canthen be used to trade specificity for sensitivity assupported by the sensitivity-specificity trace in figure 4(bottom right). Here, each dot marks sensitivity andspecificity of a given NITPICK postprocessing parameter-ization. Lines connect points of different SNRs. Asexpected, the introduction of a postprocessing stepincreases specificity and decreases sensitivity. Furtheranalysis of FNs in the simulated data reveals that falsenegatives are predominantly due to low-intensity com-ponents in complex mixtures (data not shown).

Comparative resultsIn comparison with pepex, NITPICK exhibits betterresults with respect to all statistical measures in figure 4.It is especially obvious that pepex suffers from a severeincrease in false positives (FPs) for very low SNRsituations, yielding significant decreases in accuracy(ACC) and specificity (SP). For PPV, although thepepex approach outperforms NITPICK when no post-processing is applied, it is inferior to the full NITPICKalgorithm with simple spurious peak removal corre-sponding to eq. (12). With respect to sensitivity (SE) andspecificity (SP), figure 4 reveals constant high (above0.99) and superior specificity values for NITPICK at



greatly increased sensitivity. Thus one can conclude thatthe NITPICK algorithm is more sensitive than pepexand, at the same time, provides picked peaks with higherconfidence.

Real-world data setWe give peak picking illustrations for the mass rangesm/z 507–525 (with a zoom on m/z 518–525), m/z 636–646, m/z 695–725 and m/z 775–782, detailing positiveand negative peak picking performance aspects. In them/z 507–525 mass range (figure 5), all picked peakscould be verified, including the monoisotopic masses ofthe mixture distribution with components located at m/z523.23 (z=3) and m/z 523.82 (z=5). Upon re-examina-tion of the raw data, we detected a missed low-intensitypeak at m/z 515.76.

Figure 6 zooms onto two cases of overlapping isotopedistributions in the m/z 518–525 mass range. At m/z518.22 and m/z 519.11 NITPICK resolves two distinctmonoisotopic masses, in spite of their unfavorable massdistance. Although the second isotope peak of thedoubly charged ion with monoisotopic mass m/z

518.22 exhibits a heavy overlap with the monoisotopicpeak of the ion at m/z 519.11, NITPICK is still able tocorrectly detect the monoisotopic peaks of the twoisotope distributions. NITPICK also separates twoisotope distributions located at m/z 523.23 (z=3) andm/z 523.82 (z=4). The detection of the monoisotopicmass at m/z 523.82 is particularly non-trivial because ofits heavy overlap with an isotope peak of the isotopedistribution located at m/z 523.23 and also because thedetected monoisotopic mass peak at m/z 523.82 is notthe most abundant peak within its isotope distribution.

In the m/z 636–646 mass range (figure 7) we observe anexample of incomplete unmixing: the isotope distribu-tion (z=3) with monoisotopic mass located at m/z636.29 heavily overlaps the distribution (z=3) locatedat m/z 636.64 (left triangle marker). The overlap provesinseparable and the monoisotopic mass of the seconddistribution is wrongly detected at m/z 636.96. Further,due to conservative noise level/complexity estimation,the isotope distribution located at m/z 642.33 (righttriangle marker) is not detected. Note that in both of thecorrectly detected distributions located at m/z 636.29and m/z 639.65, the monoisotopic mass peak does notcorrespond to the most prominent peak.

In the m/z 695–725 mass range (figure 8), with oneexception, all detected peaks could be verified. Thewrongly detected peak at m/z 714.29 corresponds to thefirst isotope peak of the isotope distribution located atm/z 713.78 (z = 2). Especially in the m/z 718 to m/z 724region the algorithm proves capable of resolving non-trivial low-intensity mixtures.

SNR

accu

racy

5 10 25 50 100

0.94

0.95

0.96

0.97

0.98

0.99

1.00

NITPICK w/o postprocessingNITPICK w/ postprocessingpepex (optimized)

SNR

nega

tive

pred

ictiv

e va

lue

5 10 25 50 100

0.99

20.

994

0.99

60.

998

1.00

0


SNR

posi

tive

pred

ictiv

e va

lue

5 10 25 50 100

0.2

0.4

0.6

0.8


0.05 0.10 0.15 0.20 0.25 0.30 0.35

0.95

0.96

0.97

0.98

0.99

1−sensitivity

spec

ifici

ty

510

2550

100 5102550100

5

10

2550100


Figure 4Evaluation and comparison with the pepex algorithmon simulated data. Accuracy (top left), negative predictivevalues (top right), positive predictive values (bottom left) andsensitivity-specificity traces (bottom right). Plots showNITPICK results in solid red, NITPICK results withoutpostprocessing in dashed blue and pepex results (optimized,see text) in dashed-dotted black. NITPICK is clearly superiorin terms of accuracy, specificity and sensitivity.

507 512 517 522m/z

inte

nsity

(a.

u.)

507.

8

510.

25

512.

24

518.

2251

9.11

521.

22

523.

2352

3.82

true positive false positive false negative

Figure 5Peak picking in them/z 507-525mass range. Illustration ofobserved (top) and reconstructed (bottom) spectra. Alldetected peaks could be confirmed, including the monoisotopicmasses of the mixture distribution with components located atm/z 523.23(z=3) and m/z 523.82 (z=5).



In them/z 775–782 range (figure 9), the separation of twoheavily overlapping isotope distribution clearly illustratesthe benefits of NITPICK's intensity model-basedapproach to the peak picking/feature extraction problem:the second isotope peak of the isotope distributionlocated at m/z 779.32 (z=2) and the monoisotopic peakof the distribution located at m/z 780.35 (z=2) overlapcompletely and can only be distinguished by takingintensity information into account.

Overall, the results obtained on real-world data are inagreement with simulation results: after manual valida-tion of 192 peaks detected in the real-world dataset, weobserve 127 true positives, yielding a positive predictivevalue of PPV = 66.15%.

Comparison with MarkerViewOn the BSA data set, MarkerView detected 388 peaks, for96 (24.7%) of which charge state information wasavailable. Peaks without charge state assignment werecounted as true peaks if their detected mass/charge ratiowas correct. This resulted in 205 true positives for 82(40.0%) of which charge state information was avail-able. In comparison to NITPICK, this yields a sensitivity

ratio of SER SESE= = =Mar View

NITPICKker .205

127 1 61 and a positive

predictive value of PPV = 0.53.

As expected, with retention time information available,MarkerView manages to detect a significantly largernumber of peaks. Surprisingly, though, retention timeinformation did not contribute to an increased PPV. Thepartial lack of charge state information also caused theperformance interpretation to favor MarkerView: forpeaks with correct mass/charge ratio, we assumedcompletely error-free charge state assignments, which is

518 520 522 524m/z

inte

nsity

(a.

u.)

518.

22

519.

11

521.

22

523.

23

523.

82


Figure 6Zoom on the m/z 518–525 mass range. NITPICKproves capable of resolving overlapping isotope distributionsand assigning correct monoisotopic masses for thedistributions located at m/z 518.22 and 519.11 andat m/z 523.23 and 523.82. See text for details.

637 641 645m/z

inte

nsity

(a.

u.) 63

6.29

636.

96 639.

65 644.

25


Figure 7Peak picking results in the m/z 636–646 mass range.Illustration of observed (top) and reconstructed (bottom)spectra. At m/z 636.64 and m/z 636.96 we observe incompleteunmixing: The isotope distribution (z=3) with monoisotopicmass m0 located at m/z 636.29 heavily overlaps the distribution(z=3) with m0 = 636.64 m/z (left triangle marker). The overlapproves inseparable and the monoisotopic mass of the seconddistribution is wrongly detected at m/z 636.96. Further, due toconservative noise level/complexity estimation, the isotopedistribution located at m/z 642.33 (right triangle marker) is notdetected. Note that in both of the distributions located atm/z 636.29 and m/z 639.65, the monoisotopic mass peak doesnot correspond to the most intensive peak.

695 700 705 710 715 720m/z

inte

nsity

(a.

u.)

696.

7369

7.22

700.

8 706.

27 710.

83

714.

29

717

718.

81

720.

39

721.

8

722.

7972

3.29

true positive false positive

Figure 8Peak picking in the m/z 695–725 mass range.Illustration of observed (top) and reconstructed (bottom)spectra. With a single exception, all detected peaks could bemanually confirmed. The peak detected at m/z 714.29corresponds to the first isotope peak of the isotopedistribution located at m/z 713.78 (z=2). In the m/z 718–724region the algorithm proves capable of resolving nontriviallow-intensity mixtures.



unlikely to hold true in reality. In contrast, in absence ofretention time information, NITPICK delivered chargestate information for each and every peak and peaks werecounted as true positives if and only if their assignedcharge state was correct. MarkerView's PPV and SER aresubject to overestimation, whereas NITPICK's PPV is not.Even under this pro-MarkerView bias, if joint maximiza-tion of PPV and sensitivity is desired, NITPICK arguablyproved competitive with MarkerView: despite the 1.6-fold increase in sensitivity, only slightly more than halfof the peaks reported by MarkerView are true positives.

Analysis CPU time on the real-world spectrum was 114son a 2 GHz AMD Opteron machine. Measurements arebased on native, interpreted R code. Preliminary testswith an in-house C++ implementation (to be publishedelsewhere) yielded a speed increase by a factor of ≈ 20.

Conclusion and perspectivesConclusionsWe present NITPICK, an iterative, non-greedy, globallyoptimal mixture modeling approach for feature extrac-tion from multicomponent mass spectra. The calculationof the set of explanatory theoretical isotope distributionsis based on fractional averagine, a mass error-free

extension to the well-known averagine [8] model.Subsequent feature selection is driven by a modifiedleast angle regression [8] algorithm for which we derived asuitable, statistically motivated early stopping criterion.Experiments show that NITPICK is able to unmix anddeconvolve complex mixture mass spectra. The algo-rithm was thoroughly evaluated on simulated and real-world data sets and was found to perform better than aconceptually similar algorithm. NITPICK was even foundto deliver competitive results when compared against avendor-supplied algorithm which, in contrast to NIT-PICK, had retention time resolution available.

We would like to note that although the analysis at handwas confined to a proteomics data set, the application ofthe proposed methodology is in no way limited to thistype of data and can easily be adapted to similarproblems outside the field of proteomics.

NITPICK is available as software package for the Rprogramming language and can be downloaded fromhttp://hci.iwr.uni-heidelberg.de/mip/proteomics/.

PerspectivesThe constrained least squares regression model inequation (3) implicitly assumes Gaussian noise on theobserved spectra. Especially with low-intensity time-of-flight spectra the Gaussian approximation is crude,yielding suboptimal estimates. The incorporation of adata type- and intensity-dependent procedure pursuing asuitable Poisson regression approach [36] in appropriatecases could improve on this shortcoming.

The non-negative least squares step in equation (6)assumes error-free basis functions ji. Although fractionalaveragine improves over the classical averagine model,this assumption is still violated. Possible remediesinclude direct intensity estimation techniques [43, 44]and enhanced sparse feature selection methodologywhich allows for errors in explanatory variables. Alter-natively, extended stoichiometry models could provideproblem-tailored basis functions if model bias is notan issue.

For charge states z < 3 and mass ranges m/z ≲ 1400, thereexist so-called forbidden regions [45] within the massspectrum, i.e. mass ranges which are inaccessible topeptides (including modifications). Such informationhas been reported to be suitable as a preprocessingfilter [31].

Further computational efficiency could be achieved by acomplexity-driven hierarchical estimation approach,resorting to subtractive feature extraction for simple

775 777 779 781m/z

inte

nsity

(a.

u.)

776.

06

779.

32

780.

35


Figure 9Observed (top) and reconstructed (bottom) massspectrum in the m/z 775–782 range. The separation oftwo heavily overlapping isotope distribution clearlyillustrates the benefits of NITPICK's intensity model-basedapproach to the peak picking/feature extraction problem: thesecond isotope peak of the isotope distribution located atm/z 779.32 (charge 2) and the monoisotopic peak of thedistribution located at m/z 780.35 (charge 2) are exactlysuperimposed and can only be distinguished by takingintensity information into account.




signals and to the full mixture modeling for complexsamples only.

AppendixA Computation of fractional averagineFor the computation of the isotopic distribution offractional averagine, we build on the fact that thedistribution of the isotopes of an element follows amultinomial [33]. The multinomial is discrete, hence forfractional counts of events we can interpolate betweenthe two adjacent integer multinomials for each elementsuch that

P X x X x

c c P X x X

n c k k

n c k

= − −

=⎢⎣ ⎥⎦ −

= =

= ⎡⎢ ⎤⎥ −( ) = =

( ,..., )

,...,

1 1 1 1

1 1 1 xx

c c P X x X x

k

n c k k

−

=⎡⎢ ⎤⎥ − −

( )+ − ⎢⎣ ⎥⎦( ) = =( )

1

1 1 1 1,...,

(13)

with Èc˘ = min ( )j j c∈ ≥N , Îc˚ = max ( )j j c∈ ≤N and Xi

representing the number of times the ith isotope of anelement occurs. Under the (reasonable) assumption ofindependence of the atomic distributions of the ele-ments, the resulting joint distribution for a moleculefollows from the multiplication of the distributions of itselements.

By changing the order of multiplication and separatingthe highest possible integer number from the remain-ing fractional numbers, the calculation of fractionalaveragine can be related to the Mercury7 algorithm[14], yielding a highly efficient calculation scheme (seeeq. (2)). For the convolution of the Mercury integerresults and the fractionals we follow [46]: Let gp(i)represent the ith element of the probability vector ofthe first and fp(j) the jth element of the seconddistribution, then

h k g i f k ip p p

i

( ) ( ) ( )= −∑ (14)

can be used to compute hp(k), the kth element of the newvector of probabilities for the joint distribution. Simi-larly, the corresponding mass vector hm can be computedusing the probability vectors gp and fp and thecorresponding mass vectors gm and fm using

h k g i f k i

g i f k i g i f k i

m p p

i

p p m m

i

( ) ( ) ( )

( ) ( )( ( ) ( ))

= −⎛

⎝⎜⎜

⎞

⎠⎟⎟

− + −

∑−1

∑∑ .

(15)

B Proof of the monotony of the GDF for thenon-negative lassoAs long as a given set FF( )l is valid, it can be easilyshown that the GDF are monotonous in l. Starting withthe GDF(l) of equation (10),

GDF( )

( )( ) ( ) ( ) ( ) ( )

l

sl

s

l l l l l= −⎛⎝⎜

⎞⎠⎟

=

−s

s

T T T12

12

12

1FF FF FF FF 1

ii

i

N

T T

i

=

−

∑( ) −⎛

⎝⎜⎞⎠⎟

⎛

⎝⎜

⎞

⎠⎟

=

1

1 12

1

1

FF FF FF FF ( ) ( ) ( ) ( ) ( )l l l l lls

ss

ls

l l l l

l

2

1

2 2

1

1s s

s

i

i

NT T

i

i

=

−∑ ( )⎛⎝⎜

⎞⎠⎟

−

FF FF FF FF

FF FF

( ) ( ) ( ) ( )

( ) ( ) ( ) ( )l l lT

ii

N

FF( )⎛⎝⎜

⎞⎠⎟

−

=∑ 1

1

1

(16)

To show that the GDF are monotonously increasing fordecreasing values of l, it suffices to analyze the followingpart of the formula,

s iT

ii

N

iTe

FF FF FF

FF FF

( ) ( ) ( ) ( )

( ) (

l l l l

l l

( )⎛⎝⎜

⎞⎠⎟

⎛

⎝⎜

⎞

⎠⎟

=

−

=∑ 1

1

1

)) ( ) ( )

( ) ( ) (

( )TT

iT

i

N

T T

eFF

FF FF

l l

l l l

( )⎛⎝⎜

⎞⎠⎟

⎛⎝⎜

⎞⎠⎟

=

−

=∑ 1

1

1

1

s

)) ( )

( ) ( ) ( ) ( )

( )⎛⎝⎜

⎞⎠⎟

= ( )⎛

−

=−

∑ 1

1

11

FF

FF FF FF

l

l l l l

Ti i

T

i

N

T T T

e e s

⎝⎝⎜⎞⎠⎟

= ( )⎛⎝⎜

⎞⎠⎟

=

−

=

IN

T T T

jLS

j

K

s

s

c

11

1

( ) ( ) ( ) ( )

( )

l l l l

l

FF FF FF

∑∑

(17)

where ei denotes the ith canonical unit vector of length Nand I e eN i i

Ti

N= =∑ 1is the identity matrix of size N.

ˆ ( )c jLS l is the least squares regression coefficient for thecorresponding least squares problem of the active set. Itis known that all non-negative lasso coefficients ˆ

( )c l j

q

are greater or equal zero, so



ˆ( )

( . )

( ) ( ) ( ) ( )

c

s

l

l l l l

j

q

j

K

eqT T T

≥

⇔ ( )⎛⎝⎜

⎞⎠⎟

−

=

−

∑ 0

1

1

9 1 FF FF FF

1112

1 0

1

1

1

( ) ( ) ( ) ( )

( ) ( ) ( ) (

l l l l

l l l l

lT T

T T

FF FF

FF FF FF

( ) ≥

⇔ ( )

−

−))

( ) ( ) ( ) ( )

ˆ ( )

T

T T

jLS

j

K

⎛⎝⎜

⎞⎠⎟

≥ ( ) ≥

⇔

≥

−

=∑

s

c

12

1 1 01

1

l l l l l

l

FF FF

112

1 1 01

l l l l l ( ) ( ) ( ) ( )T TFF FF( ) ≥

−

(18)

as FF FF ( ) ( )l lT is the inverse of a covariance matrix and,

thus, positive-semidefinite, and l is by definition alwaysgreater or equal 0. Thus, the second part of equation (16)is monotone with regard to l and therefore the GDFs aremonotone as long as a given active set is valid.

It remains to be shown that changes of FF( )l do notinfluence the monotony, so it needs to be shown thatneither the addition of jj to the set FF( )l nor the removalof jk from FF( )l lead to a decrease of cov(s, FF ( ) ( )l lc q )as given in (10). A formal proof is given further below,nevertheless, this can also be argued intuitively.

In the non-negative LARS implementation as describedabove and in [39], a variable jj will be added to theactive set f l( ) only if it is positively correlated with theremaining residuals, i. e. if

cov , ( ) ( )f l ljqs c−( ) >FF 0 (19)

This obviously leads to an increase of cov(s, FF ( ) ( )l lc q ) as less unexplained variation remains.A variable jk is removed from the active set FF( )l onlyif cov (jk, s - FF ( ) ( )l lc q ) < 0, so if the residuals arenegatively correlated with the variable its removal leadsto an increase of cov (s, FF ( ) ( )l lc q ) as well.

Thus, as long as changes of the set FF( )l appear one at atime (which is ensured by the active set implementa-tion), they do not influence the monotonous characterof the estimate of the degrees of freedom.

More formally, when a variable jj is added to the current setof variables FF( )l , the solution for FF FF ( ) ( )l l f

+= ∪ j

can be constructed from the solution of FF( )l in thefollowing manner [39]:

FF FF ( ) ( ) ( ) ( ) ( )l l l l lg+ + +

= +c cq q

u (20)

where

ˆ minˆ ˆ

( )( )

gll=−

−

⎧⎨⎪

⎩⎪

⎫⎬⎪

⎭⎪>∈

+j C

D d jB b j

0 (21)

is strictly positive by definition and gives the magnitudeof the change.

ˆ ( ˆ )( ) ( )d T q= −FF FFs c l l (22)

is the vector of the current correlation and

ˆ max {ˆ | ˆ }.D d dj j j= > 0 (23)

In addition,

B T T ( ) ( ) ( ) ( ) ( )l l l l l= ( )⎛

⎝⎜⎞⎠⎟

− −

1 1

121

FF FF (24)

and

u B T ( ) ( ) ( ) ( ) ( ) ( )l l l l l l= ( )−FF FF FF

11 (25)

leading to

b uT= FF ( ).l (26)

We need to show that

cov , cov ,

cov ,

( ) ( ) ( ) ( )

( ) (

s c s c

s c

FF FF

FF

l l l l

l l

+ +

+

( ) ≥ ( )⇔

q q

)) ( ) ( )

( ) ( ) ( ) ( ) .

+

+ +

−( ) ≥⇔ −( ) ≥

q q

T q q

FF

FF FF

l l

l l l l

c

s c c

0

0

(27)

Using the construction of FF ( ) ( )l l+ +c q from above, this

leads to

⇔ + −( ) ≥⇔ ≥

+

+

s c c

s

T q q

T

u

u

FF FF

( ) ( ) ( ) ( ) ( )

( )( ) .

l l l l l

l

g

g

0

0(28)

It is known from its definition that g is strictly positive,thus it can be dropped from the inequality and

s

s

T

T T

u

B

+−

≥

⇔ ( ) ≥+ + + + +

0

1 01

FF FF FF ( ) ( ) ( ) ( ) ( ) .l l l l l

(29)

It is also known from the idea of the non-negative lassothat all variables in XA are positively correlated with theremaining residuals, so



cov ,( ) ( ) ( )

( ) ( ) ( )

(

FF FF

FF FF

FF

l l l

l l l

s c

s c

s

−( ) ≥⇔ −( ) ≥

⇔

q

q T

T

0

0

ll l l l) ( ) ( ) ( ).≥ ( )FF FF c qT

(30)

Using this result,

s

c

T T

q T

BFF FF FF

FF FF

( ) ( ) ( ) ( ) ( )

( ) ( ) (

l l l l l

l l

+ + + + +

+ +

( )≥ ( )

−11

ll l

l l l

) ( )

( ) ( ) ( )

+ +

+ + +( )−B

T

FF FF11

(31)

holds true and it suffices to show that

FF FF

FF FF

FF

( ) ( ) ( ) ( )

( ) ( ) ( )

(

l l l l

l l l

+ + + +

+ + +

( )( ) ≥

⇔

−

c qT

T

B

11 0

ll l

l l l

) ( )

( ) ( ) ( ) .

+ + +

+ + +

( ) ≥

⇔ ≥

c

c

q T

q T T

u

u

0

1 0FF

(32)

When further recalling the fact from [39] thatFF ( ) ( ) ( ) ( )l l l l

T u B= 1 , this can be reduced to

ˆ ,( ) ( ) ( )c l l l+ + +( ) ≥q TB 1 0 (33)

but as B( )l +is strictly positive by definition, it follows

that

ˆ

ˆ .

( ) ( )

( )

c

c

l l

l

+ +

+

( ) ≥

⇔ ( ) ≥∑q T

q

ii

1 0

0(34)

This is always fulfilled for the non-negative lasso as itis the constraint on its initial optimization problem.The case of the removal of jk from FF( )l can be arguedalmost identically with the only difference being thatnow

FF FF ( ) ( ) ( ) ( ) ( )l l l l lg+ + +

= +c cq q u (35)

where

g gg

=>

min{ }j

j0 (36)

which is also always positive and thus can be droppedfrom the resulting inequality in exactly the same fashionas g could be dropped for the case of the addition of avariable. Consequently, changes in FF( )l do not changethe monotony of the GDF estimate.

C Lower bound properties of BICmin

BICmin is a lower bound for BIC, if ∀k ≥ i

BICmin(i) ≤ BIC(k), (37)

which equals

Ndf N

Ndf Ni i k

sel

sel l

2 2MSE MSE+ ≤ +( ) log ( ) ( ) log

(38)

which is always fulfilled because MSE ≤ MSE(li) anddf (li) ≤ df (lk) for i ≤ k and N ≥ 1, s e

2 > 0.

D SNR definition for simulated spectraGiven the undistorted simulated signal s, the effect ofPoisson noise is simulated with si ← vi, where vi is drawnfrom a Poisson distribution with mean ksi + 1. Thesignal-to-noise ratio (SNR) thus depends on the para-meter k. In order to determine k for a selected set of SNRvalues, we consider the definition

SNRs

ss

n

2

2. (39)

The empirical variance of the original signal s multipliedby a scalar k is defined as

s s i

i

N

k k s s2 2 2

1

( ) ( ) ,−=∑ (40)

where s denotes the mean over all si. For Poisson noise,location and dispersion parameters coincide, i.e. withX ~ P (l) we have Var(X) = E (X) = l, and weapproximate the variance of a set of Poisson variablesni ~ P (ksi), i = 1, ..., N by their average

s n i

i

N

kN

ks2

1

1( ) .

=∑ (41)

For a given SNR, this allows the estimation of k because

SNR = =s

s

s

ss k

n kk s

n

2

2

2

2( )

( )(42)

and thus

k n

s= s

s

2

2SNR. (43)



Authors' contributionsBYR and MK have developed the methodology, imple-mented the software, carried out the data analysis anddrafted the manuscript. HS and JAJS have contributed tothe basic methodology and the manuscript, carried outcritical review and provided application feedback andevaluation for the proposed methods. FAH has suggestedthe fractional averagine approach, and has contributedto the manuscript and the overall project design. Allauthors have read and approved the final manuscript.

Additional material

Additional file 1A zip folder containing all simulation files (R data files).Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S1.zip]

Additional file 2The zipped original LC/MS .wiff-file on which MarkerView was run(as acquired by the AB/Sciex QStar instrument)Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S2.zip]

Additional file 3The original spectrum of BSA-sample.wiff integrated over retentiontime (23.817-29.278 minutes) on which NITPICK was run.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S3.txt]

Additional file 4A zip-folder containing all pepex results on the simulated data.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S4.zip]

Additional file 5A zip-folder containing all NITPICK results on the simulated data setsand for each SNR (SNR in 5, 10,25, 50, 100) a R data file called•resultList_0_’SNR’_0.1.RDA gives the peaks found by NITPICK for allspectra of a certain SNR• lengthResultList_0_’SNR’_0.1.RDA gives thenumber of peaks found by NITPICK for each spectrum•correct_0_’SNR’_0.1.RDA gives the number of correctly identified peaksfound by NITPICK for each spectrum• tooMany_0_’SNR’_0.1.RDA givesthe number of incorrectly identified peaks found by NITPICK for eachspectrum• pp_resultList_0_3_0_’SNR’_0.1.RDA gives the peaks found byNITPICK for all spectra of a certain SNR after postprocessing with g=3•lengthResultList_0_3_0_’SNR’_0.1.RDA gives the number of peaks foundby NITPICK for each spectrum after postprocessing with g=3•correct_0_3_0_’SNR’_0.1.RDA gives the number of correctly identifiedpeaks found by NITPICK for each spectrum after postprocessing with g=3•tooMany_0 3_0_’SNR’_0.1.RDA gives the number of incorrectly identifiedpeaks found by NITPICK for each spectrum after postprocessing with g=3Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S5.zip]

Additional file 6Excel sheet containing the peaks detected by NITPICK (mz-position,charge, intensity) as well as their manual validation.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S6.xls]

Additional file 7Excel sheet containing the peaks detected by MarkerView(mz-position, charge, if available) as well as their manual validation.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-9-355-S7.xls]

AcknowledgementsThe authors would like to thank Yin Yin Lin (Dept. of Pathology,Children's Hospital, Boston, MA, USA) for LC/MS data acquisition, LyleBurton (AB/MDS Sciex, Concord, Canada) for MarkerView 1.2 evaluationversions, Ullrich Köthe, Linus Görlitz, Björn Menze, Michael Kelm(Interdisciplinary Center for Scientific Computing (IWR), University ofHeidelberg, Germany), and Flavio Monigatti (Dept. of Pathology,Children's Hospital, Boston, MA, USA) for comments, suggestions, andfruitful discussions. We gratefully acknowledge financial support by theHans L. Merkle foundation (M.K.), the Karl Steinbuch Scholarship (B.Y.R.),dm/Filiadata GmbH (B.Y.R.), Robert Bosch GmbH (F.A.H.), the Children'sHospital Trust (J.A.J.S. and H.S.), and the DFG under grant no. HA4364/2-1(B.Y.R., F.A.H).

References1. Jensen ON: Interpreting the protein language using proteo-

mics. Nature Reviews Molecular Cell Biology 2006, 7(6):391–403.2. Beretta L: Proteomics from the Clinical Perspective: Many

Hopes and Much Debate. Nature Methods 2007, 4(10):785–786.3. Schwartz SA, Weil RJ, Johnson MD, Toms SA and Caprioli RM:

Protein Profiling in Brain Tumors Using Mass Spectro-metry: Feasibility of a New Technique for the Analysis ofProtein Expression. Clinical Cancer Research 2004, 10:981–987.

4. Claydon MA, Davey SN, Edwards-Jones V and Gordon DB: TheRapid Identification of Intact Microorganisms Using MassSpectrometry. Nature Biotechnology 1996, 14:1584–1586.

5. Pineda FJ, Antoine MD, Demirev PA, Feldman AB, Jackman J,Longenecker M and Lin JS: Microorganism Identification byMatrix-Assisted Laser/Desorption Ionization Mass Spectro-metry and Model-Derived Ribosomal Protein Biomarkers.Analytical Chemistry 2003, 75(15):3817–3822.

6. Zhang Z and Marshall AG: A Universal Algorithm for Fast andAutomated Charge State Deconvolution of ElectrosprayMass-to-Charge Ratio Spectra. Journal of the American Society forMass Spectrometry 1998, 9(3):225–33.

7. Yu W, Wu B, Lin N, Stone K, Williams K and Zhao H: Detectingand Aligning Peaks in Mass Spectrometry Data withApplications to MALDI. Computational Biology and Chemistry2006, 30:27–38.

8. Senko M, Beu S and McLafferty F: Determination of Mono-isotopic Masses and Ion Populations for Large Biomoleculesfrom Resolved Isotopic Distributions. Journal of the AmericanSociety for Mass Spectrometry 1995, 6:229–233.

9. Horn DM, Zubarev RA and McLafferty FW: Automated Reduc-tion and Interpretation of High Resolution ElectrosprayMass Spectra of Large Molecules. Journal of the American Societyfor Mass Spectrometry 2000, 11(4):320–332.

10. Wehofsky M, Hoffmann R, Hubert M and Spengler B: IsotopicDeconvolution of Matrix-Assisted Laser Desorption/Ioniza-tion Mass Spectra for Substance-Class Specific Analysis ofComplex Samples. European Journal of Mass Spectrometry 2001,7:39–46.

11. Gras R, Muller M, Gasteiger E, Gay S, Binz PA, Bienvenut W,Hoogland C, Sanches JC, Bairoch A, Hochstrasser DF and Appel RD:Improving Protein Identification from Peptide Mass



http://www.ncbi.nlm.nih.gov/pubmed/16723975?dopt=Abstract




















Fingerprinting through a Parameterized Multi-LevelScoring Algorithm and an Optimized Peak Detection.Electrophoresis 1999, 20:3535–3550.

12. Rockwood A, Van Orden S and Smith R: Rapid Calculation ofIsotope Distributions. Analytical Chemistry 1995, 67:2699–2704.

13. Rockwood A, Van Orden SL and Smith RD: Ultrahigh-SpeedCalculation of Isotope Distributions. Analytical Chemistry 1996,68:2027–2030.

14. Rockwood A and Haimi P: Efficient Calculation of AccurateMasses of Isotopic Peaks. Journal of the American Society for MassSpectrometry 2006, 17:415–419.

15. Yergey JA: A General Approach to Calculating IsotopicDistributions for Mass Spectrometry. International Journal ofMass Spectrometry and Ion Physics 1983, 52:337–349.

16. Senko M: Isopro 3.0.1997 http://members.aol.com/msmssoft/.17. Breen EJ, Hopwood FG, Williams KL and Wilkins MR: Automatic

Poisson Peak Harvesting for High Throughput ProteinIdentification. Electrophoresis 2000, 21:2243–2251.

18. Chen L, Sze SK and Yang H: Automated Intensity DescentAlgorithm for Interpretation of Complex High-ResolutionMass Spectra. Analytical Chemistry 2006, 78:5006–5018.

19. Kaur P and O'Connor PB: Algorithms for automatic inter-pretation of high resolution mass spectra. Journal of theAmerican Society for Mass Spectrometry 2006, 17(3):459–468.

20. Szymura JA and Lamkiewicz J: Band Composition Analysis: anew Procedure for Deconvolution of the Mass Spectra ofOrganometallic Compounds. Journal of Mass Spectrometry 2003,38:817–822.

21. Wehofsky M and Hoffmann R: Automated Deconvolution andDeisotoping of Electrospray Mass Spectra. Journal of MassSpectrometry 2002, 37:223–229.

22. Zhang X, Hines W, Adamec J, Asara JM, Naylor S and Regnier FE: AnAutomated Method for the Analysis of Stable IsotopeLabeling Data in Proteomics. Journal of the American Society forMass Spectrometry 2005, 16:1181–1191.

23. Mason CJ, Therneau TM, Eckel-Passow JE, Johnson KL, Oberg AL,Olson JE, Nair KS, Muddiman DC and Bergen HRI: A Method forAutomatically Interpreting Mass Spectra of 18O LabeledIsotopic Clusters. Molecular & Cellular Proteomics 2006, 6:305–318.

24. Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, Norton S,Kumar P, Anderle M and Becker CH: Quantification of Proteinsand Metabolites by Mass Spectrometry without IsotopicLabeling or Spiked Standards. Analytical Chemistry 2003,75:4818–4826.

25. Senko MW, Beu SC and McLafferty FW: Automated Assignmentof Charge States from Resolved Isotopic Peaks for MultiplyCharged Ions. Journal of the American Society for Mass Spectrometry1995, 6:52–56.

26. Tabb DL, Shah MB, Strader MB, Conelly HM, Hettich RL andHurst GB: Determination of Peptide and Protein ion ChargeStates by Fourier Transformation of Isotope-Resolved MassSpectra. Journal of the American Society for Mass Spectrometry 2006,17:903–915.

27. Listgarten J and Emili A: Statistical and Computational Methodsfor Comparative Proteomic Profiling Using Liquid Chro-matography-Tandem Mass Spectrometry. Molecular and Cel-lular Proteomics 2005, 4(4):419–434.

28. Fernández-de-Cossio J, Gonzalez LJ, Satomi Y, Betancourt L,Ramos Y, Huerta V, Besada V, Padron G, Minamino N andTakao T: Automated Interpretation of Mass Spectra ofComplex Mixtures by Matching of Isotope Peak Distribu-tions. Rapid Communications in Mass Spectrometry 2004,18:2465–2472.

29. Roussis SG and Proulx R: Reduction of Chemical Formulasfrom the Isotopic Peak Distributions of High-ResolutionMass Spectra. Analytical Chemistry 2003, 75:1470–1482.

30. Samuelsson J, Dalevi D, Levander F and Rögnvaldsson T: Modular,Scriptable and Automated Analysis Tools for High-Throughput Peptide Mass Fingerprinting. Bioinformatics 2004,20:3628–3635.

31. Du P and Angeletti RH: Automatic Deconvolution of Isotope-Resolved Mass Spectra Using Variable Selection andQuantized Peptide Mass Distribution. Analytical Chemistry2006, 78:3385–3392.

32. Tibshirani R: Regression Shrinkage and Selection via theLASSO. Journal of the Royal Statistical Society 1996, Series B58:267–288.

33. Kaur P and O'Connor PB: Use of Statistical Methods forEstimation of Total Number of Charges in a Mass Spectro-metry Experiment. Analytical Chemistry 2004, 76:2756–2762.

34. Casella G and Berger RL: Statistical Inference Duxbury Press; 2001.35. Lawson CL and Hanson RJ: Solving Least Squares Problems Prentice-

Hall, Englewood Cliffs, N J; 1974.36. Park MY and Hastie T: An L1 Regularization-path Algorithm for

Generalized Linear Models. Journal of the Royal Statistical Society,Series B 2007, 69:659–677.

37. Hastie T, Tibshirani R and Friedman J: The Elements of StatisticalLearning; Data Mining, Inference, and Prediction Springer Verlag NewYork; 2001.

38. Ye J: On Measuring and Correcting the Effects of DataMining and Model Selection. Journal of the American StatisticalAssociation 1998, 93:120–131.

39. Efron B, Hastie T, Johnstone I and Tibshirani R: Least AngleRegression. Annals of Statistics 2004, 32(2):407–499.

40. Zou H, Hastie T and Tibshirani R: On the "Degrees of Freedom"of the Lasso. Annals of Statistics 2007, 35(5):2173–2192.

41. Bairoch A and Apweiler R: The SWISS-PROT ProteinSequence Database and its Supplement TrEMBL in 2000.Nucleic Acids Research 2000, 28:45–48.

42. Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A andLe QT: Sample Classification from Protein Mass Spectro-metry, by Peak Probability Contrasts. Bioinformatics 2004, 20(17):3034–3044.

43. Wallace WE, Kearsley AJ and Guttman CM: An Operator-Independent Approach to Mass Spectral Peak Identificationand Integration. Analytical Chemistry 2004, 76:2446–2452.

44. Kearsley AJ, Wallace WE, Bernal J and Guttman CM: A NumericalMethod for Mass Spectral Data Analysis. Applied MathematicsLetters 2005, 18:1412–1417.

45. Mann M: Useful Tables of Possible and Probable PeptideMasses. 43rd Conference on Mass Spectrometry and Allied Topics 1995.

46. Rockwood AL, Kushnir MM and Nelson GJ: Dissociation ofindividual isotopic peaks: predicting isotopic distributions ofproduct ions in MSn. Journal of the American Society for MassSpectrometry 2003, 14(4):311–22.

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral








http://members.aol.com/msmssoft/

http://members.aol.com/msmssoft/


















































http://www.biomedcentral.com/info/publishing_adv.asp


NITPICK: peak identification for mass spectrometry data

Documents

Transcript of NITPICK: peak identification for mass spectrometry data