Peek a peak: a glance at statistics for quantitative label-free proteomics



Review

www.expert-reviews.com ISSN 1478-9450 © 2010 Expert Reviews Ltd 10.1586/EPR.09.107

Quantitative proteomics

The proteome of a cell is a highly dynamic system. Small perturbations in the protein composition of a cell may lead to significant differences through the activation or inhibition of important pathways. This in turn may cause a cell to differentiate, proliferate or even undergo apoptosis. Because of these mechanisms, changes in the proteome are also important factors in the pathogenesis and progression of many diseases, among them cancer and neurodegenerative diseases such as Alzheimer’s or Parkinson’s disease. To understand such diseases, and ultimately to discover possible targets for drug development, it is hence essential to understand the changes in the proteome of the diseased cells [1–4]. With today’s knowledge of the cell being a complex, dynamic system, it is particularly important to evaluate not only the qualitative but also the quantitative changes occurring within those cells. This is usually done through methods for quantitative proteomics [5].

A well-established method for quantification of large and complex protein samples is the use of 2D gels [6,7]. The intact proteins are separated and the intensities of individual protein spots can be compared across different samples. However, this method is relatively labor intensive and time consuming and, hence, hardly usable for high-throughput quantification. Thus, during recent years, mass spectrometry (MS) methods have become increasingly popular for quantitative proteomics [8,9]. This has been possible thanks to the increased sensitivity and mass accuracy achieved with modern mass spectrometers. Generally, the so-called ‘shotgun proteomics’ approach is used for quantification with MS. This means that proteins are first digested into peptides and the derived peptide mixture is separated and measured by MS, often coupled with one or several dimensions of liquid chromatography (LC).

Quantification by MS has so far mainly been performed with different kinds of isotopic labeling of samples [10–12]. Stable isotope-labeled peptides can even be spiked into samples in known quantities as internal standards for absolute quantification (AQUA). Labeling of the protein or peptide mixture generally requires multiple steps of sample preparation before the actual quantitative analysis can take place [13,14]. This complicates sample handling and runs the risk of introducing technical variation in addition to any existing biological variation between samples.

Katharina Podwojski, Martin Eisenacher, Michael Kohl, Michael Turewicz, Helmut E Meyer, Jörg Rahnenführer and Christian Stephan†

†Author for correspondence
Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany
Tel.: +49 234 322 9288
Fax: +49 234 321 4554
christian.stephan@ruhr-uni-bochum.de

Today, label-free mass spectrometry methods are frequently used for quantification of proteins and peptides. There have been several proposals of measurable parameters that best reflect quantities, such as peak areas as well as spectral counts. This review provides a systematic overview of the proposed methods. Owing to the shotgun proteomics approach generally used today for label-free mass spectrometry, any quantitative measure in the first place is a measure of peptide quantity. There has been no systematic research on how to best infer protein quantity from its measured peptides’ quantities. The way peptide identifications are assembled to protein lists may especially lead to significantly different results in protein quantification. A further focus of this review will thus be the assembly of measured peptide quantities to a protein quantity.

Keywords: identification • label-free • mass spectrometry • quantification • spectral counting

Expert Rev. Proteomics 7(2), 249–261 (2010)

Thus, several label-free approaches have recently been used for quantification with MS. Figure 1 depicts possible computational workflows for label-free quantification. Label-free approaches may be divided into two main groups by the way that the abundance of a peptide is measured. The first group comprises methods that are based on the ion count and compare either maximum abundance or volume of ion count for peptide peaks between different samples [15–18]. The second group is based on the identification of peptides by MS/MS and uses sampling statistics such as peptide count, spectral counts, or sequence coverage to quantify the differences between samples [19–22]. What is, with few exceptions [23], identical for all methods of label-free quantification is that quantification is generally only relative. This is in part due to differences in ionization efficiency for different peptides. These result in different measurable abundances, even when the peptides have the same concentration. Thus, even peptides from the same protein may be measured with different abundances. Furthermore, there is the possibility of ion suppression effects. The general notion is, however, that ionization efficiencies and ion suppression affect the measured abundances only such that abundance ratios between different groups remain unaltered. Hence quantities are generally compared across different samples through measuring ratios or simply testing for differences between different groups. In this way, relative quantification is obtained.

In order to derive good estimates of the true ratios between groups, the metric used to measure the abundance of a protein has to be linearly correlated with the true abundance or has to be transformable to such a metric. Linearity ensures that a change in abundance between different groups can be estimated through the ratio of abundances of these groups.
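The role of linearity can be illustrated with a minimal sketch. The response factor, the abundances and the quadratic response below are hypothetical values chosen for illustration only; they are not taken from any of the cited studies.

```python
import math

def measured_abundance(true_abundance, response_factor):
    """Linear response: measured = response_factor * true abundance."""
    return response_factor * true_abundance

# Hypothetical peptide with an unknown ionization efficiency of 0.3
response = 0.3
control = measured_abundance(10.0, response)
treated = measured_abundance(25.0, response)

# The response factor cancels, so the measured ratio equals the true ratio
linear_ratio = treated / control

# A hypothetical quadratic response distorts the raw ratio ...
control_sq = response * 10.0 ** 2
treated_sq = response * 25.0 ** 2
raw_ratio = treated_sq / control_sq

# ... but a square-root transform makes the metric linear again
transformed_ratio = math.sqrt(treated_sq) / math.sqrt(control_sq)

print(linear_ratio, raw_ratio, transformed_ratio)
```

Under these assumptions the linear and the transformed metric both recover the true ratio, whereas the untransformed quadratic metric overestimates it.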

There have been several arguments for one or the other method of label-free quantification. So far, little research has addressed the problem that, owing to the shotgun approach, most methods directly compare only peptides and not whole proteins. Thus, a further inference step has to be undertaken from the quantities measured on the peptide level to the actual quantities of the corresponding proteins. For this step, it is essential to identify the peptides in the samples and build protein lists from the peptide identifications [24]. Owing to ambiguities in sequences from different proteins, this is not a trivial task. The complexity and evolutionary recycling of domains and sequences within the genome result in tryptic peptides being present several times in the proteome [25,26]. In our experience, over 50% of the peptides are present several times in some types of databases, depending on the redundancy and completeness of the protein database. We call a peptide non-unique if its sequence belongs to several database entries.
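The definition of a non-unique peptide can be made concrete with a short sketch. The toy database and the function names are hypothetical; a real analysis would derive the peptide sets from an in-silico digest of a sequence database.

```python
from collections import defaultdict

def peptide_to_entries(database):
    """Map each peptide sequence to the set of database entries containing it."""
    index = defaultdict(set)
    for entry, peptides in database.items():
        for peptide in peptides:
            index[peptide].add(entry)
    return index

def non_unique_peptides(database):
    """Return all peptides whose sequence belongs to more than one entry."""
    index = peptide_to_entries(database)
    return {pep for pep, entries in index.items() if len(entries) > 1}

# Hypothetical toy database: entry name -> set of tryptic peptides
toy_db = {
    "P1": {"AAK", "LLSR", "GGTK"},
    "P2": {"LLSR", "VVEK"},
    "P3": {"GGTK", "LLSR"},
}

shared = non_unique_peptides(toy_db)
fraction_shared = len(shared) / len(peptide_to_entries(toy_db))
print(sorted(shared), fraction_shared)
```

In this toy example two of the four distinct peptides are non-unique, which mirrors the order of magnitude mentioned above without claiming anything about real databases.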

Identifying peptides and correctly assigning them to proteins is the first step for protein quantification. The next is that peptide ratios have to be combined to protein ratios. Several approaches have already been introduced in isotopic-labeling approaches for this purpose [27,28]. Furthermore, measuring the abundance of a peptide does not necessarily allow direct inference on a protein’s abundance. It may well be that the abundance of the peptide is a composition of abundances from several proteins. Up until now, it has not been clear how best to handle such peptides. We will, therefore, further discuss this problem and highlight possible solutions.

Label-free quantification through ion count

The typical LC-MS workflow first separates a digested protein sample with LC. The frequently used reverse-phase column binds the peptides from the sample. A gradient of solvent causes elution of the peptides from the column according to the peptides’ hydrophobicity. An electrospray ionization source directly generates gas-phase ions from the peptide solution eluting from the LC column. The mass spectrometer then separates the ions by their mass-to-charge (m/z) ratio [8,29]. Often, a further round of MS is used to derive fragment ion spectra (MS/MS) from the most intense precursor ions within each mass spectrum. These MS/MS spectra are afterwards used for peptide identification. An LC-MS experiment thus results in measurements of ion abundance at each LC retention time (RT) and m/z ratio within the detection limits defined by the technical and experimental setup. We call the combined data from such an experiment an LC-MS map, as it is possible to display the data in a 2D map where RT and m/z represent the two axes and the abundance can be displayed through different color shadings. The measured ions are the ionized peptides. Hence, ion abundance should reflect peptide abundance except for spurious and systematic noise introduced by the experimental setup. The idea is to use the measured ion abundances for quantitative analyses.

[Figure 1: workflow diagram. Box labels: LC-MS/MS; Quantification; Identification; Ion count methods; Denoising, smoothing and baseline removal; Peak detection, de-isotoping and charge-state deconvolution; Retention time alignment; Matching across maps; Quantified features; Database search and peptide identification; FDR; Peptide lists; Combining to protein lists; Spectral counting; Spectral counts; Normalization; Ratios, tests and classification.]

Figure 1. Possible workflows for label-free quantification of liquid chromatography with tandem mass spectrometry data. FDR: False discovery rate; LC: Liquid chromatography; MS: Mass spectrometry.

A ready approach would be to directly compare ion abundances in raw LC-MS maps, as was also carried out in early analyses with matrix-assisted laser desorption/ionization (MALDI) and surface-enhanced laser desorption/ionization (SELDI) mass spectra [30–32]. However, a peptide does not result in a single signal in the LC-MS map. Peptides usually elute from the LC column for some period of time, resulting in signal across that time range. To a lesser extent, a peptide also results in signals measured across a certain m/z range. This is due to different isotopic variants of the peptide as well as measurement errors. In addition, especially with an electrospray ionization source, several different charge states are frequently measured for a specific peptide. This also results in signals across different regions of the LC-MS map belonging to the same peptide. Owing to technical variations, there may, furthermore, be perturbations in RT, m/z and intensity. When the mass spectrometer is calibrated regularly, the deviations in m/z on modern mass spectrometers may be down to several parts-per-million (ppm). But deviations in RT, especially in large studies comprising up to hundreds of samples, may span up to several minutes. This is especially the case when samples have been processed on several instruments or when the LC column is changed in between samples from the same study. Thus, when comparing unprocessed data across several samples, one may not necessarily compare the same peptide across all samples. Last but not least, signals in LC-MS maps may be due to chemical and technical noise and not due to real peptides. These parts of the map should not be used for analysis at all. It has thus become widely accepted to perform several steps of preprocessing to extract true signals from the measured data and, hence, derive spectral peaks representing true peptides that are comparable across different samples.
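To give a feeling for what an m/z deviation of several ppm means, the following sketch computes the relative deviation between an observed and a reference m/z value. The function names and the tolerance of 20 ppm are illustrative assumptions.

```python
def ppm_error(observed_mz, reference_mz):
    """Relative m/z deviation in parts-per-million."""
    return (observed_mz - reference_mz) / reference_mz * 1e6

def within_tolerance(observed_mz, reference_mz, tol_ppm):
    """True if the absolute deviation does not exceed tol_ppm."""
    return abs(ppm_error(observed_mz, reference_mz)) <= tol_ppm

# At m/z 500, a shift of 0.005 corresponds to about 10 ppm
print(ppm_error(500.005, 500.0))
print(within_tolerance(500.005, 500.0, tol_ppm=20))
print(within_tolerance(500.05, 500.0, tol_ppm=20))
```

Such a tolerance check is one simple way to decide whether two signals in different maps may stem from the same peptide in the m/z dimension; the RT dimension usually needs the alignment discussed later.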

The preprocessing of LC-MS data generally comprises several (or all) of the following steps [17]. In the first steps, noise can be removed and baselines can be subtracted from the raw LC-MS spectra. The next steps may be peak detection and quantification. Peaks can be further assembled to biologically meaningful features through de-isotoping or identifying and even merging of different charge states from the same peptide, often referred to as charge state deconvolution. Finally, to detect features representing the same peptide, alignment and merging of peaks or features across samples is needed for comparison of several samples. Also, normalization of feature abundances can remove some of the technical variations between several samples.

So far, it has not been assessed how the individual preprocessing tasks should be optimally arranged. In fact, this is not a trivial task and may well depend on the individual algorithms used. Even though many of the steps can theoretically be applied independently of each other, it is still not clear how possible errors made in an early preprocessing step might carry forward into subsequent steps. On the other hand, results from previous tasks may boost the performance of later processing steps, either in accuracy or in run time. Finally, some of the steps might be possible and worth performing simultaneously [33]. Even though comprehensive analyses and comparisons of complete preprocessing procedures have not been reported yet, much active research is performed on the individual aspects of preprocessing.

Overall, it is hardly possible to recommend one or the other procedure. The optimal solution will be dependent on both the instruments used and the scientific goal of an experiment. For example, the resolution of a mass spectrometer has influence on the optimal peak detection method. On the other hand, large and complex samples will need more sophisticated alignment and matching methods than small experiments with standard protein mixtures.

Denoising & smoothing of raw spectra

Technical noise in LC-MS data results in spurious ion signals. These spurious peaks have to be discerned from true peaks resulting from peptides. True peptide peaks are typically visible over a certain m/z and RT range, which is not the case for noise peaks. Denoising or smoothing is often used as a first step of data processing for reducing or removing those noise peaks. Various filter methods have been applied for this task. Such filters may be applied both in the RT domain and in the m/z domain. For the latter case, methods from MALDI and SELDI spectra preprocessing may be readily transferred. The Savitzky-Golay filter has been used for denoising and baseline removal [34,35]. Barclay et al. have compared Savitzky-Golay, Fourier transform and wavelet transform filtering for smoothing and denoising of spectra [36]. Translation-invariant wavelet transforms for smoothing have also been used [37]. Further examples for smoothing include moving averages [38], Gaussian filters [34] and kernel density estimation [39]. Noy and Fasulo subsequently used the top-hat filter for baseline removal [39].
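Of the smoothing approaches listed above, the moving average is the simplest to write down. The following is a minimal sketch, not the implementation used in [38]; the function name, the window size and the treatment of the spectrum edges (shrinking the window) are assumptions.

```python
def moving_average(intensities, window=3):
    """Centered moving average; the window shrinks at the spectrum edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(intensities)):
        lo = max(0, i - half)
        hi = min(len(intensities), i + half + 1)
        segment = intensities[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

# Hypothetical noisy trace with alternating spikes
print(moving_average([0, 3, 0, 3, 0], window=3))
```

A wider window removes more noise but also flattens and broadens true peptide peaks, which is the trade-off that the more elaborate filters cited above try to handle better.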

Some filtering methods might need regular intervals between measurements. This can generally be assumed for RT but not for m/z. Thus, some groups perform a binning of the m/z domain before further data manipulation [16,38,40]. The most important matter here is to find a suitable width that offers the right trade-off between sufficient entries in each bin to avoid sparsity and yet enough bins to keep the necessary information within m/z.
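Binning of the m/z domain can be sketched as follows. The function name, the fixed bin width and the choice to sum intensities within a bin are assumptions for illustration; the cited groups may bin differently.

```python
import math

def bin_spectrum(mz_values, intensities, mz_min, mz_max, bin_width):
    """Sum intensities into equally wide m/z bins covering [mz_min, mz_max)."""
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    bins = [0.0] * n_bins
    for mz, intensity in zip(mz_values, intensities):
        if mz_min <= mz < mz_max:
            idx = int((mz - mz_min) / bin_width)
            idx = min(idx, n_bins - 1)  # guard against floating-point rounding
            bins[idx] += intensity
    return bins

# Hypothetical spectrum with irregularly spaced m/z values
binned = bin_spectrum([400.1, 400.4, 401.2, 402.9], [10, 5, 7, 1],
                      mz_min=400.0, mz_max=403.0, bin_width=1.0)
print(binned)
```

The bin width is the tuning parameter discussed above: here a width of 1.0 merges the first two signals into one bin, which may or may not be acceptable depending on the instrument resolution.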

Generally, most instrument software is capable of performing first processing steps for raw data and often automatically does so silently. This has to be kept in mind when performing the aforementioned procedures or using the derived data in the following processing steps.


Peak/feature detection & quantification

The detection of peaks may have several advantages over using raw LC-MS signals. The most evident advantage is that peaks comprising the signal originating from one specific peptide directly link biological meaning to peak data. Furthermore, peak detection is also a means of data reduction and thus may simplify the subsequent analysis steps. Even though peaks can generally be directly defined in 2D space [35], peaks are often first detected individually in each single mass spectrum. In this case, peak detection is comparable to that for MALDI/SELDI spectra, for which many procedures have been introduced [41–43]. Peaks in a single spectrum are often defined as simple local maxima [39,44]. While Noy et al. used the smoothed spectrum to detect those local maxima [39], others have used wavelet decomposition [34,44].
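The definition of peaks as simple local maxima in a single spectrum can be sketched in a few lines. The function name, the intensity threshold and the handling of plateaus (reported once, at their left edge) are assumptions, not the rules of any cited tool.

```python
def local_maxima(intensities, min_intensity=0.0):
    """Return indices of local maxima at or above an intensity threshold."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        left, centre, right = intensities[i - 1], intensities[i], intensities[i + 1]
        # Strictly higher than the left neighbour, at least as high as the right
        if centre > left and centre >= right and centre >= min_intensity:
            peaks.append(i)
    return peaks

# Hypothetical (already smoothed) spectrum
spectrum = [0, 2, 1, 0, 5, 9, 4, 0, 1, 0]
print(local_maxima(spectrum, min_intensity=2))
print(local_maxima(spectrum, min_intensity=3))
```

Applying the detection to a smoothed spectrum, as described above, reduces the number of noise-induced maxima that pass the threshold.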

Modern mass spectrometers have very high resolution that even makes different isotopic variants of peptides visible. This property can be used to further characterize features in LC-MS maps. By comparing signals or already detected peaks with theoretical isotopic patterns of peptides, features can be detected that represent a peptide. Such strategies are now commonly applied [34,39,45,46]. The differences between the individual isotopic peaks of a peptide also permit charge state deconvolution. Charge states might either be combined to a single entry for one peptide or be kept individually. If the feature detection has been carried out individually on each spectrum, then the features afterward have to be combined along retention time [34,46].
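The link between isotopic peak spacing and charge state can be sketched as follows: neighbouring isotopic peaks of a peptide differ by roughly one neutron mass, so their spacing on the m/z axis is approximately 1.003355/z. The function names and the maximum charge are assumptions; real tools additionally compare the intensity pattern with a theoretical isotope distribution.

```python
C13_C12_MASS_DIFF = 1.003355  # Da, mass difference between 13C and 12C
PROTON_MASS = 1.007276        # Da

def charge_from_isotope_spacing(mz_peaks, max_charge=6):
    """Estimate the charge state from the mean spacing of isotopic peaks."""
    spacings = [b - a for a, b in zip(mz_peaks, mz_peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    charge = round(C13_C12_MASS_DIFF / mean_spacing)
    if not 1 <= charge <= max_charge:
        raise ValueError("spacing does not match a plausible charge state")
    return charge

def neutral_monoisotopic_mass(mono_mz, charge):
    """Convert the monoisotopic m/z of a protonated ion to its neutral mass."""
    return charge * (mono_mz - PROTON_MASS)

# Hypothetical isotopic cluster with a spacing of about 0.5 m/z units
cluster = [500.50, 501.00, 501.50]
z = charge_from_isotope_spacing(cluster)
print(z, neutral_monoisotopic_mass(cluster[0], z))
```

Converting each charge state to the neutral monoisotopic mass is one way to recognize that features at different m/z positions belong to the same peptide.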

Finally, a list of identified features giving several measures is generally returned. Measures might be feature position (i.e., monoisotopic mass, RT), charge state and abundance. There are two common ways to define feature abundance. The first is to integrate the peak area either in the mass domain, or in both the mass and RT domains [46]. The second is to return the peak intensity, generally of the monoisotopic peak [34]. Even though peak areas have been shown to correlate with protein concentration [15], it has still not been assessed which of the two abundance measures is more appropriate for quantification, and there are examples of both methods being used in the literature.
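The two abundance measures can be contrasted with a small sketch for a one-dimensional peak profile. The function names are hypothetical, and the trapezoidal rule is only one of several possible ways to integrate a peak.

```python
def peak_height(intensities):
    """Abundance as the maximum intensity of the peak profile."""
    return max(intensities)

def peak_area(positions, intensities):
    """Abundance as the area under the peak profile (trapezoidal rule)."""
    area = 0.0
    for i in range(len(positions) - 1):
        width = positions[i + 1] - positions[i]
        area += 0.5 * (intensities[i] + intensities[i + 1]) * width
    return area

# Hypothetical elution profile: retention times (min) and intensities
rt = [10.0, 10.5, 11.0, 11.5, 12.0]
profile = [0, 4, 10, 4, 0]
print(peak_height(profile), peak_area(rt, profile))
```

The two measures react differently to peak broadening: a wider elution profile at constant total signal lowers the height but leaves the area roughly unchanged.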

Alignment & matching of spectra

After peaks have been determined in the individual spectra, the question remains as to which peak in one LC-MS map represents the same peptide as a peak in another map. Naturally, one would assume that specific peptides always occur at the same m/z and RT within each map. However, this is only true up to measurement errors. Regular calibration of the mass spectrometer ensures the variations in m/z to be small. On the other hand, the chromatographic system is usually very sensitive to variations in pressure or temperature and quickly shows distortions in the RT of peptides even between replicated analyses. Distortions in RT between samples may be up to minutes, impeding an easy and proper matching of signals across LC-MS runs. Hence, a great deal of effort has been made to create alignment algorithms that correct the distortions in RT between samples and afterwards match corresponding peaks or features. Most algorithms need a coarse matching of peaks to begin with, which is often based on the more accurate m/z values of features. The matched peaks are then used to define the deviations between the different maps. On the basis of the deviations found, the maps are corrected correspondingly. LC-MS maps are aligned to a reference sample [47,48] or all maps are aligned simultaneously [33,49].
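The general idea of correcting RT on the basis of coarsely matched peaks can be sketched with a piecewise-linear mapping onto a reference map. This is a deliberately simple stand-in and not one of the cited algorithms; the function name and the landmark pairs are hypothetical, and landmark retention times are assumed to be distinct.

```python
from bisect import bisect_right

def make_rt_mapper(landmarks):
    """Build a piecewise-linear RT correction from matched peak pairs.

    landmarks: list of (rt_in_sample, rt_in_reference) pairs.
    """
    if len(landmarks) < 2:
        raise ValueError("at least two landmarks are needed")
    points = sorted(landmarks)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def to_reference(rt):
        # Pick the enclosing segment; extrapolate with the outermost segments
        i = bisect_right(xs, rt) - 1
        i = max(0, min(i, len(xs) - 2))
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (rt - xs[i])

    return to_reference

# Hypothetical landmarks obtained from a coarse m/z-based matching
mapper = make_rt_mapper([(10.0, 11.0), (20.0, 22.0), (30.0, 31.0)])
print(mapper(15.0), mapper(25.0))
```

Because the slope changes between segments, the correction is nonlinear over the whole run even though each segment is linear.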

First examples of alignment methods have made use of total ion chromatograms (TIC). Both dynamic time warping and the closely related correlation optimized warping have been used on TICs [50–52]. However, the usage of TIC data discards much of the information contained in the 2D LC-MS maps. Thus, methods incorporating the individual m/z values of peaks or raw data have become more common [47,48,53–55]. It was shown that the incorporation of nonlinear alignment is superior to linear alignment methods [56]. Additional information from MS/MS identifications has been used to perform semi-supervised RT alignment [57].

A recent and comprehensive review of alignment methods can be found in [58].

Normalization of peak abundances

Technical variations between and within experiments may result in ion abundance artifacts that complicate the direct comparison of abundances. For example, ion intensities may generally be higher in one sample than in another. Also, abundances might decrease or increase with RT due to variations in the LC column. Normalization methods aim to nullify such arbitrary but systematic variations and thus make abundances comparable both within and across samples. At the same time, true biological differences existing between samples have to be kept untouched. Many normalization methods have been introduced for DNA microarrays [59] that might either be directly usable or adaptable to LC-MS data. Often, global normalization methods are used to normalize mean or median ion counts of the whole datasets [35]. Comparisons of several normalization methods have shown that linear regression normalization methods work well for LC-MS data [60,61]. Kultima et al. have additionally incorporated run order in the normalization procedure [61].
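A global median normalization can be sketched as follows. The function name, the data layout and the choice of the median of the run medians as common target are assumptions; missing features are represented by None and simply skipped, which is only one possible way to treat them.

```python
from statistics import median

def median_normalize(runs):
    """Scale every run so that its median abundance equals a common target.

    runs: dict mapping run name -> list of feature abundances (None = missing).
    """
    run_medians = {
        name: median(v for v in values if v is not None)
        for name, values in runs.items()
    }
    target = median(run_medians.values())
    return {
        name: [None if v is None else v * target / run_medians[name] for v in values]
        for name, values in runs.items()
    }

# Hypothetical runs; run B is measured about twice as intense as run A
runs = {"A": [10.0, 20.0, 30.0], "B": [20.0, None, 60.0, 40.0]}
normalized = median_normalize(runs)
print(normalized)
```

Such a global scaling removes an overall intensity offset between runs but cannot correct trends within a run, for example along RT.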

A problem that has not been given much attention yet is that many of the more sophisticated methods from DNA microarrays need complete data. In particular, lower abundant peaks are generally not found in all LC-MS maps and, hence, normalization methods used for this kind of data have to be able to handle many missing values. Alternatively, methods from DNA microarrays for the estimation and imputation of missing data might also be applicable [62]. Some of these concepts have already been applied to proteomics data from difference gel electrophoresis (DIGE) [63,64]. However, the number of missing values in label-free proteomics will generally be a multiple of that observed in microarray experiments and, hence, it still needs to be assessed how well the individual methods perform on this kind of data.

Software solutions for complete preprocessing of LC-MS samples

By now, several groups have also introduced complete preprocessing pipelines. Radulovic et al. have introduced a complete preprocessing workflow for classification and biomarker detection in proteomic samples [38]. Li et al. performed preprocessing of LC-MS data to obtain a peptide versus sample array of peptide abundances [37]. Zhang et al. introduced Xalign [55], software for the detection and alignment of peaks. Several statistics are incorporated to evaluate the quality of individual LC-MS samples and detect possible outliers. XCMS [40] has been designed for metabolite profiling of LC-MS data and is programmed using the statistical software R [101]. The msInspect software offers algorithms for the preprocessing of both label-free and isotopic-labeling approaches and allows the integration of other novel algorithms [44]. The OpenMS proteomics pipeline (TOPP) is a toolbox comprising preprocessing methods for LC-MS data [34]. The individual components can be individually addressed and can be composed into complex pipelines. The software SuperHirn performs preprocessing to extract profiles of features and offers statistical tools for clustering and classification [46]. Finally, the Corra platform allows the integration of existing algorithms as well as statistical tools for the analysis of LC-MS data [65]. So far, SpecArray and SuperHirn as well as several R packages for statistical analyses from Bioconductor [66] have been integrated. For a detailed overview of quantification software both for label-free and labeling-based quantification, the reader may refer to [67].

Peptide & protein identification

The analysis steps described so far result in lists of quantified features. In many applications (e.g., detection of biomarkers), the identity of such a feature is important in order to understand the biological function. Shotgun proteomics usually involves automated MS/MS measurement of the most abundant peaks in each mass spectrum. The mass spectrometer isolates the peptides of the specified mass, the so-called precursor mass. The isolated peptides are then further fragmented and a mass spectrum of the fragments is measured. This is the MS/MS spectrum, which can then be used to infer the peptide’s sequence [68,69].

Identifying peptides via MS/MS spectra

The most common way to infer the peptide sequence from a fragment mass spectrum is to search protein sequence databases. Database search programs (e.g., SEQUEST [70], MASCOT [71], X!Tandem [72] or OMSSA [73]) compare the observed spectrum with theoretical spectra of the (usually tryptic) peptides generated from the sequence database used. Different scoring functions are used to match the observed spectrum with the theoretical spectra consisting of the theoretical fragment masses. Scoring functions either assess the similarity between the observed spectrum and theoretical spectra or reflect the probability for a match being a random match. Matching sequences, together with their corresponding scores, are documented for each MS/MS spectrum. Often, only the best scoring match is considered in the subsequent analysis.

A returned match can either be a true or a false positive. Thus, the natural question is how to identify true matches [74–76]. Unfortunately, the scores do not necessarily reflect the certainty for an identified match being correct. Target-decoy strategies have been proposed to address this problem [77,78]. The database is appended with reversed, randomized or shuffled sequences of the original database and then searched. The false discovery rate (FDR), which is the proportion of false positives among all positive identifications, can then be estimated on the basis of the decoy entries. Decoy entries are known to be false positives and have been used in several different ways to predict the number of false-positive matches. An FDR cutoff can then be set to control the proportion of false positives among the accepted matches.
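One of the several possible ways to use the decoy matches can be sketched as follows. The estimator used here (accepted decoy matches divided by accepted target matches) is an assumption chosen for simplicity, and the function names and scores are hypothetical.

```python
def estimate_fdr(matches, score_threshold):
    """Estimate the FDR among target matches scoring at or above a threshold.

    matches: list of (score, is_decoy) tuples, one per MS/MS spectrum.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= score_threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= score_threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets

def threshold_for_fdr(matches, max_fdr):
    """Lowest observed score whose estimated FDR does not exceed max_fdr."""
    best = None
    for score in sorted({s for s, _ in matches}, reverse=True):
        if estimate_fdr(matches, score) <= max_fdr:
            best = score
    return best

# Hypothetical search results: (score, is_decoy)
matches = [(90, False), (85, False), (80, False), (75, True),
           (70, False), (65, False), (60, True), (55, True)]
print(estimate_fdr(matches, 70), threshold_for_fdr(matches, 0.2))
```

In this toy example, accepting all matches with a score of at least 65 yields five target matches and one decoy match, that is, an estimated FDR of 0.2.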

An alternative to decoy strategies is the concept of posterior error probabilities, which has been used to determine whether an obtained match is a true positive. The posterior error probability is a measure for an individual peptide and returns the probability of the match being an incorrect match. PeptideProphet™ models the distributions of correct and incorrect matches with an empirical Bayes approach and, hence, derives the probability for a peptide match being correct [79]. A comparison of the methods of posterior error probabilities and FDR can be found in Käll et al. [80].

[Figure 2: schematic with panels A, B and C. Legend: Protein; Unique peptide; Non-unique peptide.]

Figure 2. Several scenarios for protein identification. (A) Two identifiable proteins, both having at least one unique peptide. (B) Proteins without unique peptides. Either proteins are not differentiable or proteins are subsets of a protein. All these proteins can be combined to one protein group. (C) The bottom protein cannot be unambiguously identified as its peptides are also present in identifiable proteins.

Deriving protein identifications

While peptide identifications can be derived from the MS/MS spectra, the true point of interest is the protein a peptide belongs to. Mapping of peptide identification results to obtain protein identifications is thus necessary. This has also been called the ‘protein inference problem’ [24], and, owing to ambiguities, this problem is far from trivial.

The identification results depend on the sequence database used. The database defines the search space of possible sequences and thus also restricts the possible matches. Available sequence databases vary both in terms of completeness and redundancy of database entries. For example, the National Center for Biotechnology Information database is highly redundant while the International Protein Index database has low redundancy [81]. The latter database even has representatives for highly similar sequences.

Even when different database entries represent distinct proteins, the associated peptides may well be redundant. Non-unique peptides (i.e., peptides that occur in more than one database entry) may derive from homologous proteins, protein isoforms and splicing variants, or from a domain shared by proteins that are otherwise rather different. A single gene may result in hundreds of different proteins whose sequences might largely be identical [82]. It is possible that only a few peptides are unique to one protein. It might even be the case that one protein is only a truncated form of another protein and thus does not have any unique peptide at all. A further problem is the under-sampling of MS/MS in shotgun proteomics approaches. As only a few MS/MS spectra are acquired within each round of MS, many peptides in a complex protein sample will not be identified at all. Especially when a protein has only a few unique peptides, it is unlikely that one of these unique peptides will be identified, which would be needed to unambiguously establish the presence of the protein within the complex protein mixture. Also, some peptides are not measurable for technical reasons (e.g., they are inefficiently ionized), making the corresponding proteins difficult to identify.

Nesvizhskii and Aebersold have proposed a set of guidelines to derive protein identification lists from MS/MS peptide identifications [24]. They propose the reporting of proteins with unique peptide identifications and the grouping of indistinguishable proteins (i.e., those with no unique identifications). Typical scenarios in protein identification are depicted in Figure 2. The additional application of Occam’s razor then results in a minimal list of proteins accounting for all identified peptides [83].
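The grouping and parsimony steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [24] or [83]: proteins with identical peptide sets are merged into one group, and a greedy set-cover heuristic (which approximates, but does not guarantee, a truly minimal list) selects groups until all identified peptides are explained. The protein and peptide names are hypothetical.

```python
from collections import defaultdict


def group_indistinguishable(protein_peptides):
    """Merge proteins that are identified by exactly the same peptide set."""
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return {tuple(sorted(names)): set(peps) for peps, names in groups.items()}


def minimal_protein_list(protein_peptides):
    """Greedy set cover: pick protein groups until every peptide is explained."""
    groups = group_indistinguishable(protein_peptides)
    uncovered = set().union(*groups.values()) if groups else set()
    selected = []
    while uncovered:
        # Choose the group explaining the most still-unexplained peptides;
        # ties are broken alphabetically to keep the result deterministic.
        best = min(groups, key=lambda g: (-len(groups[g] & uncovered), g))
        selected.append(best)
        uncovered -= groups[best]
    return selected


# Hypothetical identification result: protein -> identified peptides.
hits = {
    "P1": {"a", "b"},
    "P2": {"b", "c"},
    "P3": {"b"},        # only a shared peptide; explained by P1 or P2
    "P4": {"d", "e"},
    "P5": {"d", "e"},   # indistinguishable from P4
}
proteins = minimal_protein_list(hits)
```

In this example, P4 and P5 are reported as one group, while P3 is not reported because its only peptide is already accounted for by proteins that also have unique peptides.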

Similarly to peptides, proteins may be falsely identified. As proteins that have been identified by several peptides are more likely to be true, a specified minimum number of identified peptides or a specified sequence coverage per protein is sometimes used as a criterion for retaining a protein identification. However, short proteins usually have fewer identifiable peptides and may thus be harder to detect by this approach. ProteinProphet™ is a Bayes approach estimating the probability of a protein identification being correct [84]. And, similar to peptide identifications, FDR methods can be used to build lists of identified proteins with a specified proportion of false-positive identification matches.

Label-free quantification through spectral counting

An alternative approach for label-free quantification makes direct use of the identification results from MS/MS spectra. Washburn and colleagues have discovered a relationship between the number of MS/MS identifications assignable to a protein and the protein’s abundance in ‘multidimensional protein identification technology’ (MudPIT) experiments [85]. The automated selection of precursor masses for MS/MS analysis is biased towards more abundant peptides. Exclusion lists are regularly used in order to avoid multiple MS/MS measurements of the same peptide. However, very abundant peptides are usually detectable across a wide RT range and are still sampled repeatedly. This results in the observed relation between counts and protein abundance.

The number of MS/MS identifications has since been termed ‘spectral counts’. Early analyses used spectral counts merely as a semiquantitative measure, drawing conclusions from the presence or absence of counts [86] or testing for differences between samples on the basis of peptide counts per protein [87], but not quantifying the differences. Eventually, a spike-in experiment with known protein concentrations was used to establish the linear relationship between spectral counts and protein concentration over two orders of magnitude [20]. But even if the actual range of linearity is wider than yet shown, an individual protein of interest may not cover such a wide range of abundance. Good reproducibility of spectral counts and, thus, high accuracy for measuring large changes between samples were also observed. Moreover, Old et al. have found that quantification through spectral counts agrees well with quantification through ion counts [21].

Alternative sampling approaches

In addition to spectral counts, several related sampling statistics have been proposed for label-free quantification. A comparison of spectral counts with peptide counts and sequence coverage found spectral counts to be superior to both other sampling statistics [88]. Furthermore, summed peptide scores have been proposed as a measure of abundance and have been shown to be linearly related to the logarithm of concentration [89,90].

Modifications of the simple sampling statistics have been proposed to circumvent potential drawbacks inherent in the specific methods. For example, peptide counts have been adjusted by the number of theoretically observable peptides in the protein abundance index (PAI) as a measure for relative abundance [19]. This measure supposedly eliminates the bias in sampling towards longer proteins due to their larger number of peptides. An exponentially modified version of PAI, the emPAI, has been shown to be proportional to protein abundance [91]. The emPAI has been used for absolute quantification of isotope-labeled samples but it may also be used as a relative quantification measure in a label-free analysis setting. A method called absolute protein expression index (APEX) for absolute quantification in label-free proteomics has also been introduced [23]. APEX corrects the observed spectral counts of a protein by the prior expectation of observing each peptide, the total sampling depth and the confidence in protein identification.
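The two indices can be written down compactly. The sketch below assumes the commonly cited definitions, PAI as the number of observed peptides divided by the number of theoretically observable peptides and emPAI = 10^PAI − 1; how "observable" peptides are counted (e.g., by mass range or digestion rules) is left to the caller, and the function names are hypothetical.

```python
def pai(observed_peptides: int, observable_peptides: int) -> float:
    """Protein abundance index: observed / theoretically observable peptides."""
    if observable_peptides <= 0:
        raise ValueError("observable_peptides must be positive")
    return observed_peptides / observable_peptides


def empai(observed_peptides: int, observable_peptides: int) -> float:
    """Exponentially modified PAI, assumed here to be 10**PAI - 1."""
    return 10 ** pai(observed_peptides, observable_peptides) - 1
```

For example, a protein with 5 of 10 observable peptides identified has a PAI of 0.5 and an emPAI of about 2.16, while a protein with no identified peptides has an emPAI of zero.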


Processing of spectral count data

Unlike the ion count approach, the spectral count approach does not generally need any signal preprocessing. However, there is still one issue left for these data. The total number of identified MS/MS spectra and thus the total number of spectral counts may differ between samples. This might be the case because some MS/MS spectra could not be identified. Often, these differences between samples can be corrected through normalization methods comparable to the ones for ion count data. So far, spectral counting has mostly relied on global normalization methods that normalize the data to the total spectral counts [86,88].
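A global normalization to total spectral counts might look like the following sketch. It assumes that counts are scaled so that every sample ends up with the same total, taken here as the mean total across samples; this choice of common total, the function name and the input layout are illustrative assumptions rather than the exact procedure of [86,88].

```python
def normalize_to_total_counts(samples):
    """Scale spectral counts so that all samples share the same total count.

    `samples` maps sample name -> {protein: spectral count}.
    """
    totals = {name: sum(counts.values()) for name, counts in samples.items()}
    target = sum(totals.values()) / len(totals)  # common total: mean over samples
    normalized = {}
    for name, counts in samples.items():
        if totals[name] == 0:
            raise ValueError(f"sample {name!r} has no spectral counts")
        factor = target / totals[name]
        normalized[name] = {protein: count * factor for protein, count in counts.items()}
    return normalized
```

After normalization, a protein that makes up the same fraction of all counts in two samples receives the same normalized count in both, regardless of how many spectra were identified in total.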

Relative quantification through ratios & testing

Once abundance measures have been derived, they are used for the estimation of changes in abundances between different groups. Often, the ultimate goal is to find proteins that are discriminative between the groups. Abundance changes between samples are generally quantified relatively through the calculation of peptide or protein abundance ratios, where the mean abundance in one group is divided by the mean abundance of another group. Aside from the mere quantification of change, the significance of the change is usually calculated through appropriate statistical tests. A statistical test is used to assess whether a given hypothesis (e.g., there is no difference in abundance between the different groups) can be rejected. Usually, a p-value is calculated, which gives the probability of observing a change at least as large as the measured one if this hypothesis were true; a small p-value thus limits the risk of a false-positive result when the change is accepted as significant. A p-value cut-off can then be used to discriminate significant from insignificant abundance changes.

There are two types of errors that may occur when using a statistical test. The so-called α-error occurs when the test returns a positive result although in truth there is no change. The β-error, on the other hand, occurs when a test returns a negative result even though in truth there is a significant change. Possible test results and deduced statistical measures that can be used to evaluate a statistical test are summarized in Table 1.

The traditional statistical test for difference detection between two groups is the two-sample t-test. Assuming the data are derived from a normal distribution, the t-test assesses whether the mean abundances differ between the two groups. Aside from the assumption of normally distributed data, the t-test requires multiple samples in each group in order to estimate the standard deviations used in the probability calculations.

Generally, the abundance measures derived from label-free LC-MS data cannot readily be assumed to be normally distributed. Peak areas as well as peak heights are usually restricted to positive values or even values above a certain noise threshold. The usage of the log transformation has already been introduced for microarray experiments and many of the considerations made for that type of data also hold true for proteomics data [92]. The general notion is that log-transformed abundances are approximately normal.
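A two-sample comparison on log-transformed abundances can be sketched as below. The code assumes strictly positive abundances, a base-2 logarithm and the classical Student t statistic with pooled variance; all of these are illustrative choices. Note that the difference of mean log abundances corresponds to the log ratio of geometric means, not of arithmetic means. The p-value would be obtained from a t-distribution with the returned degrees of freedom, which is left to a statistics library.

```python
import math
from statistics import mean, variance


def log2_change_and_t(group_a, group_b):
    """Return (mean log2 difference b - a, pooled-variance t statistic, df).

    Requires at least two strictly positive values per group and some
    variation within the groups (otherwise the statistic is undefined).
    """
    log_a = [math.log2(x) for x in group_a]
    log_b = [math.log2(x) for x in group_b]
    n_a, n_b = len(log_a), len(log_b)
    diff = mean(log_b) - mean(log_a)
    pooled_var = ((n_a - 1) * variance(log_a) + (n_b - 1) * variance(log_b)) / (n_a + n_b - 2)
    t = diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, t, n_a + n_b - 2
```

For abundances of 100, 200 and 400 in one group and 800, 1600 and 3200 in the other, the mean log2 difference is 3 (an eightfold change) with a t statistic of roughly 3.67 on 4 degrees of freedom.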

An alternative way of testing for differences between groups when data are not normally distributed is to use nonparametric tests. The two-sample Kolmogorov-Smirnov test (K-S test) is frequently used for difference detection when data are not normal. Also, permutation tests have become very popular. These tests permute the group labels and recalculate the test statistic in order to approximate the distribution and derive data-dependent p-values.
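The permutation idea can be sketched for small groups, where all relabelings can be enumerated exactly. The sketch assumes the absolute difference of group means as the test statistic and a two-sided test; for larger groups one would sample random permutations instead of enumerating them. The function name is hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Enumerate every way of relabeling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        perm_a = [pooled[i] for i in chosen]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # The small tolerance guards against floating-point rounding.
        if abs(mean(perm_a) - mean(perm_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total
```

With three samples per group, only 20 relabelings exist, so the smallest attainable two-sided p-value is 0.1; this illustrates why permutation tests need a sufficient number of replicates to reach conventional significance levels.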

When an experiment consists of more than two groups, tests for differences between multiple groups can be used. Analysis of variance is a parametric method again based on certain parametric assumptions (e.g., normally distributed errors) while the Kruskal-Wallis test is a nonparametric alternative.

Multiple testing

Generally, a statistical test is performed for each peptide or protein ratio in the experiment, resulting in hundreds or thousands of tests. For each test, there is a certain probability of it being false positive. When more tests are performed, the number of expected false-positive test results increases accordingly. Hence, a correction for multiple testing is needed in order to limit the number of false-positive test results. Several methods have been introduced for this purpose. The first is the very conservative Bonferroni correction. This method controls the family-wise error rate across all tests, which is defined as the probability of observing at least one false-positive test result. A less strict possibility is to control the FDR. A modified p-value, called q-value, has been introduced and can be used to control the FDR [93]. DNA microarray experiments regularly make use of q-values and initial examples of FDR usage have also been introduced for proteomics studies [94,95].
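Both kinds of correction can be sketched in a few lines. The Bonferroni adjustment follows directly from the description above. For FDR control, the sketch uses the Benjamini-Hochberg adjustment, which is a simpler and widely used FDR procedure and is not identical to the q-value of [93]; it is swapped in here only for illustration.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values (controls the family-wise error rate)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For the p-values 0.01, 0.04, 0.03 and 0.005, the Bonferroni-adjusted values are 0.04, 0.16, 0.12 and 0.02, whereas the Benjamini-Hochberg-adjusted values are 0.02, 0.04, 0.04 and 0.02. At a cut-off of 0.05, the former retains two and the latter all four tests.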

Using ion count abundance measures for differential proteomics

A general problem in proteomics experiments is the appearance of missing values. These missing values are seen when some features cannot be matched to features in other maps. The peptides could either be absent from the sample or could have been missed during the peak detection process. Different workflows have been applied in response to this problem. One possible way is to simply report the value as missing. Alternatively, it is possible to go back to the raw LC-MS data and re-examine the ion intensities at the anticipated m/z and RT of the missing feature to fill in the missing values. Admittedly, it might be possible that the corresponding feature is actually not present in the sample or lost within the technical noise, and the question remains of how such features should be handled.

Table 1. Possible results from statistical tests and measures to assess a statistical testing procedure.

| Reality | Test result: positive | Test result: negative | Measure |
| --- | --- | --- | --- |
| Positive | True positive | False negative (β-error) | Sensitivity = true positive/(true positive + false negative) |
| Negative | False positive (α-error) | True negative | Specificity = true negative/(true negative + false positive) |
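The two measures can be computed directly from the four possible outcomes. The short sketch below uses the conventional definitions, in which sensitivity relates true positives to all truly positive cases and specificity relates true negatives to all truly negative cases (true negatives plus false positives); the function names are hypothetical.

```python
def sensitivity(true_positive: int, false_negative: int) -> float:
    """Fraction of truly positive cases that the test reports as positive."""
    return true_positive / (true_positive + false_negative)


def specificity(true_negative: int, false_positive: int) -> float:
    """Fraction of truly negative cases that the test reports as negative."""
    return true_negative / (true_negative + false_positive)
```

For instance, a test that detects 80 of 100 truly changed proteins and wrongly reports 10 of 100 unchanged proteins has a sensitivity of 0.8 and a specificity of 0.9.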


A further issue in quantifying peaks or features is that a feature reflects a peptide or even only one charge state of a specific peptide. Hence, a ratio or test for differences only detects the difference between samples on the peptide level. The actual point of interest, however, is usually the quantity or abundance ratio of the protein a peptide belongs to. When MS/MS spectra have been measured and identified, it is possible to match the identifications to features on the basis of m/z value and RT of the precursor ion. This ultimately allows the matching of different features to the protein they derive from. To derive abundance ratios for the whole protein, the corresponding peptide or feature ratios have to be combined.

Old et al. have proposed using methods from stable-isotopic labeling to derive protein ratios from peptide ratios [21]. In stable-isotopic labeling, protein ratios are typically derived by weighted averaging of peptide ratios [27,28]. Also, an estimate of the standard deviation of the protein ratio can be derived from the peptide ratios. Sometimes, outliers are detected with Dixon’s Q-test and removed before averaging.

The actual concern is whether the protein abundance is significantly different between the different groups. When peptide ratios have been combined to protein ratios, there is, however, no protein abundance available for the different sample groups but only the relative measure of the ratio. In this case, the one-sample t-test can be used. The usage of the logarithm of abundance ratios is advisable. Ratios are also restricted to positive values, thus the distribution of ratios is skewed. The logarithmic transformation removes the skewness and again yields approximately normally distributed values [21]. The t-test evaluates whether the log protein ratio is significantly different from zero, which corresponds to no change in protein abundance between the groups.
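The combination of peptide ratios into a protein ratio and the one-sample test can be sketched as follows. The sketch assumes an unweighted mean of log2 peptide ratios (the weighted averaging mentioned above is not reproduced here) and returns the one-sample t statistic against zero; the p-value would follow from a t-distribution with the returned degrees of freedom. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def protein_log_ratio(peptide_ratios):
    """Return (mean log2 ratio, standard deviation, t statistic vs. 0, df)."""
    logs = [math.log2(r) for r in peptide_ratios]
    n = len(logs)
    if n < 2:
        raise ValueError("at least two peptide ratios are needed")
    m = mean(logs)
    sd = stdev(logs)
    # One-sample t statistic: distance of the mean from zero in standard errors.
    # Identical ratios (sd == 0) leave the statistic undefined.
    t = m / (sd / math.sqrt(n))
    return m, sd, t, n - 1
```

For peptide ratios of 2, 4 and 8, the protein log2 ratio is 2 (a fourfold change) with a standard deviation of 1 and a t statistic of about 3.46 on 2 degrees of freedom.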

Using spectral count for differential proteomics

Similar to the considerations for ion count procedures, a protein may only be identified in a subset of the samples. Such missing values can either be due to the protein not being present or being present in only very low abundance in the corresponding samples. On the other hand, it could be that the protein is present but not detected because its peptides are masked by more abundant co-eluting peptides. In the first case, assigning a spectral count of zero would be appropriate. In fact, this is what is usually done. In the latter case, however, counting zero would be a considerable underestimation of the true (relative) abundance. In this case, it might be more appropriate to return a missing value. So far, no comprehensive evaluation of this problem has been undertaken and, thus, the best way to handle unidentified proteins in sample subsets cannot be definitively stated.

In contrast to the ion count approach, spectral counts are usually directly evaluated for the whole protein and not on the individual peptides. Thus, the two-sample t-test may be applied.

A comparison of the t-test with tests for independence (e.g., Fisher’s exact test and G-test) has shown superiority of the t-test regarding the false-positive rate [88]. For the Fisher’s exact test and the G-test, each spectral count is considered as a result from a Bernoulli experiment, where the count either belongs to the protein under consideration or does not belong to it. This way, neither test requires true biological replicates. Owing to this, the G-test has also been used elsewhere [21,96]. Yet, depending on the biological sample, high variations can be seen even for samples from the same conditions. Hence, biological experiments require replicates to ensure reliable and unbiased abundance difference estimates, and tests for independence should be handled with care.
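A G-test on spectral counts from two samples can be sketched as a test of independence on a 2×2 table (counts belonging to the protein versus all other counts, in sample A versus sample B). The sketch assumes no continuity correction and one degree of freedom, for which the chi-square tail probability can be written with the complementary error function; the function name and arguments are hypothetical.

```python
import math


def g_test_2x2(count_a, total_a, count_b, total_b):
    """G statistic and p-value for a protein's counts in two samples.

    count_*: spectral counts of the protein; total_*: all spectral counts.
    """
    observed = [[count_a, total_a - count_a], [count_b, total_b - count_b]]
    grand = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, grand - count_a - count_b]
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            if observed[i][j] > 0:  # empty cells contribute nothing
                g += observed[i][j] * math.log(observed[i][j] / expected)
    g = max(0.0, 2.0 * g)  # guard against tiny negative rounding errors
    # Chi-square survival function for one degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value
```

For 10 versus 30 spectral counts out of 1000 total counts per sample, this gives a G statistic of about 10.7 and a p-value of roughly 0.001. As noted above, such a result reflects sampling variation within two runs only and says nothing about biological variability.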

One spectral count approach, conceptually similar to the protein ratio derivation from ion count, was introduced and compared with the standard spectral count approach [96]. For this approach spectral counts were calculated for each peptide individually. The protein ratio was then calculated as an average of the peptide ratios. The standard spectral count approach performed better, but this result is based only on the replicate-free G-test.

Another approach for the evaluation of spectral count data models the counts as deriving from a Poisson distribution and estimates generalized linear mixed-effects models [22]. An advantage of this method is the easy extension of the model to several study designs, including multiple groups or time-course experiments.

Expert commentary

The question remains which of the two quantification approaches for label-free LC-MS should be preferred. Spectral count and ion count procedures have been compared [21,96] and the spectral count method seemed to perform slightly better. However, both studies have noted that low numbers of spectral counts result in noisy ratios that should not be used for analysis. In particular, when FDR methods are used to derive peptide and protein lists, only a few proteins are likely to be identified with large numbers of counts. Thus, a comprehensive analysis cannot be guaranteed. Also, the results for spectral counts of the aforementioned analyses are based on the G-test, which does not make use of replicates. We believe that a comparison between ion count and spectral count should incorporate the same number of replicates and the same statistical test concepts (i.e., t-test) to make the results more comparable and to ensure that differences are not only due to the different procedures used. Generally, we believe that quantification should be based on several replicates and, hence, the t-test should be preferred over, for example, the G-test.

Overall, there will not be one method that is best for every possible application. And even more so, there is not one true preprocessing routine for LC-MS data. The choice of method should, therefore, be made with close consideration of the aims of the study and technical equipment used.

An advantage of spectral counting is that it does not need any preprocessing. Thus, it is a ready and easy to use method for quantification that may be implemented in the laboratory without evaluating and implementing preprocessing algorithms or complete software solutions. By contrast, the ion count method for quantification does need such preprocessing steps. Especially in larger studies, these steps will be quite time consuming. Different results are probable when different preprocessing algorithms are used. Thus, when comparing results from different studies, one always has to keep the preprocessing routine in mind.


On the other hand, ion count methods may theoretically be used for classification without the need of identification [74]. But even when identities of peptides and proteins are of importance, ion count methods have a further advantage. Identifications of peptides are not necessary in all samples as a peptide identification can be matched across spectra through the feature matching process.

In fact, if identification is part of the research procedure, the arising questions and problems are automatically interwoven with quantification results. However, the two problems have so far, with few exceptions, only been considered separately. The way the protein inference problem is solved can especially affect quantification. Peptides may be attributable to more than one identified protein, and, hence, the ratio is in fact a combination of several ratios, one for each protein the peptide derives from. The simplest solution would be to remove the non-unique peptides from abundance calculations. However, especially when different protein isoforms with many shared peptides are identified in an LC-MS analysis, much of the information might be lost through this approach. Another ad hoc procedure has been proposed for spectral counts [22]: the possibility of weighting the abundances of non-unique peptides based on total spectral counts per protein has been reported but has not been evaluated. Another approach makes use of biologically related families of proteins with similar abundance ratios to incorporate non-unique peptides [97]. However, we believe that non-unique peptides will be more important when the corresponding proteins show opposite regulation or at least strong differences in regulation.

Herein, we want to propose a procedure to resolve the abundances for non-unique peptides to different proteins. The approach is mainly based on two assumptions. First, peptide ratios from the same protein are considered to be log-normal distributed around the true protein ratio. When a protein is only identified through unique peptides, the protein’s log ratio can thus be calculated as the mean of the corresponding peptide log ratios. A similar approach was introduced by Higgs et al. [45]. The next assumption is that the abundance for a non-unique peptide is the sum of abundances attributable to the different proteins the peptide belongs to. We then use a maximum-likelihood approach to apportion the measured abundances of a peptide to the individual proteins in such a way that the resulting ratios best fit the corresponding unique peptides ratios from the involved proteins. The solution to the optimization problem is found through an expectation-maximization-like algorithm. The derived individual abundances per protein can afterwards be used for ratio calculations as described before. We have, to date, evaluated this approach on simulated data.
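The second assumption can be made concrete with a deliberately simplified, noise-free case. The sketch below is not the maximum-likelihood procedure described above: it assumes exactly two proteins A and B sharing one peptide, two conditions, and protein ratios already estimated from unique peptides, and then solves the two resulting linear equations directly. All names are hypothetical.

```python
def apportion_shared_peptide(shared_1, shared_2, ratio_a, ratio_b):
    """Split a shared peptide's abundance between proteins A and B.

    shared_1, shared_2: measured abundance in conditions 1 and 2.
    ratio_a, ratio_b: condition-2 / condition-1 ratios of proteins A and B,
    estimated from their unique peptides.
    Returns {protein: (abundance in condition 1, abundance in condition 2)}.
    """
    if ratio_a == ratio_b:
        raise ValueError("ratios are equal; the split is not identifiable")
    # Solve shared_1 = a1 + b1 and shared_2 = ratio_a * a1 + ratio_b * b1 for a1.
    a1 = (shared_2 - ratio_b * shared_1) / (ratio_a - ratio_b)
    # Clamp to the feasible range; noise can push the solution outside it.
    a1 = min(max(a1, 0.0), shared_1)
    b1 = shared_1 - a1
    return {"A": (a1, ratio_a * a1), "B": (b1, ratio_b * b1)}
```

For a shared peptide measured at 100 and 125 in the two conditions, with protein A upregulated twofold and protein B downregulated twofold, the split assigns 50 to each protein in the first condition and 100 and 25, respectively, in the second. The example also shows why the split cannot be determined when both proteins have the same ratio.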

The consideration of peptide ratios and abundances is so far mainly used for ion count procedures, where measurements are derived for peptides. However, we believe that spectral count data could be handled in exactly the same way by using the individual spectral counts per peptide instead of the direct assembly to protein spectral counts.

Five-year view

The establishment of quantitative label-free MS is an important step towards large-scale proteomics experiments. While small studies with only few samples can only serve as an exploratory analysis, larger studies could actually build the basis for validation of potential biomarkers. Larger studies also allow the usage of classification methods. While there probably does not exist a single biomarker discriminatory for each disease, larger sets of markers hold the potential of being discriminatory. Classification methods would possibly allow the detection of such biomarker sets. A method that might be useful in this context is the suggestion to quantify LC-MS data and perform differential analysis on ion counts before identification. An additional round of LC-MS sample analysis can then be used for targeted MS/MS analyses at the m/z and RT values that have before been significantly regulated. This way, the targeted identification of interesting peptides is possible.

Also, protein sequence coverage could be increased through a semi-targeted approach. Once a protein has been identified in a first experiment, a second experiment could be used for a targeted search of the protein’s tryptic peptides to increase the confidence of the protein’s presence or derive more exact quantitative measures. To reduce the complexity of such searches, the concept of proteotypic peptides can help to restrict the necessary MS/MS analyses [98,99].

Furthermore, with increasingly larger studies and data from different databases, it will become possible to compare data from different types of ‘omics’ (e.g., genomics, proteomics and metabolomics). The integration of these studies, together with very exact quantification results, will also enable the usage of systems biology approaches. Such approaches have the potential of explaining disease mechanisms and are hence of high interest.

Another aspect in quantitative proteomics that we believe will increase in importance over the next few years is the analysis of post-translational modifications. These modifications are responsible for the activation or inactivation of proteins and, hence, contain further important information. So far, post-translational modifications are mostly only identified, but their quantification will become more and more important if disease mechanisms are to be truly understood.

Of course, technical developments are still ongoing both for high-pressure LC and mass spectrometry devices. Robustness and reproducibility are still improving with every new generation of instruments. Mass accuracy of new instruments will shortly reach the parts-per-billion region and more robust LC systems will allow better matching of peptides across samples. However, with better technical devices at hand, preprocessing methods will need to be adjusted and possibly new methods will have to be developed in order to allow analysis pipelines to grow with the future developments and needs.

Financial & competing interests disclosure

Proteomics Data Collection (ProDaC) was funded as a Coordination Action by the European Commission (6th framework programme, project number LSHG-CT-2006–036814). National Genome Research Network (NGFN) plus is funded by the Bundesministerium für Bildung und Forschung (BMBF), grant 01 GS 08143. This work is further funded by Cluster Industrielle Biotechnologie (CLIB) – contract number 616 40003 0315413B. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.


References

Papers of special note have been highlighted as:
• of interest
•• of considerable interest

1 Conrad DH, Goyette J, Thomas PS. Proteomics as a method for early detection of cancer: a review of proteomics, exhaled breath condensate, and lung cancer screening. J. Gen. Intern. Med. 23 Suppl. 1, 78–84 (2008).

2 Cravatt BF, Simon GM, Yates JR 3rd. The biological impact of mass-spectrometry-based proteomics. Nature 450(7172), 991–1000 (2007).

3 Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat. Biotechnol. 24(8), 971–983 (2006).

4 Sawyers CL. The cancer biomarker problem. Nature 452(7187), 548–552 (2008).

5 Monteoliva L, Albar JP. Differential proteomics: an overview of gel and non-gel based approaches. Brief. Funct. Genomic Proteomic 3(3), 220–239 (2004).

6 Klose J, Kobalz U. Two-dimensional electrophoresis of proteins: an updated protocol and implications for a functional analysis of the genome. Electrophoresis 16(6), 1034–1059 (1995).

7 Klose J. Large-gel 2-D electrophoresis. Methods Mol. Biol. 112, 147–172 (1999).

8 Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature 422(6928), 198–207 (2003).

9 Wang M, You J, Bemis KG, Tegeler TJ, Brown DP. Label-free mass spectrometry-based protein quantification technologies in proteomic analysis. Brief. Funct. Genomic Proteomic 7(5), 329–339 (2008).

10 Ong SE, Mann M. Mass spectrometry-based proteomics turns quantitative. Nat. Chem. Biol. 1(5), 252–262 (2005).

11 Moritz B, Meyer HE. Approaches for the quantification of protein concentration ratios. Proteomics 3(11), 2208–2220 (2003).

12 Putz S, Reinders J, Reinders Y, Sickmann A. Mass spectrometry-based peptide quantification: applications and limitations. Expert Rev. Proteomics 2(3), 381–392 (2005).

13 Hebeler R, Oeljeklaus S, Reidegeld KA et al. Study of early leaf senescence in Arabidopsis thaliana by quantitative proteomics using reciprocal 14N/15N labeling and difference gel electrophoresis. Mol. Cell. Proteomics 7(1), 108–120 (2008).

14 Wiese S, Reidegeld KA, Meyer HE, Warscheid B. Protein labeling by iTRAQ: a new tool for quantitative mass spectrometry in proteome research. Proteomics 7(3), 340–350 (2007).

15 Chelius D, Bondarenko PV. Quantitative profiling of proteins in complex mixtures using liquid chromatography and mass spectrometry. J. Proteome Res. 1(4), 317–323 (2002).

•• Proof of principle that peak areas from ion count data can be used as a measure for peptide abundance.

16 Wiener MC, Sachs JR, Deyanova EG, Yates NA. Differential mass spectrometry: a label-free LC-MS method for finding significant differences in complex peptide and protein mixtures. Anal. Chem. 76(20), 6085–6096 (2004).

17 Listgarten J, Emili A. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 4(4), 419–434 (2005).

18 Silva JC, Denny R, Dorschel CA et al. Quantitative proteomic analysis by accurate mass retention time pairs. Anal. Chem. 77(7), 2187–2200 (2005).

19 Rappsilber J, Ryder U, Lamond AI, Mann M. Large-scale proteomic analysis of the human spliceosome. Genome Res. 12(8), 1231–1245 (2002).

20 Liu H, Sadygov RG, Yates JR 3rd. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal. Chem. 76(14), 4193–4201 (2004).

•• First study to show that spectral counts are linearly correlated with protein concentration.

21 Old WM, Meyer-Arendt K, Aveline-Wolf L et al. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol. Cell. Proteomics 4(10), 1487–1502 (2005).

22 Choi H, Fermin D, Nesvizhskii AI. Significance analysis of spectral count data in label-free shotgun proteomics. Mol. Cell. Proteomics 7(12), 2373–2385 (2008).

23 Lu P, Vogel C, Wang R, Yao X, Marcotte EM. Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol. 25(1), 117–124 (2007).

24 Nesvizhskii AI, Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4(10), 1419–1440 (2005).

• Detailed description of the problem of inferring protein identifications from peptide identifications. Guidelines for deriving and reporting of minimal protein lists.

25 Kohl M, Redlich G, Eisenacher M et al. Automated calculation of unique peptide sequences for unambiguous identification of highly homologous proteins by mass spectrometry. J. Proteomics Bioinform. 1(1), 6–10 (2008).

Key issues

• There are two approaches to label-free quantification: either peptide ion count or spectral count.

• Label-free quantification is generally relative (ratios between different conditions).

• For ion count approaches, comprehensive preprocessing is necessary; solutions for either the whole process or individual tasks are available.

• Identification of spectra is not necessary to make use of quantified features, but is desirable in many applications.

• Spectral count is a ready-to-use method for label-free quantification, only requiring peptide and protein identification results.

• Both abundance measures and ratios can be assumed to be approximately log-normally distributed; this should be considered when using statistical tests.

• The use of replicates is highly recommended in order to detect true differences between sample groups.

• Protein quantification depends on the protein identification process.

• Ratios of non-unique peptides should be apportioned among proteins.


26 Alexandridou A, Tsangaris GT, Vougas K, Nikita K, Spyrou G. UniMaP: finding unique mass and peptide signatures in the human proteome. Bioinformatics 25(22), 3035–3037 (2009).

27 Li XJ, Zhang H, Ranish JA, Aebersold R. Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal. Chem. 75(23), 6648–6657 (2003).

28 MacCoss MJ, Wu CC, Liu H, Sadygov R, Yates JR 3rd. A correlation algorithm for the automated quantitative analysis of shotgun proteomics data. Anal. Chem. 75(24), 6912–6921 (2003).

29 Domon B, Aebersold R. Mass spectrometry and protein analysis. Science 312(5771), 212–217 (2006).

30 Petricoin EF, Ardekani AM, Hitt BA et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359(9306), 572–577 (2002).

31 Petricoin EF 3rd, Ornstein DK, Paweletz CP et al. Serum proteomic patterns for detection of prostate cancer. J. Natl Cancer Inst. 94(20), 1576–1578 (2002).

32 Lilien RH, Farid H, Donald BR. Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum. J. Comput. Biol. 10(6), 925–946 (2003).

33 Listgarten J, Neal RM, Roweis ST, Wong P, Emili A. Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics 23(2), e198–e204 (2007).

34 Kohlbacher O, Reinert K, Gropl C et al. TOPP – the OpenMS proteomics pipeline. Bioinformatics 23(2), e191–e197 (2007).

35 Wang W, Zhou H, Lin H et al. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal. Chem. 75(18), 4818–4826 (2003).

36 Barclay VJ, Bonner RF, Hamilton IP. Application of wavelet transforms to experimental spectra: smoothing, denoising, and data set compression. Anal. Chem. 69(1), 78–90 (1997).

37 Li XJ, Yi EC, Kemp CJ, Zhang H, Aebersold R. A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography–mass spectrometry. Mol. Cell. Proteomics 4(9), 1328–1340 (2005).

38 Radulovic D, Jelveh S, Ryu S et al. Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 3(10), 984–997 (2004).

39 Noy K, Fasulo D. Improved model-based, platform-independent feature extraction for mass spectrometry. Bioinformatics 23(19), 2528–2535 (2007).

40 Smith CA, Want EJ, O’Maille G, Abagyan R, Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal. Chem. 78(3), 779–787 (2006).

41 Yasui Y, McLerran D, Adam BL et al. An automated peak identification/calibration procedure for high-dimensional protein measures from mass spectrometers. J. Biomed. Biotechnol. 2003(4), 242–248 (2003).

42 Morris JS, Coombes KR, Koomen J, Baggerly KA, Kobayashi R. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 21(9), 1764–1775 (2005).

43 Du P, Kibbe WA, Lin SM. Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics 22(17), 2059–2065 (2006).

44 Bellew M, Coram M, Fitzgibbon M et al. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics 22(15), 1902–1909 (2006).

45 Higgs RE, Knierman MD, Gelfanova V, Butler JP, Hale JE. Comprehensive label-free method for the relative quantification of proteins from biological samples. J. Proteome Res. 4(4), 1442–1450 (2005).

46 Mueller LN, Rinner O, Schmidt A et al. SuperHirn – a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics 7(19), 3470–3480 (2007).

47 Lange E, Gropl C, Schulz-Trieglaff O et al. A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics 23(13), i273–i281 (2007).

48 Christin C, Smilde AK, Hoefsloot HC et al. Optimized time alignment algorithm for LC-MS data: correlation optimized warping using component detection algorithm-selected mass chromatograms. Anal. Chem. 80(18), 7012–7021 (2008).

49 Wang P, Tang H, Fitzgibbon MP et al. A statistical method for chromatographic alignment of LC-MS data. Biostatistics 8(2), 357–367 (2007).

50 Nielsen N-PV, Carstensen JM, Smedsgaard J. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. J. Chromatogr. A 805, 17–35 (1998).

51 van Nederkassel AM, Daszykowski M, Eilers PH, Heyden YV. A comparison of three algorithms for chromatograms alignment. J. Chromatogr. A 1118(2), 199–210 (2006).

52 Bylund D, Danielsson R, Malmquist G, Markides KE. Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography–mass spectrometry data. J. Chromatogr. A 961(2), 237–244 (2002).

53 Suits F, Lepre J, Du P, Bischoff R, Horvatovich P. Two-dimensional method for time aligning liquid chromatography–mass spectrometry data. Anal. Chem. 80(9), 3095–3104 (2008).

54 Sadygov RG, Maroto FM, Huhmer AF. ChromAlign: a two-step algorithmic procedure for time alignment of three-dimensional LC-MS chromatographic surfaces. Anal. Chem. 78(24), 8207–8217 (2006).

55 Zhang X, Asara JM, Adamec J, Ouzzani M, Elmagarmid AK. Data pre-processing in liquid chromatography–mass spectrometry-based proteomics. Bioinformatics 21(21), 4054–4059 (2005).

56 Podwojski K, Fritsch A, Chamrad DC et al. Retention time alignment algorithms for LC/MS data must consider non-linear shifts. Bioinformatics 25(6), 758–764 (2009).

57 Fischer B, Grossmann J, Roth V et al. Semi-supervised LC/MS alignment for differential proteomics. Bioinformatics 22(14), e132–e140 (2006).

58 Vandenbogaert M, Li-Thiao-Te S, Kaltenbach HM et al. Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics 8(4), 650–672 (2008).

• Very detailed review on retention time alignment methods.

59 Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185–193 (2003).

60 Callister SJ, Barry RC, Adkins JN et al. Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J. Proteome Res. 5(2), 277–286 (2006).


61 Kultima K, Nilsson A, Scholz B et al. Development and evaluation of normalization methods for label-free relative quantification of endogenous peptides. Mol. Cell. Proteomics 8(10), 2285–2295 (2009).

62 Troyanskaya O, Cantor M, Sherlock G et al. Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001).

63 Jung K, Gannoun A, Sitek B et al. Statistical evaluation of methods for the analysis of dynamic protein expression data from a tumor study. REVSTAT Stat. J. 4(1), 67–80 (2006).

64 Jung K, Gannoun A, Sitek B et al. Analysis of dynamic protein expression data. REVSTAT Stat. J. 3(2), 99–111 (2005).

65 Brusniak MY, Bodenmiller B, Campbell D et al. Corra: computational framework and tools for LC-MS discovery and targeted mass spectrometry-based proteomics. BMC Bioinformatics 9, 542 (2008).

66 Gentleman RC, Carey VJ, Bates DM et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5(10), R80 (2004).

67 Mueller LN, Brusniak MY, Mani DR, Aebersold R. An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J. Proteome Res. 7(1), 51–61 (2008).

•• Review focusing on software solutions for quantification of mass spectrometry data.

68 Hunt DF, Buko AM, Ballard JM, Shabanowitz J, Giordani AB. Sequence analysis of polypeptides by collision activated dissociation on a triple quadrupole mass spectrometer. Biomed. Mass Spectrom. 8(9), 397–408 (1981).

69 Hunt DF, Yates JR 3rd, Shabanowitz J, Winston S, Hauer CR. Protein sequencing by tandem mass spectrometry. Proc. Natl Acad. Sci. USA 83(17), 6233–6237 (1986).

70 Eng JK, McCormack AL, Yates JR 3rd. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989 (1994).

71 Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551–3567 (1999).

72 Fenyo D, Beavis RC. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75(4), 768–774 (2003).

73 Geer LY, Markey SP, Kowalak JA et al. Open mass spectrometry search algorithm. J. Proteome Res. 3(5), 958–964 (2004).

74 Boguski MS, McIntosh MW. Biomedical informatics for proteomics. Nature 422(6928), 233–237 (2003).

75 Patterson SD. Data analysis – the Achilles heel of proteomics. Nat. Biotechnol. 21(3), 221–222 (2003).

76 Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 4(10), 787–797 (2007).

77 Reidegeld KA, Eisenacher M, Kohl M et al. An easy-to-use Decoy Database Builder software tool, implementing different decoy strategies for false discovery rate calculation in automated MS/MS protein identifications. Proteomics 8(6), 1129–1137 (2008).

78 Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4(3), 207–214 (2007).

79 Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74(20), 5383–5392 (2002).

• PeptideProphet™: method for estimating the probability of a peptide match being correct.

80 Käll L, Storey JD, MacCoss MJ, Noble WS. Posterior error probabilities and false discovery rates: two sides of the same coin. J. Proteome Res. 7(1), 40–44 (2008).

81 Kersey PJ, Duarte J, Williams A et al. The International Protein Index: an integrated database for proteomics experiments. Proteomics 4(7), 1985–1988 (2004).

82 Black DL. Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell 103(3), 367–370 (2000).

83 Kohl M, Schönebeck B, May C et al. Protein List Comparator (ProLiC): a framework for comparison of protein lists. Clin. Proteomics 5(Suppl. 1, Poster C545), 104 (2009).

84 Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75(17), 4646–4658 (2003).

• ProteinProphet™: method for estimating the probability of a protein match being correct.

85 Washburn MP, Wolters D, Yates JR 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 19(3), 242–247 (2001).

86 Pang JX, Ginanni N, Dongre AR, Hefta SA, Opiteck GJ. Biomarker discovery in urine by proteomics. J. Proteome Res. 1(2), 161–169 (2002).

87 Gao J, Opiteck GJ, Friedrichs MS, Dongre AR, Hefta SA. Changes in the protein expression of yeast as a function of carbon source. J. Proteome Res. 2(6), 643–649 (2003).

88 Zhang B, VerBerkmoes NC, Langston MA et al. Detecting differential and correlated protein expression in label-free shotgun proteomics. J. Proteome Res. 5(11), 2909–2918 (2006).

89 Allet N, Barrillat N, Baussant T et al. In vitro and in silico processes to identify differentially expressed proteins. Proteomics 4(8), 2333–2351 (2004).

90 Colinge J, Chiappe D, Lagache S, Moniatte M, Bougueleret L. Differential proteomics via probabilistic peptide identification scores. Anal. Chem. 77(2), 596–606 (2005).

91 Ishihama Y, Oda Y, Tabata T et al. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol. Cell. Proteomics 4(9), 1265–1272 (2005).

92 Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin. 12, 111–139 (2002).

93 Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100(16), 9440–9445 (2003).

94 Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Stat. Sci. 18(1), 71–103 (2003).

95 Jung K, Poschmann G, Podwojski K et al. Adjusted confidence intervals for the expression change of proteins observed in 2-dimensional difference gel electrophoresis. J. Proteomics Bioinform. 2(2), 78–87 (2009).

96 Xia Q, Wang T, Park Y, Lamont RJ, Hackett M. Differential quantitative proteomics of Porphyromonas gingivalis by linear ion trap mass spectrometry: non-label methods comparison, q-values and LOWESS curve fitting. Int. J. Mass Spectrom. 259(1–3), 105–116 (2007).


97 Jin S, Daly DS, Springer DL, Miller JH. The effects of shared peptides on protein quantitation in label-free proteomics by LC/MS/MS. J. Proteome Res. 7(1), 164–169 (2008).

98 Kuster B, Schirle M, Mallick P, Aebersold R. Scoring proteomes with proteotypic peptide probes. Nat. Rev. Mol. Cell. Biol. 6(7), 577–583 (2005).

99 Mallick P, Schirle M, Chen SS et al. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 25(1), 125–131 (2007).

Website

101 R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. www.R-project.org

Affiliations

• Katharina Podwojski Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany Tel.: +49 234 322 9288 Fax: +49 234 321 4554

• Martin Eisenacher Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany Tel.: +49 234 322 9288 Fax: +49 234 321 4554

• Michael Kohl Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany Tel.: +49 234 322 9288 Fax: +49 234 321 4554

• Michael Turewicz Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany Tel.: +49 234 322 9275 Fax: +49 234 321 4554

• Helmut E Meyer Professor, Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany Tel.: +49 234 322 2427 Fax: +49 234 321 4554

• Jörg Rahnenführer Professor, Fachgebiet Statistische Methoden in der Genetik und Chemometrie, Fakultät Statistik, Technische Universität Dortmund, 44221 Dortmund, Germany Tel.: +49 231 755 3121 Fax: +49 231 755 5303

• Christian Stephan Medizinisches Proteom-Center, Ruhr-Universität Bochum, Zentrum für Klinische Forschung (ZKF 1), Universitätsstraße 150, 44801 Bochum, Germany Tel.: +49 234 322 9288 Fax: +49 234 321 4554