New data preprocessing trends based on ensemble of multiple preprocessing techniques

Puneet Mishra, Alessandra Biancolillo, Jean-Michel Roger, Federico Marini, Douglas Rutledge

To cite this version: Puneet Mishra, Alessandra Biancolillo, Jean-Michel Roger, Federico Marini, Douglas Rutledge. New data preprocessing trends based on ensemble of multiple preprocessing techniques. Trends in Analytical Chemistry, Elsevier, 2020, 132, pp.116045. 10.1016/j.trac.2020.116045. hal-02968798

HAL Id: hal-02968798, https://hal.inrae.fr/hal-02968798. Submitted on 16 Oct 2020. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License.

Contents lists available at ScienceDirect

Trends in Analytical Chemistry 132 (2020) 116045

journal homepage: www.elsevier.com/locate/trac

New data preprocessing trends based on ensemble of multiple preprocessing techniques

Puneet Mishra a, *, Alessandra Biancolillo b, Jean-Michel Roger c, d, Federico Marini e, Douglas N. Rutledge f, g

a Wageningen Food and Biobased Research, Bornse Weilanden 9, P.O. Box 17, 6700AA, Wageningen, the Netherlands
b University of L'Aquila, Department of Physical and Chemical Sciences, Via Vetoio, 67100, Coppito, L'Aquila, Italy
c ITAP, INRAE Montpellier, Institut Agro, University Montpellier, Montpellier, France
d ChemHouse Research Group, Montpellier, France
e Department of Chemistry, University of Rome "La Sapienza", Piazzale Aldo Moro 5, 00185, Rome, Italy
f Université Paris-Saclay, INRAE, AgroParisTech, UMR SayFood, 75005, Paris, France
g National Wine and Grape Industry Centre, Charles Sturt University, Wagga Wagga, Australia

Article info

Article history: Available online 19 September 2020

Keywords: Multivariate calibration; Ensemble learning; Multi-block analysis; Chemometrics; Preprocessing

* Corresponding author. E-mail address: [email protected] (P. Mishra).

https://doi.org/10.1016/j.trac.2020.116045
0165-9936/© 2020 The Author(s). Published by Elsevier B.V.

Abstract

Data generated by analytical instruments, such as spectrometers, may contain unwanted variation due to measurement mode, sample state and other external physical, chemical and environmental factors. Preprocessing is required so that the property of interest can be predicted correctly. Different correction methods may remove specific types of artefacts while still leaving some effects behind. Using multiple preprocessing techniques in a complementary way can remove the artefacts that would be left behind by using only one technique. This article summarizes the recent developments in new data preprocessing strategies and specifically reviews the emerging ensemble approaches to preprocessing fusion in chemometrics. A demonstration case is also presented. In summary, ensemble preprocessing allows the selection of several techniques and their combinations that, in a complementary way, lead to improved models. Ensemble approaches are not limited to spectral data but can be used in all cases where preprocessing is needed and identification of a single best option is not easily done.

© 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

In analytical chemistry, multivariate data are generally acquired from different measurement techniques for qualitative and quantitative analysis [1,2]. The techniques can range from miniature near-infrared (NIR) spectrometers [3,4] for measuring optical properties, to high-end, complex and costly techniques such as liquid or gas chromatography-mass spectrometry (LC-MS or GC-MS), or even their multidimensional implementations (e.g., GCxGC/MS) [5,6]. As a consequence, the data generated by these techniques can range from simple multivariate spectral data (as in the case of NIR) to multi-mode chromatograms (in the case of LC/GC-MS). However, what the data from multiple techniques have in common is that they all suffer from artefacts (unwanted variation) [7]. The presence of these artefacts can be ascribed to several causes, such as the measurement modality, instrumental drifts, sample state and other external physical, chemical and environmental factors [8]. As an example, one could think of NIR measurements performed in diffuse reflection mode on highly scattering materials [9]. In such a case, the acquired NIR spectra will be a combination of absorption and scattering characteristics. The absorption phenomenon will translate into the presence of peaks at specific wavelengths, whereas the scattering will manifest itself as additive effects, mainly modifying the baseline, and multiplicative effects, which dramatically affect linear models. Additionally, when the collected spectra are used for calibration, the attenuation in the signals brought by scattering may also affect model quality, not only in terms of predictive accuracy but also of interpretability [10]. As a consequence, under the above-mentioned circumstances one should always perform scatter correction, so that the spectra resemble as closely as possible the absorption profile and the corresponding models are based on such information only [10,11].
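The additive and multiplicative contributions described above can be made concrete with a small simulation. The sketch below is purely illustrative (a synthetic Gaussian absorption band, with an assumed baseline slope and scatter gain; it is not taken from the cited works):

```python
import numpy as np

# Hypothetical illustration of scatter artefacts on a NIR-like spectrum.
# All names and parameter values are assumptions for demonstration only.
wavelengths = np.linspace(1100, 2500, 700)                 # nm, NIR-like axis
pure = np.exp(-0.5 * ((wavelengths - 1900) / 60.0) ** 2)   # one absorption peak

additive = 0.15 + 1e-4 * (wavelengths - 1100)   # sloping baseline offset
multiplicative = 1.4                            # path-length / scatter gain

measured = multiplicative * pure + additive

# The additive term shifts the whole baseline; the multiplicative term scales
# peak intensities, which is what breaks linear (Beer-Lambert type) models.
print(measured.shape)   # → (700,)
```

Correcting such a spectrum amounts to estimating and undoing the offset and the gain, which is exactly what scatter-correction methods such as SNV and MSC attempt.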

Data preprocessing, in general, is required to correct for scattering, baseline changes, peak shifts, noise, missing values and several other artefacts so that the "true" chemically-relevant underlying structure can be highlighted and/or, if required, the property of interest can be predicted correctly [7,10,12]. Thanks to the extensive chemometric research of the last decades, several preprocessing techniques are now available [8]. However, there are no clear rules to decide when to use a specific preprocessing technique, whether a single technique could be enough and, if not, which techniques to combine and how (e.g., in what order) [7,13]. For instance, several techniques are available to perform scatter correction, such as standard normal variate (SNV), multiplicative scatter correction (MSC) and their several variants, calculation of the 2nd derivative (which, depending on the algorithm, requires setting additional parameters), or the recently introduced variable sorting for normalization (VSN) [8]. Accordingly, to identify which technique could be the best for her/his data, the user is often required to explore all the available options. Moreover, if there is the need to associate scatter correction with other operations, e.g., noise removal and scaling, the number of combinations to be explored increases exponentially. To overcome this problem, several approaches related to the identification of the best preprocessing and/or of the optimal combination of preprocessings have also been developed [14-17]. In this context, until recently, most of the attention in chemometrics was limited to finding the best preprocessing [15] or the optimal combination [14], and almost no focus was given to understanding what complementary information the different preprocessing techniques carry.

Abbreviations

AsLS: Asymmetric Least Squares
DoE: Design of Experiments
GAPLS: Genetic Algorithm Partial Least Squares
KDR: Known Data Regression
LC/GC-MS: Liquid Chromatography/Gas Chromatography-Mass Spectrometry
MCR: Multivariate Curve Resolution
MSC: Multiplicative Scatter Correction
NIR: Near Infrared
NMR: Nuclear Magnetic Resonance
OPLS: Orthogonal Partial Least Squares
PAT: Process Analytical Technologies
PCR: Principal Component Regression
PLSR: Partial Least Squares Regression
RNV: Robust Normal Variate
SNR: Signal-to-Noise Ratio
SNV: Standard Normal Variate
SPORT: Sequential Preprocessing Through Orthogonalization
SS-DAC: Soft Sensor Development, Assessment and Comparison
TSR: Trimmed Scores Regression
VSN: Variable Sorting for Normalization

Preprocessing methods always carry with them the risk of also removing relevant chemical information or variation that is related to the property of interest [7,13,14]. On the other hand, different preprocessing techniques may remove certain types of artefacts while still leaving some other effects behind. Using multiple preprocessing techniques in a complementary way can remove the artefacts that would be left behind by using only one technique. Thanks to recent developments in ensemble [18,19] and data fusion methods [20], the complementary fusion of multiple preprocessing techniques is now possible. The complementary fusion of preprocessing techniques and their combinations has two major benefits: the first is that it allows all the complementary information linked to different preprocessing techniques and their combinations to be used synergistically to develop models, and the second is that it takes the user out of the loop of searching for the best preprocessing and its combinations. Several recent works have shown that the complementary fusion of preprocessing techniques can lead to high-quality models [19-21].

This article summarizes the recent developments in new data preprocessing strategies and specifically reviews the emerging ensemble approaches to preprocessing fusion in chemometrics. A demonstration case of using an ensemble preprocessing fusion approach called sequential preprocessing through orthogonalization (SPORT) is also presented.

2. Artefacts in data: background causes

Several artefacts may be present in multivariate data obtained from analytical techniques. Broadly, they can be classified into five main categories, i.e., missing data, noise, baseline shifts, multiplicative effects and peak shifts. Each artefact has its own background cause(s), which, in the case of spectroscopy, can range from a human error during measurements to a complex interaction of light with the physical structure of the sample. In the remainder of this Section, the various typologies of artefacts and their causes will be briefly discussed.

Missing data refers to the lack of one or more entries in the matrix containing the experimental data. Missing data can occur for several reasons, such as values outside the instrument range, instrument malfunctioning at a certain point in time, communication failure between the instrument and the digital controller, multiple sensors deployed for the same task but at different sampling rates, saturation of the signal intensity and instrumental errors during data acquisition [22]. Apart from these technical reasons, sometimes the missing values are due to human errors, and arise in cases when repeating a part of the measurements may be impossible or too expensive [23]. In Fig. 1, the occurrence of missing data is represented by the red dotted line, which, compared to the original spectrum (solid blue line), lacks certain spectral bands.

Noise is an unwanted disturbance in a signal [24] and its causes can be traced back to, e.g., the sensitivity of the instrument's detectors or, in the case of optical spectroscopy, the excitation source (electromagnetic light source). The presence of noise is usually expressed as a low signal-to-noise ratio (SNR). Most often, the noise is visually identifiable by plotting the dataset. An example of a noisy signal is shown in Fig. 1 as a solid purple line: compared to the original signal (solid blue), which does not present any noise, the perturbed spectrum is characterized by a disturbance at the extreme wavelengths. This type of noise is commonly encountered in NIR data due to the low detectivity of the sensors at the extreme frequencies.

Variables with huge intensity differences may be encountered in several analytical platforms, such as mass spectrometry and NMR. However, such huge differences in signal intensities are rarely encountered in the domain of optical spectroscopy. These are not really artefacts but rather intrinsic characteristics of such data sets. In such cases, the aim is always to transform/normalize the variables so that they span a comparable range of variability.
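A common way to bring such variables onto a comparable range is autoscaling: mean-centring each variable and dividing by its standard deviation. A minimal sketch, with made-up numbers chosen only to illustrate the disparate scales:

```python
import numpy as np

def autoscale(X):
    """Column-wise autoscaling: centre each variable and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std

# Two variables with hugely different intensity ranges, as in MS or NMR data.
X = np.array([[1000.0, 0.01],
              [1200.0, 0.03],
              [ 900.0, 0.02]])
Z = autoscale(X)
print(Z.std(axis=0, ddof=1))   # both columns now have unit variance
```

After autoscaling, each variable contributes comparably to distance- or variance-based models, regardless of its original intensity range.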

Baselines are structured background effects ("continuous" and low frequency) and the reasons for their occurrence in the signals are, in general, dependent on the specific measurement technique involved. For example, in Raman spectroscopy, the main cause of a non-zero baseline is the background fluorescence of the samples, whereas in NIR, the baselines are caused by stray light and the interaction of the light with the particles or droplets, leading to scattering; on the other hand, in chromatography, the baseline corresponds to the detector response when only the mobile phase emerges from the column. The presence of a linear baseline is exemplified as a solid yellow line in Fig. 1, where it is evident that such an effect results in an increasing difference in intensity with respect to the original signal (solid blue).

Multiplicative effects are artefacts whose extent depends linearly on the intensity of the original signal. They may be caused, among others, by physical phenomena, such as light scattering in the case of optical spectroscopy, or by a not completely reproducible sample manipulation or presentation (e.g., volume of injected sample in chromatography or dilution in nuclear magnetic resonance). In the case of optical techniques, multiplicative effects are caused by the forced deviation of the photons from a straight trajectory by localized non-uniformities present in the samples. In chromatography, multiplicative effects are related to the very slight variations in the amount of sample being injected and analyzed in the different runs [25]. In NMR-based metabolomics, especially in the analysis of urine, multiplicative effects result from unspecific variations of the overall concentrations of samples, due to different dilutions. In Fig. 1, the dashed green line represents a signal with multiplicative effects: when compared to the original profile (solid blue line), it is evident that the differences in intensity are more pronounced at those wavelengths where the unperturbed spectrum has a higher signal.
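To illustrate how a model-based correction handles such effects, the classic multiplicative scatter correction (MSC) regresses each signal against a reference and removes the fitted additive offset and multiplicative gain. A minimal sketch (the reference is supplied explicitly here; in practice the mean spectrum of the data set is typically used):

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative scatter correction: regress each spectrum on a reference
    (by default the mean spectrum) and remove the fitted offset and gain."""
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)      # fit x ≈ a + b * ref
        corrected[i] = (x - a) / b        # undo additive and multiplicative parts
    return corrected

# Spectra that differ from the reference only by an offset and a gain
# collapse exactly onto the reference after correction.
ref = np.sin(np.linspace(0, np.pi, 50))
X = np.vstack([1.3 * ref + 0.2, 0.8 * ref - 0.1])
print(np.allclose(msc(X, ref), ref))      # → True
```

On real spectra the collapse is only approximate, since chemical variation is also present in the regression, which is one motivation for the weighted variants (such as VSN) discussed later.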

Peak shifts are horizontal displacements in signals which, otherwise, should ideally be aligned (i.e., present maxima at the same frequencies, in the case of spectroscopy, or retention times, in the case of chromatography). Accordingly, there are two main types of shifts, namely temporal and spectral shifts. Temporal shifts are common in chromatography, where they can be due to the deterioration or aging of the stationary phase, matrix effects, temperature changes, changes in the mobile phase composition, instrumental drift, interactions between analytes, fluctuations in pressure and flow rates and the presence of gas bubbles [26]. Spectral shifts are common in techniques like optical spectroscopy, NMR and mass spectrometry, where they can be due to variations in temperature, pressure, viscosity and pH, but also to instrumental reasons [26]. An example of peak misalignment is shown in Fig. 1, where the solid cyan line is the shifted version of the original spectrum (solid blue). When such an effect is present, peak alignment is fundamental for optimal modelling.

Fig. 1. Example of several simulated artefacts for visible and near-infrared data. Data with no unwanted effect (solid blue line), multiplicative effects (dashed green line), peak shifts (solid cyan line), noise at extreme bands (solid purple line), missing data (dotted red line), and baseline shift (solid brown line).
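A minimal rigid-shift alignment can illustrate the idea: find the integer shift that best matches a signal to a reference, then undo it. This is only a toy version of alignment (real warping methods, discussed later, allow non-rigid local shifts), and all parameter values below are assumptions:

```python
import numpy as np

def align_peak(signal, target, max_lag=20):
    """Rigid-shift alignment: find the integer shift (within ±max_lag) that
    maximizes the match between `signal` and `target`, then undo it.
    np.roll wraps around, which is acceptable here because the tails are
    near zero."""
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [float(np.dot(np.roll(signal, -lag), target)) for lag in lags]
    best = lags[int(np.argmax(scores))]
    return np.roll(signal, -best), best

x = np.linspace(0, 1, 200)
target = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)   # reference peak at 0.5
shifted = np.roll(target, 7)                      # simulated peak shift
aligned, lag = align_peak(shifted, target)
print(lag)                                        # → 7
```

Warping methods such as correlation optimized warping generalize this by estimating a different shift for each segment of the signal rather than one global lag.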

3. Objectives of data preprocessing

The global objective of data preprocessing is to remove the unwanted variability or effects from the signal so that the useful information related to the property(-ies) of interest can be used for efficient modelling. The specific objectives of the preprocessing techniques depend on the type of artefacts to be dealt with. Following the same classification reported in the previous section, in the case of missing data, the objective of preprocessing is to estimate the missing values using data imputation approaches, a detailed description of which can be found in a dedicated review [23]. In the case of noisy data, the objective is to remove/reduce the noise, and this can be achieved through several approaches, such as manual removal of the noisy part of the signal or time series, use of filtering algorithms [27] or smoothing functions such as polynomial fitting and spline interpolation [28], and data reconstruction approaches such as principal/independent component reconstruction [8] or wavelet thresholding and reconstruction [29]. In the case of a non-zero baseline, the objective is to identify the type/order of the baseline, estimate it accordingly, and then subtract it from the overall signal. Baseline modelling/removal can be achieved with approaches such as offset corrections, detrending, asymmetric least squares (AsLS) and several others [8]. In the case of multiplicative effects, the objective is to remove/reduce the global intensity differences in the signal, either by the use of model-free approaches, such as normalization techniques [8], or with model-based strategies which use a reference signal to estimate the multiplicative effects, in order to remove them by mathematical operations [30]. In the case of peak shifts, the objective is to align the peaks along either the spectral or the time domain. The basic steps of peak alignment are to identify a reference signal (the signal with respect to which the other signals are aligned) and to warp the experimental profiles along the time or spectral axis, so as to match the reference as closely as possible. In order for alignment to be effective, the target signal should present as many of the peaks present in the experimental profile as possible. Moreover, the extent of matching between the signals and the target can be limited to key peaks, which can be identified by means of approaches such as bucketing or binning and landmark peak selection, or evaluated across the whole signal. Several approaches to perform peak alignment are available and can be found in a recent review [26]. A further detailed summary of all the most commonly used chemometric preprocessing techniques can be found elsewhere [8], together with the MATLAB codes for their implementation.
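Of the baseline strategies just listed, detrending is the simplest to sketch: fit a low-order polynomial to the whole signal and subtract it. A minimal illustration on synthetic data (first-order baseline; on real spectra the peaks would bias such a global fit, which is one reason weighted methods like AsLS exist):

```python
import numpy as np

def detrend(y, order=1):
    """Detrending: fit a polynomial baseline of the given order to the whole
    signal and subtract it. Crude compared with AsLS, but often sufficient
    for slowly varying, low-order baselines."""
    x = np.arange(len(y), dtype=float)
    coeffs = np.polyfit(x, y, order)
    return y - np.polyval(coeffs, x)

# A pure linear baseline is removed (near) exactly by a first-order fit.
x = np.arange(100, dtype=float)
baseline = 0.5 + 0.02 * x
corrected = detrend(baseline, order=1)
print(np.allclose(corrected, 0.0))   # → True
```

The choice of polynomial order mirrors the "identify the type/order of the baseline" step described above: an order too low leaves baseline behind, an order too high starts absorbing real peaks.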

4. Recent advances in preprocessing methods

Preprocessing plays a fundamental role in chemometrics. However, there is no preprocessing technique which can be considered a gold standard or, anyway, be blindly applied to any data irrespective of their nature; on the contrary, preprocessing techniques are usually chosen based on the data source and the specific artefacts. A summary of the preprocessing techniques which, across the years, have become well established in chemometrics and are most frequently used can be found elsewhere [8]. Alongside these, new preprocessing techniques, which are summarized in Table 1, have been developed over the past decade. The details reported in Table 1 point out how new strategies have emerged for the correction of all the types of artefacts previously discussed, i.e. for missing data imputation, noise removal, baseline correction, multiplicative effect correction and peak alignment. As far as the imputation of missing data is concerned, new methods have been proposed, considering the problems of both model exploitation and model building in the presence of missing data. In particular, to deal with model building in the scenario of missing data, methods which combine known data regression (KDR) and trimmed scores regression (TSR) with principal component regression (PCR) or partial least squares regression (PLSR) have emerged [31]. A free toolbox implementing several novel strategies for missing data imputation can be accessed in Ref. [22].

In the case of noise removal, the new emerging trend is the use of autoencoders [32]. An autoencoder is a type of bottleneck artificial neural network that is used to learn efficient data codings in an unsupervised manner. In practice, an autoencoder is a multi-layer feed-forward network, with at least three hidden layers, the intermediate of which (the one providing the mapping) is characterized by a rather small number of neurons. The network is trained using the original signals both as inputs and as targets, so that, through the bottleneck architecture, the autoencoder extracts non-linear salient features from the data, leaving the noise unmodeled. The encoded data can then be decoded to reconstruct the original signal, but without noise [32]. Another advancement in noise removal is related to the possibility of automating the task: for this purpose, automatic threshold selection for shearlet coefficients was proposed in Ref. [29].
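The bottleneck idea can be sketched with a deliberately minimal linear autoencoder in plain numpy. Real denoising autoencoders are deeper and non-linear; all sizes, the learning rate and the synthetic data below are illustrative assumptions only:

```python
import numpy as np

# Minimal numpy sketch of a bottleneck autoencoder used as a denoiser.
# Low-rank "signals" plus noise: the 3-unit bottleneck can represent the
# signal subspace but not the full-rank noise, which is the denoising idea.
rng = np.random.default_rng(0)
n, d, k = 200, 20, 3                          # samples, signal length, bottleneck
clean = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
X = clean + 0.05 * rng.normal(size=(n, d))

W_enc = 0.1 * rng.normal(size=(d, k))         # encoder weights
W_dec = 0.1 * rng.normal(size=(k, d))         # decoder weights
lr = 1e-3

losses = []
for _ in range(1000):
    Z = X @ W_enc                             # encode through the bottleneck
    X_hat = Z @ W_dec                         # decode / reconstruct
    err = X_hat - X                           # train with input == target
    losses.append(float(np.mean(err ** 2)))
    g_dec = Z.T @ err / n                     # gradient of squared error
    g_enc = X.T @ (err @ W_dec.T) / n
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

print(losses[0] > losses[-1])                 # reconstruction error decreased
```

A linear bottleneck like this can only recover a PCA-like subspace; the non-linear activations of real autoencoders are what allow them to capture non-linear salient features, as described above.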

Baseline correction methods can broadly be divided into three subgroups, i.e., polynomial fitting, wavelet transformations and derivatives. The new direction in baseline correction involves combining wavelet decomposition, signal differentiation and baseline fitting with penalized least squares [33]. The new strategy combining multiple baseline correction approaches shows that the different techniques carry complementary information and their combination leads to improved removal of baselines from Raman spectra [33]. Some new developments were related to the advancement of adaptive reweighting strategies to support polynomial weighting and penalized least squares [34,35]. The main benefit of the adaptive reweighting approach is that it does not require user intervention or prior information. The multiple-spectra baseline fitting approach is also new and uses multiple signals to infer their common characteristics in order to learn slowly varying baselines [36]. The main aim of the multiple-spectra baseline correction approach is to learn baselines that perform well on the corresponding spectra and then to co-regularize the choices by penalizing the disagreements among the baseline-corrected spectra based on asymmetric least squares [36]. Even if it doesn't involve the actual correction of the signals, it is worth mentioning that another approach based on Tikhonov regularization was recently proposed to deal with the presence of a non-constant baseline in the framework of regression model building [37]. In particular, the approach uses the Tikhonov regularization extra terms in the linear calibration problem to force the regression coefficients to be orthogonal to the direction(s) of the sample space spanned by the baseline effects [37].
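To make the penalized-least-squares family concrete, here is a compact baseline estimator with asymmetric weights in the spirit of AsLS: a smoothness penalty on second differences, with points above the current baseline estimate (the peaks) downweighted at each iteration. This is a simplified dense-matrix sketch, not the implementation of the cited works; the values of lam, p and the iteration count are assumptions:

```python
import numpy as np

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline: minimize the weighted residual plus
    lam times the squared second differences of the baseline, iteratively
    setting weight p for points above the baseline and 1-p for those below."""
    m = len(y)
    D = np.diff(np.eye(m), n=2, axis=0)   # (m-2, m) second-difference operator
    penalty = lam * D.T @ D
    w = np.ones(m)
    for _ in range(n_iter):
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        w = np.where(y > z, p, 1.0 - p)   # downweight points above the baseline
    return z

# A Gaussian peak on a sloping baseline: the estimate follows the slope,
# not the peak, so subtraction recovers the peak.
x = np.arange(300, dtype=float)
signal = np.exp(-0.5 * ((x - 150) / 8.0) ** 2)
baseline = 0.002 * x + 0.1
z = asls_baseline(signal + baseline)
corrected = signal + baseline - z
```

Sparse-matrix solvers are normally used for long signals; the dense solve above is kept only for self-containedness.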

As far as the correction of multiplicative effects is concerned, the major development has been the introduction of the possibility of weighting the individual variables according to how likely they are to be affected by the unwanted contribution to be removed. In this context, a new method called variable sorting for normalization (VSN), which utilizes a random consensus strategy to estimate the weights before implementing normalization in a data-driven, hypothesis-free fashion, was recently proposed [38]. Local preprocessing approaches are also emerging, which divide the signal into parts and apply existing preprocessing methods, such as SNV, locally (i.e. individually on each of the parts) [39]. Other new methods, which combine different approaches to correct the multiplicative effects, have also been reported [40]. Robust variants of traditional preprocessing methods are also gaining importance due to their capability to deal with outliers, leading to optimized preprocessing of data [16]. For peak alignment, an increasing number of approaches are appearing in the literature, adopting the concept of point matching inspired by the computer vision domain [41,42]. Another emerging trend is the combination of data decomposition methods, such as multivariate curve resolution (MCR), with warping approaches [43,44]. Recent studies show that nowadays the focus is on automated methods which do not require reference signals and perform global alignments to deal with data rich in features [41,42,45-47].
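The local preprocessing idea is easy to sketch for SNV: split the signal into contiguous segments and normalize each one separately, so that a wavelength-dependent gain is corrected piecewise. The segment count below is an arbitrary assumption (the cited work discusses how to choose it):

```python
import numpy as np

def snv(x):
    """Standard normal variate: centre and scale a single signal."""
    return (x - x.mean()) / x.std()

def local_snv(x, n_segments=4):
    """Local SNV: split the signal into contiguous intervals and apply SNV
    to each part separately, as described for the LSNV approach above."""
    parts = np.array_split(x, n_segments)
    return np.concatenate([snv(p) for p in parts])

# A signal whose gain drifts along the axis: plain SNV uses one global mean
# and scale, whereas local SNV corrects each interval on its own.
x = np.linspace(0, 4 * np.pi, 400)
y = np.sin(x) * (1 + 0.5 * x / x.max())   # wavelength-dependent gain
z = local_snv(y, n_segments=4)
print(z.shape)   # → (400,)
```

After correction, each segment has zero mean and unit variance, which is exactly the local application of a standard preprocessing that makes LSNV simple to implement.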

5. Preprocessing selection approaches and their limitations

The choice of preprocessing techniques can greatly affect model performance, and preprocessing selection by the user can very often be sub-optimal, as the user cannot directly explore all preprocessing techniques and their combinations. In the beginning, the selection of the preprocessing approach was based on trial and error. However, with the advancement of computing power and the evolution of chemometrics, in recent years several approaches for the selection and optimization of preprocessing strategies have been proposed (Table 2). For example, a novel approach based on orthogonal PLS (OPLS) was proposed to evaluate the performances of individual preprocessing techniques and their combinations. The OPLS-based approach decomposes the data into common and distinct variation, allowing one to evaluate which part of the original information is retained and which new patterns emerge in the data after a preprocessing technique is applied [15]. On the other hand,

Table 1Summary of the strategies for preprocessing developed in the past decade in chemometrics.

Artefacts type Techniques Background principle Novel features References

Missing data � Known data regression with principalcomponents regression (KDR-PCR) orpartial least squares regression (KDR-PLSR)

� Trimmed scores regression withprincipal components regression(TDR-PCR) or partial least squaresregression (TDR-PLSR)

� A particular case of generalizedregression model

� The regressionmodel is fitted and themissing part is estimated

� Loadings obtained from PCA or PLSRdepending on the methodimplemented

� Allows model development andapplication in presence of missingdata

� Both data visualization andpredictive analysis in presence ofmissing data is possible

[22,31]

Maximum likelihood PCA based dataimputation

� Assigns high variance to the missingvalues prior to PCA allowing to fitPCA model by disregarding themissing points

� Allows PCA in presence of missingvalues

[48]

Noise removal Autoencoders � Bottleneck neural network for theextraction of non-linear features(non-linear data compression)

� Automatic noise removal [32]

Shearlet-based denoising � Automatic thresholding for shearletcoefficients

� Reconstruction of signal afterthresholding allows noise removal

� Automatic preprocessing� Suited for imaging spectroscopy data

[29]

Baselines correction Continuous wavelet transform þ peakwidth estimation by signal to noiseratio enhancingderivative þ background fitting usingpenalized least squares with binarymasks

� A three-step process involving accu-rate detection of peak position, fol-lowed by peak width estimation andlastly background fitting usingpenalized least squares using binarymasks

� Does not require any preprocessingfor transformation of spectra intothe wavelet space

� Combines wavelet transformation,derivatives and polynomial fitting

[33]

Adaptive reweighing scheme forpolynomial fitting and penalized leastsquares

� iteratively weights are changes forsum squares errors (SSE) betweenthe fitted baseline and originalsignals, and theweights of the SSE areobtained adaptively using thedifference between the previouslyfitted baseline and the originalsignals

� Automated approach to baselinecorrection

[35]

Multispectra baseline correction withasymmetric least squares

� Uses multiple signals to infer thecommon characteristics to learnslowly varying baselines

� Fast and outputs multiple baselinessimultaneously

[36]

Tikhonov regularization � Uses regularization to force theregression coefficients to beorthogonal to the directions ofmultiplicative effects

� Direct regression modelling byremoving the influence of baselines

[37]

Multiplicative effectscorrection

Variable sorting for normalization(VSN)

� Estimates weights for variables basedon severity of multiplicative effects

� Weights are estimated automatically� Weights can be incorporated into

traditional normalization techniquessuch as SNV, MSC etc.

[38]

Local standard normal variate (LSNV) � Splits the signal into intervals andperforms SNV separately for eachlocal part

� Easy to implement as it just consistsin the local application of a standardpreprocessing

[39]

First derivative þ simple spectral ratio(FD-SR), Linear regressioncorrection þ simple spectral ratio (LRC-SR), Orthogonal spatialprojection þ simple spectral ratio (OPS-SR)

� Uses two steps� The first one corrects for additive

effects using either first derivative,linear regression correction ororthogonal spatial projection

� The second step uses simple spectralratio for the correction ofmultiplicative effects

� Can estimate separate additive andmultiplicative correction factors foreach signal unlike tradition methodswhich use a general statistics such asmean and standard deviation forcorrection

[40][40][40]

Group aggregating normalization(GAN)

� Uses group assignment to estimatethe dilution factor to perform thenormalization

� Developed specifically to deal withthe multiplicative effects due todilution

[49]

Robust standardization, SNV, MSC,Detrending, offset correction

� Incorporates median and meanabsolute deviation to traditionalpreprocessing techniques

� Removes the effect of outliers duringthe corrections

[16]

Peaks alignments Multivariate curveresolution þ correlation optimizedwarping (MCR þ COW)

� At first, reduces complexity of thesignals by curve resolution and lateruses COW to align the profiles.

� Minimizes the risk of aligning non-corresponding information

[43][44]

Automatic time shift alignment (ATSA) Three step process:� Automatic baseline correction and

peak detection to provide usefulchromatographic information

� Preliminary alignment throughadaptive segment partition tocorrect alignment for the entirechromatogram

� Suited for chromatographic data� Parameters such as peak information,

segment size are automaticallyoptimized

[50]

(continued on next page)

P. Mishra, A. Biancolillo, J.M. Roger et al. Trends in Analytical Chemistry 132 (2020) 116045


Table 1 (continued)

Artefacts type | Techniques | Background principle | Novel features | References

• Precise alignment based on test chromatographic peak information to accurately align time shifts

Coherent point drift peak alignment • Uses the point matching algorithm from the computer vision domain

• Useful for 2D chromatography-mass spectrometry data

• Global peak alignment
• Useful for dense data, i.e., data rich in features

[51]

BiPACE 2D • Introduces a similarity measure for comparing peaks based on the closeness of their barycenters in 2D TIC images and the angle or inner product between the binned MS profiles

• All possible pairwise similarities between peaks in two (or more) 2D chromatograms are estimated to identify the best matching signals

• Useful for 2D chromatography-mass spectrometry data

• Suitable for chromatographic data with an increasing number of peaks

• Does not require a user-defined reference spectrum

[47]

Smith-Waterman peak alignment • Combines the Smith-Waterman local alignment with mass spectral similarity

• Eliminates the need for detection of landmark peaks and for retention time transformation

• Automated

[41]

Optimal peak alignment with mixture similarity measure

• Uses a mixture similarity employing peak distance and spectral similarity measures

• Does not require a pre-defined window to operate

[46]

Distance and spectrum correlation optimization alignment (DISCO)

• Landmark peaks of samples are mapped to landmark peaks of the reference using Euclidean distance to calculate similarity and the correlation coefficient

• Later, local interpolation is applied to non-landmark peaks to correct distortion

• Allows on-the-fly alignment, reducing memory usage

[45]

Global peak alignment with point matching algorithm

• Extracts feature peaks in the signal and then searches globally for the matching peaks

• Global alignment
• Suitable for homogeneous as well as heterogeneous data

[42]
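As an illustration of one of the simpler entries in Table 1, LSNV splits a signal into intervals and applies SNV to each interval separately. A minimal numpy sketch (using a toy synthetic spectrum, not real NIR data) could look like:

```python
import numpy as np

def snv(x):
    """Standard normal variate: centre a spectrum and scale to unit variance."""
    return (x - x.mean()) / x.std()

def local_snv(x, n_intervals):
    """LSNV sketch: split the spectrum into contiguous intervals and
    apply SNV to each interval separately."""
    parts = np.array_split(x, n_intervals)
    return np.concatenate([snv(p) for p in parts])

# Toy "spectrum" with a wavelength-dependent multiplicative effect
x = np.linspace(1.0, 2.0, 100) * np.sin(np.linspace(0.0, 6.0, 100))
corrected = local_snv(x, n_intervals=4)
```

After correction, every local interval has zero mean and unit variance, which is exactly the per-segment behaviour the method in Ref. [39] exploits.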


DoE-based selection of the optimal combination of preprocessing techniques, where the effect of different preprocessing techniques is explored by considering each family of methods as a factor whose levels are the individual techniques, has also gained attention [14,52]. In particular, the DoE approach presented in Ref. [14] considers preprocessing selection as a four-stage process involving the sequential application of baseline correction, scatter correction, noise removal and scaling. However, by doing so, it does not allow more than one preprocessing inside each category to be explored. Grid-search approaches, which explore all possible combinations of preprocessing techniques to decide on the best, have also been developed [17,53]. However, they are computationally expensive compared to the DoE-based approach, which performs a pre-selection of preprocessing techniques based on the design [14]. Recently, a faster genetic algorithm partial least squares (GA-PLS) approach was proposed for parallel comparison of the preprocessing techniques, in order to achieve an optimal selection [54]. The GA-PLS approach has the benefit that it can explore the preprocessings in parallel, but has the disadvantage, compared to the DoE-based approach presented in Ref. [14], that it does not allow sequential selection and optimization of the different preprocessing tasks, such as baseline correction followed by scatter correction, noise removal and scaling. Recently, a new process analytical technology (PAT) based soft sensor development, assessment and comparison (SS-DAC) framework was presented for optimal preprocessing selection [55]. To select the optimal preprocessing technique or combination, the SS-DAC framework performs an exhaustive pairwise statistical comparison to score the models corresponding to the different options tested. However, the main drawback of all these preprocessing selection and optimization approaches (Table 2) is that they all aim at the selection of techniques rather than at using the complementary information present in each preprocessing technique. In many cases, multiple preprocessing techniques contain complementary information which can be used for synergistic modelling [18,20,21].
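The grid-search idea can be sketched as follows. This is a minimal illustration on simulated data: the detrend/SNV functions, the family definitions and the plain least-squares model are all hypothetical stand-ins for the PLS models used in the cited works.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 30 "spectra" of 50 points, corrupted by a per-sample offset
# and a common linear baseline; y is taken from one clean channel
n, p = 30, 50
signal = rng.normal(size=(n, p))
y = signal[:, 10]
X = signal + 0.5 * rng.normal(size=(n, 1)) + np.linspace(0.0, 1.0, p)

def identity(X):
    return X

def snv(X):
    """Row-wise standard normal variate."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def detrend(X):
    """Subtract a straight-line baseline from each spectrum."""
    t = np.arange(X.shape[1])
    slope, intercept = np.polyfit(t, X.T, 1)
    return X - (np.outer(slope, t) + intercept[:, None])

# One family per correction task; the levels are the candidate techniques
families = [{"no_baseline": identity, "detrend": detrend},
            {"no_scatter": identity, "snv": snv}]

def cv_rmse(X, y, k=5):
    """k-fold cross-validated RMSE of a plain least-squares model."""
    err = []
    for test in np.array_split(np.arange(len(y)), k):
        train = np.setdiff1d(np.arange(len(y)), test)
        coef, *_ = np.linalg.lstsq(np.c_[X[train], np.ones(len(train))],
                                   y[train], rcond=None)
        err.extend((np.c_[X[test], np.ones(len(test))] @ coef - y[test]) ** 2)
    return float(np.sqrt(np.mean(err)))

# Exhaustive grid: every combination of one level per family
results = {}
for combo in itertools.product(*(f.items() for f in families)):
    Xp = X
    for _, step in combo:
        Xp = step(Xp)
    results["+".join(name for name, _ in combo)] = cv_rmse(Xp, y)

best = min(results, key=results.get)
```

The combinatorial cost is visible even here: two families with two levels each already give four models, and each added family multiplies the grid, which is why the DoE-based pre-selection in Ref. [14] is cheaper.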

6. Ensemble approaches to preprocessing fusion

Preprocessing techniques can enhance the quality of analytical signals by removing artefacts. However, due to the limitations of the different techniques and the complexity of the artefacts, a single preprocessing technique may often not remove the artefacts completely, whereas, if used in a complementary way, a combination of different preprocessings may be more effective. In the present section, three main approaches to ensemble preprocessing selection and fusion are presented.

6.1. Ensemble based on stacked regression

The stacked regression approach involves training multiple regression models with different preprocessings and then combining them with approaches such as cross validation or model averaging. Such a strategy was presented by Xu et al. [18], where PLS models built on differently preprocessed data are combined by Monte Carlo cross validation (MCCV) stacked regression. The results show that the fusion of the complementary information obtained by different preprocessing techniques often leads to a more stable and accurate calibration model. A summary of the stacked regression approach [18] is presented in Fig. 2. At first, some common preprocessing methods are used to transform the data. Secondly, optimized calibration models are built on the differently preprocessed data. The third step is to combine the

Table 2. A summary of spectral preprocessing selection and optimization.

Selection approach | Background idea | Key features | References

OPLS • Identifies the joint and unique variation between raw and preprocessed data to evaluate the preprocessing technique

• Allows exploration of individual preprocessing techniques

[15]

DoE-based approach • Considers different artefact correction techniques as different factors and uses Design of Experiments (D-optimal, full factorial) to search for the optimal combination

• Based on the design, the model performance of a few preprocessing methods and combinations is evaluated

[14,52,56]

Regularized MANOVA-based preprocessing optimization

• Assesses the goodness of preprocessing techniques by using the ratio of the between-group to within-group variance on the first canonical variate derived from regularized MANOVA

• Grid search is used to find the best preprocessing strategy

• Useful for preprocessing selection in the case of classification tasks

[53]

Soft sensor development, assessment and comparison (SS-DAC)

• Uses exhaustive statistical pairwise comparison of multiple models developed based on different preprocessings and their combinations

• Uses the standardized process analytical technology framework for soft sensor development

[55]

Grid search of all possible combinations

• Allows exploration of all preprocessing possibilities and combinations

• Defines the grid to select the best

• Python-based open-source toolbox available with a collection of preprocessing techniques

[17]

Faster GA-PLS based optimal preprocessing identification

• Faster genetic algorithm - partial least squares, modified to efficiently operate on multiple versions of the same data

• Can be used to explore preprocessing techniques in parallel

• Can be used for faster exploration of tall datasets

[54]


models to achieve the final model for ensemble calibration and prediction. The combination coefficient vector W is estimated by MCCV stacked regression, which deduces W by non-negative least squares (NNLS) according to the following equation:

Y = [ŷ1, ŷ2, ŷ3, …, ŷn] W   (1)

where Y contains the reference concentration values for the left-out samples during MCCV resampling and ŷi contains the corresponding concentration values estimated by the i-th model; n is the number of models and corresponds to the number of different preprocessings tested. The stacking approach is easy to use and can be implemented using standard chemometric toolboxes which offer PLS modelling. However, a drawback of the stacked approach is that it does not provide insight into what new information a specific preprocessing carries and to what extent it improves the overall performance. Also, there is a large number of models to be trained and optimized, which might become computationally expensive.
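The NNLS combination step of Eq. (1) can be sketched as follows. The sub-model predictions are simulated here with different noise levels (they stand in for the cross-validated PLS outputs of Ref. [18]), and SciPy's exact NNLS solver is assumed to be available.

```python
import numpy as np
from scipy.optimize import nnls  # exact active-set NNLS solver

rng = np.random.default_rng(1)

# y: reference values; columns of Yhat: predictions of three sub-models
# built on differently preprocessed data (simulated with different noise)
y = rng.uniform(10.0, 20.0, size=200)
Yhat = np.column_stack([y + rng.normal(0.0, s, size=200)
                        for s in (0.3, 0.5, 2.0)])

# Stacking weights W of Eq. (1) by non-negative least squares
w, _ = nnls(Yhat, y)
y_ensemble = Yhat @ w

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

best_single = min(rmse(Yhat[:, i], y) for i in range(Yhat.shape[1]))
```

Because every single sub-model corresponds to a feasible weight vector (all weight on one column), the NNLS ensemble can never fit the calibration data worse than the best individual sub-model.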

6.2. Ensemble based on design of experiments

The ensemble based on DoE [19] takes its inspiration from the DoE-based preprocessing selection and optimization approach [14]. Similarly to the preprocessing selection approach, the ensemble approach conceives preprocessing as a four-stage process. However, instead of selecting and optimizing the preprocessing as in the traditional DoE-based approach, the ensemble approach explores all the models corresponding to the individual preprocessing techniques and all their possible combinations according to a full factorial design. An example of the DoE-based ensemble approach is reported in Ref. [19], where PLS models based on numerous preprocessing methods and all their combinations obtainable by a full factorial design (sequentially applying baseline correction, scattering correction, smoothing and scaling) are calculated (Fig. 3). Finally, to perform the ensemble learning, the models which give better predictions than PLS on the raw data are selected and their predictions are averaged as the final prediction. A benefit of DoE-based approaches is that, if a full factorial design is selected, they allow exploration of all possible combinations of preprocessing techniques. Furthermore, compared to the stacked regression approach, where all the model outputs are combined at the end using the weight vector, the DoE-based approach only combines the models which perform better than the PLSR model on the raw data. However, similarly to the stacked regression approach, the DoE-based approach also has the disadvantage that it does not provide insight into how combining different techniques improves the model.
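The selection-then-averaging step of the DoE-based ensemble can be sketched as follows; the test-set predictions are simulated and the preprocessing names are hypothetical, standing in for the PLS models of a full factorial design as in Ref. [19].

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.uniform(size=100)

# Hypothetical test-set predictions from PLS models built on the raw data
# and on preprocessing combinations from a full factorial design
preds = {
    "raw": y + rng.normal(0.0, 0.30, size=100),
    "baseline": y + rng.normal(0.0, 0.20, size=100),
    "snv": y + rng.normal(0.0, 0.15, size=100),
    "baseline+snv": y + rng.normal(0.0, 0.10, size=100),
    "baseline+snv+savgol": y + rng.normal(0.0, 0.40, size=100),
}

rmse = {k: float(np.sqrt(np.mean((p - y) ** 2))) for k, p in preds.items()}

# Keep only the models that beat the raw-data model, then average them
selected = [k for k in preds if k != "raw" and rmse[k] < rmse["raw"]]
y_ensemble = np.mean([preds[k] for k in selected], axis=0)
```

Note how the raw-data model acts purely as a benchmark: combinations that do not improve on it are discarded before the averaging, in contrast to the stacking approach where every sub-model receives a (possibly zero) weight.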

6.3. Methods inspired from multi-block data analysis

Multi-block data analysis is emerging as a key tool for performing multi-sensor data integration and fusion [57,58]. An application of multi-block data analysis can be understood as the ensemble fusion of multiple preprocessing techniques. A recent article demonstrated the benefits of using the multi-block approach for spectral data preprocessing, where several pretreatments were combined using sequential and orthogonalized partial least squares (SO-PLS), thus leading to a boosting procedure [20,21]. The corresponding strategy was given the name SPORT, the acronym of sequential preprocessing through orthogonalization. The results showed that the multi-block approach not only allowed fusion of multiple preprocessings, but also helped in identifying the best preprocessing techniques and their combinations. A schema of the SPORT approach is presented in Fig. 4. The SPORT approach involves a series of PLS and orthogonalization steps to extract the unique complementary information from each differently preprocessed data block. As shown in Fig. 4, SPORT starts by calculating a PLS model between the first data block (1st preprocessing) and Y, giving the PLS scores for the first data block and a first (partial) prediction of the response. Successively, both the second data block (2nd preprocessing) and Y are orthogonalized with respect to the scores of the first regression. This removes from the second block of predictors any redundant information (i.e., any source of variation already modelled by the first block). A second PLS model is then built between the orthogonalized second block and the Y residuals of the first regression, giving the PLS scores for the second data block and an updated prediction of the response. The procedure continues until all the desired blocks (corresponding to the individual preprocessings) have been used for modelling. Finally, the PLS scores from each block are concatenated and used to predict the response through ordinary least squares regression. The main advantage of SPORT compared to the stacked regression and DoE-based ensemble approaches is that it allows exploring the unique information present in each data block (individual preprocessings) with the help of scores and loadings plots [20]. However, a key issue of the SPORT approach is that, being a sequential method, it requires the user to define the order of the different preprocessings to be tested. This is a fundamental aspect to be considered and, at the same time, it is not necessarily a drawback: it can represent a potential source of additional information, since models built on differently ordered blocks could highlight different aspects of the same data set. Indeed, if, on the one hand, the need to define the order of the different preprocessings to be tested could add a further meta-parameter to be optimized, it is also true that the literature on SO-PLS, which constitutes the modelling engine of SPORT, suggests that the predictive performances are hardly affected by the sequence of the blocks [21,58]. Therefore, where the focus is only on predictive accuracy, the challenge of defining the order of preprocessings can easily be dealt with, e.g., by placing all the faster, easier and model-free preprocessing techniques at the start and reserving the complex, time-consuming and model-based preprocessings for the end. On the other hand, the sequential procedure based on successive orthogonalization steps is such that a block will contribute to the model (with a non-zero number of latent variables) only if it carries unique information (i.e., information not already accounted for by the previously modelled matrices). This means that building different models on the same data set, just changing the order in which the individual matrices are presented, can provide deeper insight into the relationships between the various preprocessings, in terms of which information is shared among all the blocks and which is unique to specific pretreatments. Similarly, one could in principle identify the minimum number of pretreatments carrying non-redundant information. A recent application of the SPORT approach for NIR modelling of highly scattering materials showed that multiple scatter correction techniques carry complementary information and that the model accuracies improved when they were combined through SPORT [21].

Fig. 2. A summary of the stacking approach to ensemble of preprocessing techniques. At first, some standard preprocessing methods are used to transform the data. Secondly, optimized calibration models are built on differently preprocessed data. The third step is to combine the sub-models to obtain the final global model for ensemble calibration and prediction.
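A minimal numpy sketch of the SPORT sequence described above (single-block NIPALS-style PLS scores, orthogonalization between blocks, and a final OLS on the pooled scores) is given below. The data are simulated and the implementation is a simplified illustration, not the authors' MBA toolbox code.

```python
import numpy as np

def pls_scores(X, y, n_comp):
    """Scores of a single-block PLS model via NIPALS deflation (sketch)."""
    X, y = X.copy(), y.astype(float).copy()
    T = []
    for _ in range(n_comp):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        T.append(t)
        X -= np.outer(t, X.T @ t / (t @ t))   # deflate predictors
        y -= t * (t @ y) / (t @ t)            # deflate response
    return np.column_stack(T)

def deflate(M, T):
    """Remove from M the variation already spanned by the scores T."""
    return M - T @ np.linalg.lstsq(T, M, rcond=None)[0]

def sport_fit(blocks, y, n_comp=2):
    """SPORT sketch: sequential PLS with orthogonalization between the
    differently preprocessed blocks, then OLS on the pooled scores."""
    scores = []
    for X in blocks:
        if scores:
            T_prev = np.column_stack(scores)
            X = deflate(X, T_prev)            # strip redundant information
            y_res = y - T_prev @ np.linalg.lstsq(T_prev, y, rcond=None)[0]
        else:
            y_res = y
        scores.append(pls_scores(X, y_res, n_comp))
    T_all = np.c_[np.column_stack(scores), np.ones(len(y))]
    beta, *_ = np.linalg.lstsq(T_all, y, rcond=None)
    return T_all @ beta                       # fitted response

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 30))
y = X[:, :3].sum(axis=1)
blocks = [X, (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)]
y_fit = sport_fit(blocks, y)
```

Because each later block is orthogonalized against the accumulated scores, its own scores can only capture information not already modelled, which is what makes the per-block latent-variable counts interpretable.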

Fig. 3. A summary of the DoE-based ensemble approach. Several PLS models based on numerous preprocessing methods and their combinations (sequentially applying baseline correction, scattering correction, smoothing and scaling) are obtained by a full factorial design. Finally, to perform the ensemble learning, the models which give better predictions than PLS on the raw data are selected and their predictions are averaged as the final prediction.

Fig. 4. A summary of the SPORT approach to ensemble fusion of preprocessing techniques. The SPORT approach develops a PLS model between the first data block (1st preprocessing) and Y, giving the PLS scores for the first data block. The scores from the first data block are used to orthogonalize the second data block (2nd preprocessing) and Y. Then a new PLS model is built between the orthogonalized second block and the orthogonalized Y, giving the PLS scores for the second data block. This continues until scores are extracted from all the desired blocks (individual preprocessings and their combinations). The final model is obtained by relating the concatenated PLS scores from the different blocks to the response by means of ordinary least squares regression.



Fig. 5. Summary of SPORT preprocessing fusion for the olive fruit dataset. (A–D) Prediction performances of the PLS models built on data preprocessed by individual scatter correction techniques: (A) SNV, (B) VSN, (C) MSC, (D) RNV. (E) Optimal complexity of the SPORT model: only the VSN and MSC blocks contribute with a non-zero number of latent variables. (F) Prediction performances of the optimal SPORT model.

Fig. 6. Summary of SPORT preprocessing fusion for the beer classification dataset. (A–C) Prediction performances of the PLS-DA models built on data preprocessed by individual scatter correction techniques: (A) SNV (2 LVs), (B) MSC (2 LVs), (C) 2nd derivative (2 LVs); and (D) Prediction performances of the optimal SPORT model, fusing the information from SNV (2 LVs), MSC (8 LVs) and 2nd derivative (2 LVs) preprocessed data.




7. An example of sequential preprocessing through orthogonalization

7.1. Olive dry matter regression case

To give the reader a better understanding of the ensemble approach to spectral preprocessing fusion, a demonstration case involving the use of the sequential preprocessing through orthogonalization (SPORT) approach is provided. In particular, the example demonstrates the use of SPORT in the context of the characterization of fresh fruit by NIR spectroscopy, as the signals suffer from scattering effects due to the interaction of light with the physical structure of the fruit peel. The dataset used was related to the possibility of predicting dry matter (DM) in olive fruits with a handheld NIR sensor [59]. A total of 494 individual NIR spectra and corresponding DM measurements were used for the modelling. The dataset was divided into a calibration set (70%) and a test set (30%) using the Kennard-Stone algorithm [60]. Four different scatter correction techniques, i.e. SNV (Block 1), VSN (Block 2), MSC (Block 3) and RNV (Block 4), were used in sequential order. SPORT was implemented as explained in Ref. [20] with the freely available multi-block analysis (MBA) toolbox [61]. The results of SPORT on the olive fruit dataset are shown in Fig. 5. Fig. 5A–D shows the results of standard PLSR calibration for SNV, VSN, MSC and RNV preprocessed data, respectively. Fig. 5E shows the number of complementary LVs (4 from the VSN preprocessed data and 5 from the MSC preprocessed data) selected by the SPORT approach. Fig. 5F shows the results of the SPORT model fusing the information from the VSN and MSC preprocessed data. The SPORT fusion reduced the prediction error by 14.5% and increased the prediction R2 by 3.3% compared to the best performing model calculated on individually preprocessed data. In summary, the fusion of multiple scatter correction techniques improved the model performance.
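The Kennard-Stone split used above can be sketched as follows; this is a minimal implementation of the algorithm of Ref. [60] on toy data, not the split actually used on the 494 olive spectra.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone selection sketch: start from the two most distant
    samples, then repeatedly add the sample whose minimum distance to the
    already-selected set is largest."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    selected = list(np.unravel_index(np.argmax(D), D.shape))
    while len(selected) < n_select:
        remaining = [i for i in range(len(X)) if i not in selected]
        d_min = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(d_min))])
    return np.array(selected)

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 5))
cal_idx = kennard_stone(X, 14)  # ~70 % calibration, as in the olive example
```

By construction the calibration set spans the extremes of the measured space, which is why the method is popular for calibration/test splitting of spectral datasets.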

7.2. Beer classification case

The sequential preprocessing fusion approach SPORT can also be implemented in the context of classification problems. In Fig. 6, an example of Italian craft beer classification using NIR spectroscopy data is presented. The data used in this example are a subset of those used in previous research to authenticate an Italian craft beer and differentiate it from other similar products [62]. Accordingly, the classification problem involves two categories, Reale beer (the craft beer of interest) and other beers. The calibration set contains a total of 40 samples and the test set consists of 20 samples. The spectral range of the data is 4000–10000 cm−1 with a nominal resolution of 4 cm−1 [62]. At first, three separate PLS-DA models, each based on data individually pretreated with either SNV, MSC or 2nd derivative, were developed and tested. Successively, the data pretreated with the three different preprocessings were jointly used for sequential multi-block fusion by the SPORT approach. The sequential modelling was performed using the SO-PLS-LDA option in the freely available multi-block analysis (MBA) toolbox [61]. It can be noted that the data pretreated with any of the individual preprocessing techniques (Fig. 6) resulted in a classification accuracy of 75%. However, with the sequential fusion approach, the classification accuracy on the same data was improved to 85%, thus indicating the benefit of combining the useful information from different preprocessings for enhanced model performance. Some other recent applications of SPORT can be found in Refs. [21,63].

8. Concluding remarks

The sub-optimal selection of preprocessing has long been a problem in chemometrics and that is why, in recent years, several approaches for a more rational selection of the best preprocessings have been proposed. However, all the preprocessing selection strategies have the disadvantage that they focus on selecting the preprocessing rather than on exploring the complementary information present in differently preprocessed data for synergistic modelling. In this study, three approaches to ensemble preprocessing fusion were identified: stacked regression, full factorial DoE-based ensembles, and sequential multi-block data analysis treating differently preprocessed data as separate blocks. Of the three approaches, the stacking and DoE-based ones are based on weighting or averaging the final outputs using either the weight vector or mean estimation, respectively. Both these approaches lack insight into what new information a given preprocessing technique carries. On the other hand, the multi-block data analysis-based approach (SPORT) has the main advantage that it extracts the unique information from differently preprocessed data. Furthermore, since it is based on the sequential calculation of PLS regression models, the sets of scores and loadings can be used to explore and interpret what incremental (new) information is brought by the different blocks, i.e. by the different preprocessing techniques applied to the same data. Moreover, the elimination of a block not contributing to the final model with any latent variables directly shows that the application of the corresponding preprocessing technique does not add any new information: accordingly, some techniques can be discarded, leading to a preprocessing selection. Therefore, it can be concluded that, to gain a better understanding of the effect of each individual preprocessing in conjunction with model improvement, multi-block analysis-based preprocessing fusion is superior to model averaging-based methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] A. Biancolillo, F. Marini, Chemometric methods for spectroscopy-based pharmaceutical analysis, Front. Chem. 6 (2018) 576.

[2] E. Gorrochategui, et al., Data analysis strategies for targeted and untargeted LC-MS metabolomic studies: overview and workflow, Trac. Trends Anal. Chem. 82 (2016) 425–442.

[3] H.I. Kademi, B.H. Ulusoy, C. Hecer, Applications of miniaturized and portable near infrared spectroscopy (NIRS) for inspection and control of meat and meat products, Food Rev. Int. 35 (3) (2019) 201–220.

[4] R. Deidda, et al., Vibrational spectroscopy in analysis of pharmaceuticals: critical review of innovative portable and handheld NIR and Raman spectrophotometers, Trac. Trends Anal. Chem. 114 (2019) 251–259.

[5] P. Lucci, J. Saurina, O. Nunez, Trends in LC-MS and LC-HRMS analysis and characterization of polyphenols in food, Trac. Trends Anal. Chem. 88 (2017) 1–24.

[6] M. Zoccali, P.Q. Tranchida, L. Mondello, Fast gas chromatography-mass spectrometry: a review of the last decade, Trac. Trends Anal. Chem. 118 (2019) 444–452.

[7] J. Engel, et al., Breaking with trends in pre-processing? Trac. Trends Anal. Chem. 50 (2013) 96–106.

[8] J.-M. Roger, et al., Pre-processing methods, in: S.D. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics. Chemical and Biochemical Data Analysis, second ed., Elsevier, Oxford, 2020, pp. 1–75.

[9] R. Lu, et al., Measurement of optical properties of fruits and vegetables: a review, Postharvest Biol. Technol. 159 (2020) 111003.

[10] Å. Rinnan, F.v.d. Berg, S.B. Engelsen, Review of the most common pre-processing techniques for near-infrared spectra, Trac. Trends Anal. Chem. 28 (10) (2009) 1201–1222.

[11] W. Saeys, et al., Multivariate calibration of spectroscopic sensors for postharvest quality evaluation: a review, Postharvest Biol. Technol. (2019) 158.

[12] L.C. Lee, C.Y. Liong, A.A. Jemain, A contemporary review on data preprocessing (DP) practice strategy in ATR-FTIR spectrum, Chemometr. Intell. Lab. Syst. 163 (2017) 64–75.

[13] P. Oliveri, et al., The impact of signal pre-processing on the final interpretation of analytical outcomes – a tutorial, Anal. Chim. Acta 1058 (2019) 9–17.

[14] J. Gerretzen, et al., Simple and effective way for data preprocessing selection based on design of experiments, Anal. Chem. 87 (24) (2015) 12096–12103.

[15] J. Gabrielsson, et al., OPLS methodology for analysis of pre-processing effects on spectroscopic data, Chemometr. Intell. Lab. Syst. 84 (1–2) (2006) 153–158.

[16] S. Verboven, M. Hubert, P. Goos, Robust preprocessing and model selection for spectral data, J. Chemometr. 26 (6) (2012) 282–289.

[17] J. Torniainen, et al., Open-source Python module for automated preprocessing of near infrared spectroscopic data, Anal. Chim. Acta 1108 (2020) 1–9.

[18] L. Xu, et al., Ensemble preprocessing of near-infrared (NIR) spectra for multivariate calibration, Anal. Chim. Acta 616 (2) (2008) 138–143.

[19] X. Bian, et al., A selective ensemble preprocessing strategy for near-infrared spectral quantitative analysis of complex samples, Chemometr. Intell. Lab. Syst. 197 (2020) 103916.

[20] J.-M. Roger, A. Biancolillo, F. Marini, Sequential preprocessing through ORThogonalization (SPORT) and its application to near infrared spectroscopy, Chemometr. Intell. Lab. Syst. 199 (2020) 103975.

[21] P. Mishra, et al., SPORT pre-processing can improve near-infrared quality prediction models for fresh fruits and agro-materials, Postharvest Biol. Technol. 168 (2020) 111271.

[22] A. Folch-Fortuny, F. Arteaga, A. Ferrer, Missing data imputation toolbox for MATLAB, Chemometr. Intell. Lab. Syst. 154 (2016) 93–100.

[23] F. Arteaga, A. Folch-Fortuny, A. Ferrer, 2.29 - Missing data, in: S. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics, second ed., Elsevier, Oxford, 2020, pp. 615–639.

[24] J. Trygg, J. Gabrielsson, T. Lundstedt, 3.06 - Background estimation, denoising, and preprocessing, in: S. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics, second ed., Elsevier, Oxford, 2020, pp. 137–141.

[25] P. Filzmoser, B. Walczak, What can go wrong at the data normalization step for identification of biomarkers? J. Chromatogr. A 1362 (2014) 194–205.

[26] R.H. Jellema, A. Folch-Fortuny, M.M.W.B. Hendriks, 3.05 - Variable shift and alignment, in: S. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics, second ed., Elsevier, Oxford, 2020, pp. 115–136.

[27] D.F. Thekkudan, S.C. Rutan, 3.07 - Denoising and signal-to-noise ratio enhancement: classical filtering, in: S. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics, second ed., Elsevier, Oxford, 2020, pp. 143–155.

[28] V.M. Taavitsainen, 3.09 - Denoising and signal-to-noise ratio enhancement: splines, in: S. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics, second ed., Elsevier, Oxford, 2009, pp. 165–177.

[29] P. Mishra, et al., Automatic de-noising of close-range hyperspectral images with a wavelength-specific shearlet-based image noise reduction method, Sensor. Actuator. B Chem. 281 (2019) 1034–1044.

[30] A. Kohler, et al., 3.03 - Model-based pre-processing in vibrational spectroscopy, in: S. Brown, R. Tauler, B. Walczak (Editors), Comprehensive Chemometrics, second ed., Elsevier, Oxford, 2020, pp. 83–100.

[31] A. Folch-Fortuny, F. Arteaga, A. Ferrer, PCA model building with missing data: new proposals and a comparative study, Chemometr. Intell. Lab. Syst. 146 (2015) 77–88.

[32] C. Zhang, et al., Noise reduction in the spectral domain of hyperspectral images using denoising autoencoder methods, Chemometr. Intell. Lab. Syst. 203 (2020) 104063.

[33] Z.M. Zhang, et al., An intelligent background-correction algorithm for highly fluorescent samples in Raman spectroscopy, J. Raman Spectrosc. 41 (6) (2010) 659–669.

[34] P.J. Cadusch, et al., Improved methods for fluorescence background subtraction from Raman spectra, J. Raman Spectrosc. 44 (11) (2013) 1587–1595.

[35] Z.M. Zhang, S. Chen, Y.Z. Liang, Baseline correction using adaptive iteratively reweighted penalized least squares, Analyst 135 (5) (2010) 1138–1146.

[36] J.T. Peng, et al., Asymmetric least squares for multiple spectra baseline correction, Anal. Chim. Acta 683 (1) (2010) 63–68.

[37] J. Skogholt, K.H. Liland, U.G. Indahl, Baseline and interferent correction by the Tikhonov regularization framework for linear least squares modeling, J. Chemometr. 32 (3) (2018).

[38] G. Rabatel, et al., VSN: variable sorting for normalization, J. Chemometr. 34 (2) (2020) e3164.

[39] Y. Bi, et al., A local pre-processing method for near-infrared spectra, combined with spectral segmentation and standard normal variate transformation, Anal. Chim. Acta 909 (2016) 30–40.

[40] L. Li, et al., A new scattering correction method of different spectroscopic analysis for assessing complex mixtures, Anal. Chim. Acta 1087 (2019) 20–28.

[41] S. Kim, et al., Smith-Waterman peak alignment for comprehensive two-dimensional gas chromatography-mass spectrometry, BMC Bioinf. 12 (2011).

[42] B.C. Deng, et al., Global peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using point matching algorithms, J. Bioinf. Comput. Biol. 14 (6) (2016).

[43] C. Tistaert, Y. Vander Heyden, Bilinear decomposition based alignment of chromatographic profiles, Anal. Chem. 84 (13) (2012) 5653–5660.

[44] H. Parastar, N. Akvan, Multivariate curve resolution based chromatographic peak alignment combined with parallel factor analysis to exploit second-order advantage in complex chromatographic measurements, Anal. Chim. Acta 816 (2014) 18–27.

[45] B. Wang, et al., DISCO: distance and spectrum correlation optimization alignment for two-dimensional gas chromatography time-of-flight mass spectrometry-based metabolomics, Anal. Chem. 82 (12) (2010) 5069–5081.

[46] S. Kim, et al., An optimal peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using mixture similarity measure, Bioinformatics 27 (12) (2011) 1660–1666.

[47] N. Hoffmann, et al., BiPACE 2D - graph-based multiple alignment for comprehensive 2D gas chromatography-mass spectrometry, Bioinformatics 30 (7) (2014) 988–995.

[48] A. Folch-Fortuny, F. Arteaga, A. Ferrer, Assessment of maximum likelihood PCA missing data imputation, J. Chemometr. 30 (7) (2016) 386–393.

[49] J.Y. Dong, et al., Group aggregating normalization method for the preprocessing of NMR-based metabolomic data, Chemometr. Intell. Lab. Syst. 108 (2) (2011) 123–132.

[50] Q.X. Zheng, et al., Automatic time-shift alignment method for chromatographic data analysis, Sci. Rep. 7 (2017).

[51] Z.Y. Li, et al., Coherent point drift peak alignment algorithms using distance and similarity measures for two-dimensional gas chromatography mass spectrometry data, J. Chemometr. 34 (8) (2020).

[52] H. Zheng, et al., Optimal preprocessing of serum and urine metabolomic data fusion for staging prostate cancer through design of experiment, Anal. Chim. Acta 991 (2017) 68–75.

[53] A. Martyna, et al., Improving discrimination of Raman spectra by optimising preprocessing strategies on the basis of the ability to refine the relationship between variance components, Chemometr. Intell. Lab. Syst. 202 (2020) 104029.

[54] P. Stefansson, et al., Fast method for GA-PLS with simultaneous feature selection and identification of optimal preprocessing technique for datasets with many observations, J. Chemometr. 34 (3) (2020).

[55] T.J. Rato, M.S. Reis, SS-DAC: a systematic framework for selecting the best modeling approach and pre-processing for spectroscopic data, Comput. Chem. Eng. 128 (2019) 437–449.

[56] J. Gerretzen, et al., Boosting model performance and interpretation by entangling preprocessing selection and variable selection, Anal. Chim. Acta 938 (2016) 44–52.

[57] A.K. Smilde, et al., Common and distinct components in data fusion, J. Chemometr. 31 (7) (2017) e2900.

[58] A. Biancolillo, T. Næs, The sequential and orthogonalized PLS regression for multiblock regression: theory, examples, and extensions, in: M. Cocchi (Editor), Data Fusion Methodology and Applications, Data Handling in Science and Technology, vol. 31, Elsevier, Amsterdam, 2019, pp. 157–177.

[59] X. Sun, et al., NIRS prediction of dry matter content of single olive fruit with consideration of variable sorting for normalisation pre-treatment, Postharvest Biol. Technol. 163 (2020) 111140.

[60] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 11 (1) (1969) 137–148.

[61] P. Mishra, et al., MBA-GUI: a chemometric graphical user interface for multi-block data visualisation, regression, classification, variable selection and automated pre-processing, Chemometr. Intell. Lab. Syst. (2020) 104139.

[62] A. Biancolillo, et al., Data-fusion for multiplatform characterization of an Italian craft beer aimed at its authentication, Anal. Chim. Acta 820 (2014) 23–31.

[63] Puneet Mishra, et al., Improved prediction of fuel properties with near-infraredspectroscopy using a complementary sequential fusion of scatter correctiontechniques, Talanta (2020), 121693. https://doi.org/10.1016/j.talanta.2020.121693. In press.