
Chemometrics and Intelligent Laboratory Systems 81 (2006) 180–187
www.elsevier.com/locate/chemolab

Application of the replacement method as a novel variable selection strategy in QSAR. 1. Carcinogenic potential

Aliuska Helguera Morales a,b, Pablo R. Duchowicz c, Miguel Ángel Cabrera Pérez b, Eduardo A. Castro c, Maria Natália Dias Soeiro Cordeiro d, Maykel Pérez González b,e,⁎

a Department of Chemistry, Faculty of Chemistry and Pharmacy, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
b Department of Drug Design, Chemical Bioactive Center, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
c INIFTA, División Química Teórica, Departamento de Química, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, Diag. 113 y 64, Suc. 4, C.C. 16, (1900) La Plata, Argentina
d REQUIMTE, Departamento de Química, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 687, 4169-007 Porto, Portugal
e Unit of Service, Department of Drug Design, Experimental Sugar Cane Station Villa Clara-Cienfuegos, Ranchuelo, Villa Clara, 53100, Cuba

Received 8 September 2005; received in revised form 9 November 2005; accepted 6 December 2005
Available online 27 January 2006

Abstract

The carcinogenic potency (TD50) of a set of 62 nitrosocompounds is predicted by applying QSAR theory. More than a thousand molecular descriptors obtained from the DRAGON 2.1 software are used to model this toxicological property, as measured in bioassays on female rats with water as the route of administration. Three different methods of variable selection are used to build the regression model, namely Forward Stepwise Regression, Genetic Algorithms, and an alternative to the Elimination Method: the Replacement Method. The Replacement Method is used, for the first time, to predict carcinogenic potency, and it achieves the best results. The best model obtained has seven variables and is able to explain 84.3% of the experimental variance after removing 6 chemicals that are considered outliers.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Nitrosocompounds; Carcinogenic potency; Variable selection; Replacement Method; LOO and LGO cross-validation; Orthogonalization

1. Introduction

It has long been known that exposure to certain chemicals is associated with the development of cancers. Such chemicals are defined as carcinogens and are divided into two major categories: non-genotoxic or epigenetic (non-DNA-reactive) and genotoxic (DNA-reactive), depending on their chemical and biological properties [1].

The non-genotoxic carcinogens are able to act without causing DNA damage; for instance, they can induce uncontrolled cell proliferation by altering inter-cellular communication [1]. In contrast, the genotoxic agents are those chemicals which can interact with DNA and thus produce alterations in the genetic material of the host; they can be further subdivided into direct-acting genotoxins and indirect-acting genotoxins [2]. Direct genotoxic agents are intrinsically reactive to DNA. Indirect genotoxic agents, on the other hand, require metabolic activation by cellular enzymes to form the DNA-reactive metabolites. A good example is the nitrosocompounds, especially the nitrosamines, which require metabolic activation to exert their carcinogenic effects [3,4].

Of the 300 nitrosocompounds that have been tested for carcinogenicity, more than 90% show carcinogenic potential [5]. The human population can be exposed to preformed nitrosocompounds, or these may be synthesized endogenously from precursors and nitrosating agents, principally in the gastro-intestinal tract, thus representing a potential risk to human health. In addition, carcinogenicity assays are lengthy, costly (typically 2.5 million dollars), and involve the use of many animals subjected to adverse welfare conditions [6]. The rodent bioassay is therefore inappropriate for screening a large number of chemicals, which justifies the search for alternative methods.

⁎ Corresponding author. Tel.: +53 42 281473, +53 42 281192; fax: +53 42 281130.
E-mail address: [email protected] (M.P. González).
0169-7439/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2005.12.002

Table 1
The observed, predicted and residual values of the 62 aromatic nitrosocompounds used to derive the QSTR models

No.  Name  TD50  Log(TD50) Obs.  Log(TD50) Pred.  Residual
1  N-nitrosobis(2,2,2-trifluoroethyl)amine  4.980  −0.697  −0.914  0.217
2  Nitrosoamylurethane  0.336  0.474  0.193  0.281
3  1-Amyl-1-nitrosourea  0.967  0.015  0.122  −0.107
4  N-n-Butyl-N-nitrosourea  1.020  −0.009  0.387  −0.395
5  Carboxymethylnitrosourea  1.380  −0.140  −0.467  0.327
6  1,3-Dibutyl-1-nitrosourea  2.230  −0.348  0.246  −0.594
7  N-Nitrosodiethylamine  0.008  2.104  2.037  0.067
8  N-Nitroso-2,3-dihydroxypropyl-2-hydroxypropylamine  0.054  1.272  1.394  −0.123
9  Diallylnitrosamine  32.100  −1.507  −0.388  −1.118
10  N-Nitrosodimethylamine  0.042  1.373  1.121  0.252
11  1-Nitroso-3,5-dimethyl-4-benzoylpiperazine  9.100  −0.959  −0.419  −0.540
12  Dinitrosohomopiperazine  0.030  1.527  1.629  −0.102
13  Ethylnitrosocyanamide  2.910  −0.464  −0.439  −0.025
14  1-Ethylnitroso-3-(2-hydroxyethyl)-urea  0.714  0.146  −0.169  0.316
15  1-Ethyl-1-nitrosourea  0.069  1.163  0.874  0.289
16  Nitrosoethylurethane  0.055  1.257  0.477  0.781
17  1-Ethylnitroso-3-(2-oxopropyl)-urea  0.605  0.218  −0.122  0.340
18  1-(2-Hydroxyethyl)-nitroso-3-chloroethylurea  0.669  0.175  0.211  −0.036
19  1-(2-Hydroxyethyl)-1-nitrosourea  0.320  0.495  0.423  0.072
20  1-(2-Hydroxyethyl)-nitroso-3-ethylurea  0.501  0.300  0.045  0.255
21  N-Nitrosoallyl-2-hydroxypropylamine  0.877  0.057  −0.098  0.155
22  1-Nitrosohydantoin  19.600  −1.292  −0.986  −0.306
23  N-Nitroso-N-isobutylurea  4.700  −0.672  0.415  −1.087
24  N-Nitroso-(2-hydroxypropyl)-(2-hydroxyethyl)amine  1.020  −0.009  0.483  −0.492
25  Nitrosoethylmethylamine  0.039  1.413  1.032  0.381
26  N-Methyl-N-nitrosobenzamide  11.200  −1.049  −0.747  −0.302
27  Methylnitrosocyanamide  0.480  0.319  0.621  −0.302
28  Nitrosoanabasine  12.900  −1.111  −0.793  −0.318
29  N-Nitrosoallyl-2,3-dihydroxypropylamine  0.825  0.084  −0.592  0.676
30  N-Nitrosoallylethanolamine  0.491  0.309  −0.210  0.519
31  N-Nitrosoallyl-2-oxopropylamine  0.335  0.475  −0.122  0.597
32  N-Nitrosocimetidine  3.600  −0.556  −0.336  −0.220
33  N-Nitroso-2,3-dihydroxypropylethanolamine  5.980  −0.777  −0.696  −0.081
34  3,6-Dihydro-2-nitroso-2H-1,2-oxazine  90.600  −1.957  −0.674  −1.283
35  1-Nitroso-5,6-dihydrouracil  0.104  0.983  1.115  −0.132
36  1-Nitroso-5,6-dihydrothymine  31.200  −1.494  −1.313  −0.182
37  Nitroso-1,2,3,6-tetrahydropyridine  0.037  1.427  0.468  0.959
38  N-Nitrosobis(2-hydroxypropyl)amine  0.813  0.090  −0.623  0.713
39  Nitrosohydroxyproline  3.840  −0.584  −1.125  0.541
40  N-Nitroso-2-hydroxymorpholine  14.900  −1.173  −0.424  −0.749
41  Nitrosoiminodiacetic acid  63.900  −1.806  −1.889  0.084
42  Nitrosomethylaniline  0.034  1.465  1.152  0.313
43  N-Nitrosomethyl-2-hydroxypropylamine  0.049  1.313  0.904  0.409
44  N-Nitroso-3-hydroxypyrrolidine  7.650  −0.884  −0.844  −0.040
45  N-Methyl-N′-nitro-N-nitrosoguanidine  0.178  0.750  0.785  −0.035
46  Nitroso-2-oxopropylethanolamine  1.800  −0.255  −0.113  −0.143
47  N-Nitrosobis(2-oxopropyl)amine  0.232  0.635  0.802  −0.168
48  Nitrosopipecolic acid  4.990  −0.698  −1.375  0.676
49  Nitrosoproline  6.690  −0.825  −0.926  0.100
50  Nitroso-2,3-dihydroxypropyl-2-oxopropylamine  0.035  1.453  0.891  0.563
51  N-Nitrosomethyl-2,3-dihydroxypropylamine  0.646  0.190  0.508  −0.318
52  N-Nitrosopyrrolidine  0.439  0.358  −0.350  0.708
53  N-Nitrosopiperazine  5.510  −0.741  0.258  −0.999
54  N-Nitrosopiperidine  0.963  0.016  0.159  −0.142
55  N-Nitrosomorpholine  0.081  1.094  0.337  0.757
56  N-Nitrosodiethanolamine  1.900  −0.279  0.729  −1.008
57  N-nitrosothiomorpholine  7.380  −0.868  −1.364  0.496
58  1-Nitroso-3,4,5-trimethylpiperazine  0.151  0.821  0.676  0.145
59  1-(2-oxopropyl)nitroso-3-(2-chloroethyl)urea  4.490  −0.652  −0.798  0.146
60  R(−)-2-Methyl-N-nitrosopiperidine  20.400  −1.310  −0.838  −0.471
61  S(+)-2-Methyl-N-nitrosopiperidine  13.200  −1.121  −0.998  −0.123
62  Tetrahydro-2-nitroso-2H-1,2-oxazine  17.600  −1.246  −1.052  −0.193


Table 2
Summary of the dimension and number of each kind of descriptor used in this paper, and their principal authors

Dimension  Kind of descriptors  Number of descriptors  Principal authors
0D  Constitutionals  47  Todeschini, R. and Consonni, V.
2D  Topologicals  262  Boncher, D. and Rouvray, D.H.
2D  Molecular walk count  21  Rücker, G. and Rücker, C.
2D  BCUTs  64  Burden, F.R.
2D  Galvez topological charge indexes  21  Galvez, J., García, R., Salabert, M.T. and Soler, R.
2D  Various autocorrelations from the molecular graph  96  Moreau, G. and Broto, P.
3D  Randić molecular profile indexes  41  Randić, M.
3D  Geometrical  58  Randić, M.
3D  RDF  150  Hemmer, M.C., Steinhauer, V. and Gasteiger, J.
3D  3D-MoRSE  160  Schuur, J. and Gasteiger, J.
3D  WHIM  99  Todeschini, R., Lasagni, M. and Marengo, E.
3D  GETAWAY  197  Consonni, V. and Todeschini, R.


A potentially useful alternative technique is based on Quantitative Structure–Activity Relationship (QSAR) studies, which involve the generation of mathematical models that link the physicochemical and structural properties of chemicals with their biological effects. QSAR studies applied to toxicology have become very important for reducing the number of experimental animals and for the virtual screening of potentially toxic compounds before their synthesis; they also help to give priority to the evaluation of potentially safe compounds [7,8].

The most informative QSAR analyses are those performed on individual classes of congeneric chemicals, that is to say, chemicals that share a basically similar structure, act by the same mechanism of action, and share the same rate-limiting mechanistic step. To attain the best modelling results, the chemicals in the set should induce the same well-defined biological effect. This is especially applicable in the case of rodent carcinogenicity, whose biological data are difficult to model, possibly because of some inherent sources of error and variability [9].

On the other hand, QSAR models have traditionally been developed by selecting, a priori, an analytical model (typically linear, polynomial, or log-linear) to quantify the correlation between selected molecular indices and carcinogenesis, followed by regression analysis to determine the model parameters. Although these approaches have proved useful in some cases, they have a number of limitations. In fact, the quantitative relationships between structure and carcinogenic potency can be rather complex and highly nonlinear, so that determining the optimal analytical form of the QSAR model is a real challenge. Moreover, regression analysis becomes complex and less reliable as the number of descriptors increases.

For those reasons, the principal aim of the present paper is to obtain a good linear regression equation for predicting the carcinogenic potency of nitrosocompounds in female rats using water as the administration route. To this end, different kinds of 0D, 1D, 2D and 3D molecular descriptors are calculated using the DRAGON computer software, and three statistical methods for variable selection are used: Forward Stepwise (FS), Genetic Algorithm (GA) and a recently developed strategy known as the Replacement Method (RM).

2. Materials and methods

2.1. Biological dataset

A data set of 62 nitrosocompounds was collected from the Carcinogenic Potency Database (CPDB) established by Gold and Zeiger [10], available at http://potency.berkeley.edu/cpdb.html. The CPDB is a single standardized resource of information from many chronic, long-term bioassays. It contains a large diversity of chemical structures (more than 1300 substances), and includes tumour data reproduced from all of the NCI/NTP rodent bioassay Technical Reports (CPDB-NCI/NTP), as well as additional data extracted from over 1200 literature sources (CPDB-Lit) subjected to extensive review [11]. In this paper we used the reports of CPDB-Lit, with the intention of comparing these results with those of CPDB-NCI/NTP in future papers. The data for female rats with water as the route of administration were used, since the rat is known to be the species that offers the most reproducible results in the carcinogenicity bioassay [12].

The endpoint selected for this study was the carcinogenic potency (TD50). The TD50 value for a given target site(s), in the absence of tumors in control animals, was taken to be the chronic dose (in mg/kg body weight/day) which induced tumors in half of the test animals [13]. Moreover, a strong correlation has been demonstrated between carcinogenic potency estimated from epidemiologic data and that estimated from animal carcinogenesis bioassays, and consequently also between human and rodent carcinogenic potency [14,15]. The similar ranking of carcinogenic potency in different species suggests that potency is an intrinsic property of a chemical carcinogen that derives to a great extent from its chemical reactivity. The names of the compounds, the calculated and experimental values of log TD50, as well as the residual values, are shown in Table 1.
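The relation between the TD50 and logarithmic columns of Table 1 can be illustrated with a short sketch. It assumes, from inspection of the tabulated values and of the left-hand side of the regression equations, that the observed value is −log10(TD50) and that the residual is observed minus predicted; the function names are illustrative only.

```python
import math


def neg_log_td50(td50_mg_per_kg_day: float) -> float:
    """Return -log10(TD50); a larger value means a more potent carcinogen."""
    return -math.log10(td50_mg_per_kg_day)


def residual(observed: float, predicted: float) -> float:
    """Residual taken as observed minus predicted (assumed convention)."""
    return observed - predicted


# Compound 1 of Table 1: TD50 = 4.980, observed = -0.697, predicted = -0.914.
obs_1 = neg_log_td50(4.980)
res_1 = residual(-0.697, -0.914)
print(round(obs_1, 3), round(res_1, 3))
```

For compound 1 this reproduces the tabulated observed value and residual to three decimals.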

3. Molecular descriptors

The molecular descriptors for the given compounds are calculated using the DRAGON software [16], from the (x,y,z) atomic coordinates of the minimum-energy conformations determined by the AM1 method using the MOPAC 6.0 package [17]. A total of 1216 molecular descriptors of different types are calculated to describe the structural diversity of the compounds. The descriptor typology is summarized in Table 2.

The list of these molecular descriptors, and their meaning, is provided with literature references by the DRAGON package; the calculation procedure is explained in detail, with related literature references, in the Handbook of Molecular Descriptors [18]. However, we provide in Table 3 the meaning of the descriptors used in the models of the present study.

Table 3
Symbols for the QSTR descriptors and their definitions

Symbols  Descriptor definition
ATS1m  Broto–Moreau autocorrelation of a topological structure - lag 1 / weighted by atomic masses
ATS6m  Broto–Moreau autocorrelation of a topological structure - lag 6 / weighted by atomic masses
ATS8e  Broto–Moreau autocorrelation of a topological structure - lag 8 / weighted by atomic Sanderson electronegativities
BELm7  Lowest eigenvalue n. 7 of Burden matrix / weighted by atomic masses
GGI7  Topological charge index of order 7
MATS2v  Moran autocorrelation - lag 2 / weighted by atomic van der Waals volumes
MATS8p  Moran autocorrelation - lag 8 / weighted by atomic polarizabilities
Mor28m  3D-MoRSE - signal 28 / weighted by atomic masses
Mor28v  3D-MoRSE - signal 28 / weighted by atomic van der Waals volumes
PW5  path/walk 5 - Randic shape index
R+2m  R maximal autocorrelation of lag 2 / weighted by atomic masses
R+3u  R maximal autocorrelation of lag 3 / unweighted
RDF045u  Radial distribution function - 4.5 / unweighted
RDF050m  Radial distribution function - 5.0 / weighted by atomic masses

Descriptors with constant or near-constant values inside each group were discarded. The remaining molecular descriptors are used for building QSAR models by Multiple Regression Analysis (MRA), as implemented in the STATISTICA software (version 6.0) [19].
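The pre-filtering step described above can be sketched as follows. This is only an illustration: the paper does not state the threshold used to decide that a descriptor is "near constant", so the `min_std` cutoff, the function name and the toy data are assumptions.

```python
import numpy as np


def drop_near_constant(X, names, min_std=1e-6):
    """Keep only descriptor columns whose standard deviation exceeds min_std.

    min_std is a hypothetical cutoff; the source does not give one.
    """
    X = np.asarray(X, dtype=float)
    keep = X.std(axis=0) > min_std
    kept_names = [name for name, flag in zip(names, keep) if flag]
    return X[:, keep], kept_names


# Toy descriptor matrix: the second column is constant and is discarded.
X_demo = np.array([[1.0, 5.0, 0.2],
                   [2.0, 5.0, 0.4],
                   [3.0, 5.0, 0.1]])
X_kept, kept_names = drop_near_constant(X_demo, ["d1", "d2", "d3"])
print(kept_names)
```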

4. Algorithms for the search of an optimal set of descriptors

There are several methods available for searching for an optimal set of descriptors [20–23]; a well-known one that has long been used in QSAR-QSPR applications is the Forward Stepwise Regression Method [24]. Genetic and Evolutionary Algorithms have also been employed [25–27], as has the Elimination Method (EM) which, in spite of its remarkable simplicity, yields results in correspondence with the optimal search of descriptors [28,29].

The above-mentioned methods of variable selection are applied in this paper. Forward Stepwise (FS) is a "step by step" procedure. The selection process begins without any independent variable in the regression equation. In each step, it introduces the variable that presents the highest correlation with the property, but at the same time it re-examines the significance of the variables included previously in the regression model. If one of these variables has lost its significance, it is removed. The process stops when no independent variable outside the equation satisfies the selection criterion.
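A minimal sketch of the forward part of this procedure is given below. It is a simplification under stated assumptions: candidates are ranked by the residual sum of squares of an ordinary least-squares fit, the significance test and removal step of the full stepwise method are omitted, and the loop simply stops after `max_vars` variables. All names and the synthetic data are illustrative.

```python
import numpy as np


def fit_rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y)), X[:, np.asarray(cols, dtype=int)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_selection(X, y, max_vars):
    """Greedily add the descriptor that lowers the RSS most (no removal step)."""
    selected = []
    while len(selected) < max_vars:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        _, j_best = min((fit_rss(X, y, selected + [j]), j) for j in candidates)
        selected.append(j_best)
    return selected


# Synthetic example: only columns 2 and 5 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 8))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 5] + 0.01 * rng.normal(size=40)
selected = forward_selection(X_demo, y_demo, max_vars=2)
print(sorted(selected))
```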

The genetic algorithm (GA) [25–27], introduced by John Holland, has been considered superior to other variable selection techniques. It is a search paradigm inspired by natural evolution, in which the variables are represented as genes on a chromosome (model). It is similar to simplex optimization and evolves a group of random initial models (population) with fitness scores, searching for chromosomes with better fitness functions (response function scores) through natural selection and the genetic operators, mutation and recombination. Natural selection guarantees the propagation of chromosomes with better fitness in future populations. The GA combines genes from two parent chromosomes using the genetic recombination operator to form two new chromosomes (children) that have a high probability of having better fitness than their parents, and it also explores new regions of the response surface (local optima) through mutation. GAs, which offer a combination of hill-climbing ability (natural selection) and a stochastic method (recombination and mutation), are very flexible because they optimize on a representation of the variables, not on the variables themselves. In addition, GAs provide efficient optimization, as they use implicit parallelism to process information quickly and require fewer response function evaluations than other automated numerical optimization algorithms.
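The ideas of population, selection, recombination and mutation can be sketched as follows. This is a toy illustration, not the GA implementation used in the paper: the fitness is taken as the residual sum of squares of a least-squares fit, each recombination produces a single child rather than two, and the population size, number of generations and mutation rate are arbitrary assumptions.

```python
import numpy as np


def fit_rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y)), X[:, np.asarray(cols, dtype=int)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def ga_select(X, y, d, pop_size=30, generations=40, mutation_rate=0.2, seed=0):
    """Evolve chromosomes (sets of d descriptor indices) towards lower RSS."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    pop = [rng.choice(D, size=d, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ch: fit_rss(X, y, ch))   # fittest first
        parents = pop[: pop_size // 2]               # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            genes = np.union1d(parents[i], parents[j])        # recombination pool
            child = rng.choice(genes, size=d, replace=False)
            if rng.random() < mutation_rate:                  # mutation: swap in a new gene
                outside = np.setdiff1d(np.arange(D), child)
                child[rng.integers(d)] = rng.choice(outside)
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda ch: fit_rss(X, y, ch))
    return sorted(int(j) for j in best)


# Synthetic example: only columns 2 and 5 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 8))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 5] + 0.01 * rng.normal(size=40)
best = ga_select(X_demo, y_demo, d=2)
print(best)
```

Because the fitter half of the population is carried over unchanged, the best model found never gets worse from one generation to the next.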

An alternative to the EM, called the Replacement Method, has been introduced [30]. This procedure consists of replacing a chosen variable of the set by another one that minimizes the total standard deviation (Stot), which is the reason for its name (Replacement Method; hereinafter RM). To this end, d descriptors {X1, X2, …, Xd} are chosen at random and a linear regression is performed. One of the descriptors of this set, say Xi, is chosen and replaced by each of the D descriptors of the pool (except itself), keeping the best resulting set. Since one can start by replacing any of the d descriptors in the initial model, a regression equation with d variables has d possible paths to the final result; the choice above, for example, develops into path i. Then, the variable with the greatest relative error in its coefficient is chosen (except the one replaced in the previous step) and replaced by each of the D descriptors (except itself), again keeping the best set. All the remaining variables are replaced in the same way, bypassing those replaced in previous steps. When this is finished, the procedure starts again with the variable having the greatest relative error in its coefficient and repeats the whole process. This is repeated as many times as necessary until the set of descriptors remains unchanged. At the end, the best model for path i is attained. The same is carried out for all possible paths i = 1, 2, …, d; the resulting models are compared and the best one is kept. Our numerical experiments show that in this way one obtains a model almost as good as the best one with far fewer than D!/[(D−d)!d!] linear regressions (where D! and d! denote factorials) when this combinatorial number is large.
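The steps above can be sketched in code as follows. This is one reading of the description, not the authors' implementation: S is taken as the residual standard deviation of an ordinary least-squares fit with an intercept, the relative error of a coefficient is taken as its standard error divided by its absolute value, and a descriptor is kept in place when no replacement lowers S. Function names and the synthetic data are illustrative.

```python
import numpy as np


def ols(X, y, cols):
    """Fit y on an intercept plus X[:, cols]; return (S, relative coefficient errors)."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = float(resid @ resid) / (len(y) - A.shape[1])
    cov = s2 * np.linalg.pinv(A.T @ A)
    rel_err = np.sqrt(np.diag(cov))[1:] / np.abs(coef[1:])
    return float(np.sqrt(s2)), rel_err


def best_replacement(X, y, cols, pos):
    """Try every pool descriptor at position pos and keep the set with the lowest S."""
    best_cols, best_s = list(cols), ols(X, y, cols)[0]
    for j in range(X.shape[1]):
        if j in cols:
            continue
        trial = list(cols)
        trial[pos] = j
        s = ols(X, y, trial)[0]
        if s < best_s:
            best_cols, best_s = trial, s
    return best_cols


def replacement_method(X, y, d, seed=0):
    """Run the d paths from one random initial set and return the best (cols, S)."""
    rng = np.random.default_rng(seed)
    start = [int(j) for j in rng.choice(X.shape[1], size=d, replace=False)]
    results = []
    for first in range(d):                      # path i starts by replacing X_i
        cols, pos = list(start), first
        while True:
            before, done = list(cols), set()
            while len(done) < d:                # one sweep over all d positions
                cols = best_replacement(X, y, cols, pos)
                done.add(pos)
                rest = [p for p in range(d) if p not in done]
                if rest:                        # next: largest relative error not yet replaced
                    rel_err = ols(X, y, cols)[1]
                    pos = max(rest, key=lambda p: rel_err[p])
            if cols == before:                  # set unchanged: path has converged
                break
            pos = int(np.argmax(ols(X, y, cols)[1]))
        results.append((ols(X, y, cols)[0], cols))
    s, cols = min(results)
    return sorted(cols), s


# Synthetic example: only columns 2 and 5 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 8))
y_demo = 3.0 * X_demo[:, 2] - 2.0 * X_demo[:, 5] + 0.01 * rng.normal(size=40)
rm_cols, rm_s = replacement_method(X_demo, y_demo, d=2)
print(rm_cols)
```

Each accepted replacement strictly lowers S, so every path terminates; the number of regressions grows with D and d rather than with the combinatorial number D!/[(D−d)!d!].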

5. Selection of the best statistical model for predicting TD50

The models obtained by each variable selection method were compared. Their statistical significance was determined by examining the regression coefficient, the standard deviation and the ratio between the number of cases and the number of adjustable parameters in the model (ρ) [31]. This statistical parameter should satisfy ρ≥4.
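As a worked check of this criterion, and assuming that the adjustable parameters are the regression coefficients plus the intercept (an assumption that is consistent with the ρ values reported for the seven-variable models), a model with d = 7 descriptors fitted to N = 62 compounds gives:

\[
\rho = \frac{N}{d + 1} = \frac{62}{7 + 1} = 7.75 \geq 4
\]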


In addition, the predictive power of each model was assessed by applying the Leave-One-Out (LOO) and Leave-Group-Out (LGO) cross-validation procedures, which involve extracting one chemical or a group of chemicals (in this case 25% of the data) from the training set at a time. For each resulting reduced data set, the model is re-calculated, and the results for the deleted molecules are predicted from the new model. Statistical parameters, such as the determination coefficient and the standard deviation of the LOO and LGO cross-validation procedures, were calculated to determine the predictivity of each new model generated during the iterative process.
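A minimal sketch of both procedures follows. It assumes the usual definition q² = 1 − PRESS/SS_total, and it builds the LGO groups by splitting a random permutation into four disjoint groups of about 25% each; the paper does not specify either detail, so both are assumptions, as are the function names and the synthetic data.

```python
import numpy as np


def predict_held_out(X, y, test_idx):
    """Refit the linear model without test_idx and predict those compounds."""
    train = np.setdiff1d(np.arange(len(y)), test_idx)
    A = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    return np.column_stack([np.ones(len(test_idx)), X[test_idx]]) @ coef


def q2(X, y, groups):
    """Cross-validated q^2 = 1 - PRESS / SS_total over the held-out groups."""
    pred = np.empty(len(y), dtype=float)
    for g in groups:
        pred[g] = predict_held_out(X, y, g)
    press = float(np.sum((y - pred) ** 2))
    return 1.0 - press / float(np.sum((y - y.mean()) ** 2))


def loo_groups(n):
    """Leave-One-Out: every compound is its own held-out group."""
    return [np.array([i]) for i in range(n)]


def lgo_groups(n, fraction=0.25, seed=0):
    """Leave-Group-Out: disjoint random groups, each about `fraction` of the data."""
    perm = np.random.default_rng(seed).permutation(n)
    return np.array_split(perm, int(round(1.0 / fraction)))


# Synthetic example with a strong linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng.normal(size=40)
q2_loo = q2(X_demo, y_demo, loo_groups(len(y_demo)))
q2_lgo = q2(X_demo, y_demo, lgo_groups(len(y_demo)))
print(round(q2_loo, 3), round(q2_lgo, 3))
```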

6. Orthogonalization procedure

The Randić orthogonalization procedure [32–34] was applied to each variable of the model, obtaining the corresponding orthogonal variables in order to avoid collinearity among descriptors and model over-fitting. Firstly, the order of orthogonalization was chosen as the order in which the variables entered the model according to the statistical technique under study. For example, in the case of the RM technique, the first variable, MATS2v, was taken as the first orthogonal descriptor Ω1MATS2v, and the second one was orthogonalized with respect to it by taking the residual of its correlation with Ω1MATS2v. The process was repeated until all variables were completely orthogonalized; the orthogonal variables were then standardized for use in the new model.
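The sequential procedure can be sketched as below. The sketch assumes that "taking the residual of its correlation" means keeping the residual of a least-squares regression of each descriptor on an intercept and all previously orthogonalized descriptors, and that standardization means scaling to unit sample standard deviation; the function name and toy data are illustrative.

```python
import numpy as np


def orthogonalize(X):
    """Sequentially orthogonalize the columns of X in the given order, then standardize."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    omega = np.empty((n, d))
    for k in range(d):
        # Regress descriptor k on an intercept and the earlier orthogonal
        # descriptors; the residual is the new orthogonal descriptor.
        # For k = 0 this only centres the first descriptor.
        A = np.column_stack([np.ones(n), omega[:, :k]])
        coef, *_ = np.linalg.lstsq(A, X[:, k], rcond=None)
        omega[:, k] = X[:, k] - A @ coef
    return omega / omega.std(axis=0, ddof=1)


# Toy example with deliberately correlated descriptors.
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 3))
X_demo = np.column_stack([Z[:, 0], Z[:, 0] + 0.5 * Z[:, 1], Z[:, 1] - Z[:, 2]])
omega_demo = orthogonalize(X_demo)
print(np.round(np.corrcoef(omega_demo, rowvar=False), 6))
```

The printed correlation matrix of the orthogonalized descriptors should be the identity matrix to within rounding, which is what removes the coefficient instability discussed for collinear descriptors.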

7. Results and discussion

The regression models for predicting the TD50 of nitrosocompounds obtained by the different methods of variable selection are given below:

Model 1 obtained by Forward stepwise.

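\[
-\log \mathrm{TD}_{50} = 5.96 - 26.55 \cdot \text{PW5} - 10.87 \cdot \text{R+2m} + 1.56 \cdot \text{ATS6m} + 2.98 \cdot \text{Mor28m} - 1.17 \cdot \text{ATS8e} - 5.74 \cdot \text{R+3u} - 7.66 \cdot \text{GGI7} \tag{1}
\]

$N = 62$, $R^2 = 0.665$, $S = 0.595$, $F = 15.281$, $p < 10^{-5}$, $\rho = 7.75$, $q^2_{\text{LOO}} = 0.583$, $S_{\text{CV-LOO}} = 0.664$, $q^2_{\text{LGO}} = 0.532$, $S_{\text{CV-LGO}} = 0.713$.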

Model 2 obtained by Genetic Algorithm.

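\[
-\log \mathrm{TD}_{50} = 6.41 - 30.46 \cdot \text{PW5} - 11.63 \cdot \text{R+2m} + 1.56 \cdot \text{ATS6m} - 2.58 \cdot \text{BELm7} + 0.13 \cdot \text{RDF045u} - 6.47 \cdot \text{R+3u} - 1.15 \cdot \text{ATS8e} \tag{2}
\]

$N = 62$, $R^2 = 0.694$, $S = 0.568$, $F = 17.488$, $p < 10^{-5}$, $\rho = 7.75$, $q^2_{\text{LOO}} = 0.616$, $S_{\text{CV-LOO}} = 0.637$, $q^2_{\text{LGO}} = 0.578$, $S_{\text{CV-LGO}} = 0.673$.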

Model 3 obtained by Replacement Method.

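\[
-\log \mathrm{TD}_{50} = 9.90 + 2.88 \cdot \text{MATS2v} - 8.96 \cdot \text{ATS1m} - 7.00 \cdot \text{R+3u} + 1.75 \cdot \text{ATS6m} - 0.29 \cdot \text{RDF050m} - 1.92 \cdot \text{MATS8p} + 10.20 \cdot \text{Mor28v} \tag{3}
\]

$N = 62$, $R^2 = 0.731$, $S = 0.533$, $F = 20.960$, $p < 10^{-5}$, $\rho = 7.75$, $q^2_{\text{LOO}} = 0.672$, $S_{\text{CV-LOO}} = 0.551$, $q^2_{\text{LGO}} = 0.645$, $S_{\text{CV-LGO}} = 0.584$.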

where $N$ is the number of compounds included in the model; $S$ is the standard deviation of the regression; $R^2$ is the squared correlation coefficient; $F$ is the Fisher ratio; $p$ is the significance of the model; $\rho$ is the ratio between the number of compounds in the training set and the number of adjustable parameters in the model; and $S_{\text{CV-LOO}}$, $S_{\text{CV-LGO}}$ and $q^2_{\text{LOO}}$, $q^2_{\text{LGO}}$ are the standard deviations and the squared correlation coefficients of the LOO and LGO cross-validation, respectively.
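As an illustration of how such a model and its statistics can be used, the sketch below evaluates Eq. (3) for a set of descriptor values and recomputes the Fisher ratio from R², N and the number of descriptors. The coefficients are those reported for Eq. (3); the expression used for F is the standard one for a multiple linear regression with an intercept, which is assumed here because the paper does not spell it out, and the function names are illustrative.

```python
# Intercept and coefficients of Eq. (3), as reported in the text.
MODEL3_INTERCEPT = 9.90
MODEL3_COEFS = {
    "MATS2v": 2.88,
    "ATS1m": -8.96,
    "R+3u": -7.00,
    "ATS6m": 1.75,
    "RDF050m": -0.29,
    "MATS8p": -1.92,
    "Mor28v": 10.20,
}


def predict_neg_log_td50(descriptors):
    """Evaluate Eq. (3) for a dict mapping descriptor symbol to its value."""
    return MODEL3_INTERCEPT + sum(
        coef * descriptors[name] for name, coef in MODEL3_COEFS.items()
    )


def f_statistic(r2, n, k):
    """Standard F ratio for a linear model with k descriptors and an intercept."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# With all descriptors at zero the prediction reduces to the intercept.
zeros = {name: 0.0 for name in MODEL3_COEFS}
intercept_only = predict_neg_log_td50(zeros)
f_model3 = f_statistic(0.731, 62, 7)
print(intercept_only, round(f_model3, 2))
```

Under this assumption the recomputed value for model 3 agrees with the reported F = 20.960 to two decimals; real use would require the DRAGON descriptor values of the compound of interest.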

Eqs. (1), (2) and (3) show the best models obtained, from a statistical point of view, by the FS, GA and RM variable selection methods, respectively. The FS analysis led to a model (Eq. (1)) with seven variables and reasonable statistical parameters ($R^2 = 0.665$, $S = 0.595$, $F = 15.281$), but poor predictive ability, judging by the coefficients of determination ($q^2_{\text{LOO}} = 0.583$, $q^2_{\text{LGO}} = 0.532$) and the standard deviations ($S_{\text{CV-LOO}} = 0.664$, $S_{\text{CV-LGO}} = 0.713$) of the LOO and LGO cross-validation procedures. These results improve when the GA is used as the variable selection method. The quality of fit of the resulting model (Eq. (2)) can be judged by the determination coefficient ($R^2 = 0.694$), the standard deviation ($S = 0.568$) and the Fisher statistic ($F = 17.488$). Its predictive ability is characterized by $q^2_{\text{LOO}} = 0.616$, $S_{\text{CV-LOO}} = 0.637$, $q^2_{\text{LGO}} = 0.578$ and $S_{\text{CV-LGO}} = 0.673$.

A comparison of the two above-mentioned methods indicates that the model obtained by GA is superior to that obtained by FS in terms of quality of fit and ability to predict the carcinogenicity of these 62 nitrosocompounds. Similar results were obtained by Saxena, A.K. and Prathipati, P. [35] in a comparative study of stepwise Multiple Linear Regression (stepwise-MLR), Partial Least Squares (PLS) and GA-MLR, in QSAR models derived for datasets of α1-adrenoreceptor antagonists and β3-adrenoreceptor agonists. The authors showed that the variable selection method implemented in GA-MLR performs better than other standard QSAR techniques such as stepwise-MLR and PLS on the datasets studied.

The present paper introduces, for the first time, the RM for obtaining QSAR models in the prediction of carcinogenesis. The RM technique performs better than the FS and GA techniques with regard to the quality of fit and the predictions for this dataset. The statistical parameters shown along with Eq. (3) corroborate that statement. This model is able to explain 73.1% of the experimental variance in the selected data set, that is, 3.7 percentage points more variance than the model obtained by the GA procedure (Eq. (2)) and 6.6 percentage points more than the model obtained by the FS method (Eq. (1)). At the same time, the model represented by Eq. (3) presents the lowest standard deviation ($S = 0.533$) and the highest Fisher statistic ($F = 20.960$). Finally, the model obtained with this method shows the best cross-validation statistics ($q^2_{\text{LOO}} = 0.672$, $S_{\text{CV-LOO}} = 0.551$, $q^2_{\text{LGO}} = 0.645$, $S_{\text{CV-LGO}} = 0.584$), which is why it should be the best model for predicting the carcinogenic potency of untested nitrosocompounds.

Table 4
Determination coefficients in model 3 for the variables not orthogonalized and not standardized

MATS2v    ATS1m     Mor28v    ATS6m     MATS8p    R+3u      RDF050m   Intercept R2        S
1.979                                                                 0.142     0.081     0.9343
2.208     −2.666                                                      1.917     0.181     0.676
2.102     −4.156    5.524                                             3.278     0.298     0.830
2.642     −6.734    7.271     1.121                                   4.883     0.421     0.760
2.427     −7.775    8.596     1.261     −1.011                        5.772     0.481     0.727
2.592     −8.094    9.497     1.248     −1.447    −2.696              7.146     0.530     0.697
2.885     −8.960    10.200    1.755     −1.928    −7.004    −0.295    9.904     0.731     0.533

In conclusion, the QSAR model obtained by using RM as the variable selection procedure (Eq. (3)) displays the best statistical parameters, thereby indicating the superiority of RM with respect to FS and GA for modeling the carcinogenicity of the 62 nitrosocompounds. For that reason, this is the model used from now on.

The analysis of collinearity detected low correlation coefficients among the values of the "independent variables" of the model represented by Eq. (3), as shown in Table 5.

This is possibly another advantage of the RM. Although the independent variables in model 3 present low collinearity, the molecular descriptors are orthogonalized before the model is interpreted. In the authors' view, the collinearity of the variables should be as low as possible, because interrelatedness among the different descriptors can result in highly unstable regression coefficients, which makes it impossible to know the relative importance of a given descriptor in the equations [36]. This is clearly exemplified by the sequential development of model 3, which is discussed in the following.

In addition, the descriptors are standardized in order to determine the real contribution of each variable to the regression equation.

Table 4 shows the determination coefficients (R²), the standard deviations (S) and the regression coefficient of each variable for the non-orthogonalized molecular descriptors in the successive equations obtained “step by step”, together with their constant terms. As can be seen, the regression coefficients vary greatly, as do the intercepts. This instability can produce false results concerning the relative importance of each variable for the studied property. In this sense, it is interesting to remark that the variables that present the biggest variability in the regression coefficients are ATS1m, Mor28v and R+3u, which are also those that present the highest correlation coefficients in the correlation matrix (see Table 5). One way to eliminate this multicollinearity among the variables and the resulting instability in the regression coefficients of the models, and thus to obtain the most realistic model, is to orthogonalize the molecular descriptors. An orthogonalization process of this type was introduced by Randić to improve the statistical interpretation of models built with interrelated descriptors [32–34]. This procedure has been used satisfactorily in several reports [37–46]. Table 6 summarizes the results of the orthogonalization and standardization of the molecular descriptors included in model 3. In this case, the last row corresponds to the final model with the orthogonalized molecular descriptors. One can observe the great stability of the regression coefficients after the orthogonalization of the molecular descriptors.

Table 5. Correlation matrix for the seven variables of model 3

| | ATS1m | ATS6m | MATS2v | MATS8p | RDF050m | Mor28v | R3u+ |
|---|---|---|---|---|---|---|---|
| ATS1m | 1.00 | 0.34 | 0.15 | 0.19 | 0.03 | 0.58 | 0.08 |
| ATS6m | | 1.00 | 0.12 | 0.07 | 0.25 | 0.18 | 0.12 |
| MATS2v | | | 1.00 | 0.17 | 0.06 | 0.08 | 0.17 |
| MATS8p | | | | 1.00 | 0.11 | 0.12 | 0.41 |
| RDF050m | | | | | 1.00 | 0.07 | 0.61 |
| Mor28v | | | | | | 1.00 | 0.13 |
| R3u+ | | | | | | | 1.00 |

Table 6. Determination coefficients in model 4 for orthogonal and standardized variables

| Ω1MATS2v | Ω2ATS1m | Ω3Mor28v | Ω4ATS6m | Ω5MATS8p | Ω6R+3u | Ω7RDF050m | Intercept | R² | S |
|---|---|---|---|---|---|---|---|---|---|
| 0.276 | | | | | | | −0.028 | 0.081 | 0.9343 |
| 0.276 | −0.305 | | | | | | −0.028 | 0.181 | 0.676 |
| 0.276 | −0.305 | 0.331 | | | | | −0.028 | 0.298 | 0.830 |
| 0.276 | −0.305 | 0.331 | 0.339 | | | | −0.028 | 0.421 | 0.760 |
| 0.276 | −0.305 | 0.331 | 0.339 | −0.235 | | | −0.028 | 0.481 | 0.727 |
| 0.276 | −0.305 | 0.331 | 0.339 | −0.235 | −0.216 | | −0.028 | 0.530 | 0.697 |
| 0.276 | −0.305 | 0.331 | 0.339 | −0.235 | −0.216 | −0.433 | −0.028 | 0.731 | 0.533 |

The QSAR model obtained after the orthogonalization and standardization process is given below together with its statistical parameters.

$$
\begin{aligned}
-\log \mathrm{TD}_{50} = {} & -0.03 + 0.27\,\Omega_{1}\mathrm{MATS2v} - 0.31\,\Omega_{2}\mathrm{ATS1m} + 0.33\,\Omega_{3}\mathrm{Mor28v} + 0.34\,\Omega_{4}\mathrm{ATS6m} \\
& - 0.24\,\Omega_{5}\mathrm{MATS8p} - 0.22\,\Omega_{6}\mathrm{R{+}3u} - 0.43\,\Omega_{7}\mathrm{RDF050m}
\end{aligned}
\tag{4}
$$

N = 62, R² = 0.731, S = 0.533, F = 20.960, p < 10⁻⁵, ρ = 7.75, q²(LOO) = 0.672, S(CV-LOO) = 0.551, q²(LGO) = 0.645, S(CV-LGO) = 0.584.

The previous regression equation (Eq. (4)) represents the best theoretical QSTR model obtained by us for predicting the carcinogenic potency (TD50) of nitrosocompounds bioassayed in female rats with water as the administration route. This model presents acceptable statistical parameters, allowing us to explain 73.10% of the experimental variance, and it also has adequate predictive power. However, the outlier extraction procedure showed that the following compounds have large values of residual, standard residual and deleted residual, and they were therefore considered outliers: 3,6-Dihydro-2-nitroso-2H-1,2-oxazine, Diallylnitrosamine, N-Nitrosodiethanolamine, N-Nitrosopiperazine, N-Nitroso-N-isobutylurea, and N-Nitroso-2-hydroxymorpholine.
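The three residual diagnostics named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' workflow: the definitions used (standard residual = residual divided by the root mean square error, deleted residual = residual divided by one minus the leverage) are the common textbook ones, the function name `residual_diagnostics` is hypothetical, the data are synthetic with one planted outlier, and the cutoff of 2 is an arbitrary choice, since the text does not state the threshold that was applied.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Ordinary, standard and deleted residuals of a least-squares fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    leverage = np.diag(A @ np.linalg.pinv(A))            # diagonal of the hat matrix
    s = np.sqrt(resid @ resid / (n - A.shape[1]))         # root mean square error
    return {
        "residual": resid,
        "standard_residual": resid / s,
        "deleted_residual": resid / (1.0 - leverage),     # error when the point is left out
    }

# Synthetic example with one planted outlier at index 5 (illustration only).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng.normal(size=30)
y_demo[5] += 3.0

diag = residual_diagnostics(X_demo, y_demo)
flagged = np.flatnonzero(np.abs(diag["standard_residual"]) > 2.0)   # assumed cutoff
print("flagged indices:", flagged)
```

The deleted residual is the prediction error a compound would have if the model were refitted without it, which is why it is a useful companion to the ordinary residual when looking for influential points.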

These outliers were removed (9.68% of the data set), yielding a final training set of 56 chemicals. The new model thus obtained has good statistical parameters, and its equation is given below together with the statistical parameters of the regression.

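$$
\begin{aligned}
-\log \mathrm{TD}_{50} = {} & 0.09 + 0.23\,\Omega_{1}\mathrm{MATS2v} - 0.33\,\Omega_{2}\mathrm{ATS1m} + 0.34\,\Omega_{3}\mathrm{Mor28v} + 0.3\,\Omega_{4}\mathrm{ATS6m} \\
& - 0.22\,\Omega_{5}\mathrm{MATS8p} - 0.2\,\Omega_{6}\mathrm{R{+}3u} - 0.49\,\Omega_{7}\mathrm{RDF050m}
\end{aligned}
\tag{5}
$$

N = 56, R² = 0.843, S = 0.396, F = 36.884, p < 10⁻⁵, ρ = 7.75, q²(LOO) = 0.780, S(CV-LOO) = 0.470, q²(LGO) = 0.754, S(CV-LGO) = 0.525.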

As can be seen, this new model (Eq. (5)) explains 84.3% of the experimental variance of the data, with a lower standard error (S = 0.396). The quality of its predictive power may also be assessed by means of the LOO and LGO cross-validation parameters (q²(LOO) = 0.780, q²(LGO) = 0.754, S(CV-LOO) = 0.470, S(CV-LGO) = 0.525). This clearly demonstrates that the final model is statistically more powerful than the previous model (Eq. (4)).
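For readers unfamiliar with the cross-validation parameters quoted above, the following sketch shows one common way to compute a leave-one-out (LOO) or leave-group-out (LGO) q². It assumes the usual definition q² = 1 − PRESS/SS_tot, where PRESS is the sum of squared prediction errors for the left-out compounds. The function names, the synthetic data and the random five-group split are all assumptions, since the text does not state how the groups were formed.

```python
import numpy as np

def _fit_predict(X_train, y_train, X_test):
    """Least-squares fit with intercept on the training part, prediction on the test part."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def q2_cross_validated(X, y, groups):
    """q2 = 1 - PRESS / SS_tot, leaving out each index group in turn."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0
    for g in groups:
        keep = np.ones(len(y), dtype=bool)
        keep[g] = False
        pred = _fit_predict(X[keep], y[keep], X[~keep])
        press += np.sum((y[~keep] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def loo_groups(n):
    """One group per compound."""
    return [np.array([i]) for i in range(n)]

def lgo_groups(n, n_groups, seed=0):
    """Random partition into n_groups groups (the split used in the paper is not stated)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), n_groups)

# Synthetic data (illustration only).
rng = np.random.default_rng(7)
X_cv = rng.normal(size=(56, 3))
y_cv = 0.5 + X_cv @ np.array([1.5, -1.0, 0.5]) + 0.3 * rng.normal(size=56)

q2_loo = q2_cross_validated(X_cv, y_cv, loo_groups(56))
q2_lgo = q2_cross_validated(X_cv, y_cv, lgo_groups(56, 5))
print(round(q2_loo, 3), round(q2_lgo, 3))
```

Because every left-out prediction error is at least as large in magnitude as the corresponding fitted residual, the LOO q² can never exceed R², which is consistent with the values reported for Eqs. (4) and (5).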

The seven descriptors involved in Eq. (5) can be classified as follows: (i) four 2-dimensional descriptors, all of them 2D autocorrelations (Moran autocorrelation, lag 2, weighted by atomic van der Waals volumes, MATS2v; Broto–Moreau autocorrelation of a topological structure, lag 1, weighted by atomic masses, ATS1m; Broto–Moreau autocorrelation of a topological structure, lag 6, weighted by atomic masses, ATS6m; and Moran autocorrelation, lag 8, weighted by atomic polarizabilities, MATS8p) and (ii) three 3-dimensional descriptors: one 3D-MoRSE descriptor (3D-MoRSE, signal 28, weighted by atomic van der Waals volumes, Mor28v), one GETAWAY descriptor (R maximal autocorrelation of lag 3, unweighted, R+3u) and one RDF descriptor (Radial Distribution Function, 5.0, weighted by atomic masses, RDF050m).

The QSTR model in Eq. (5) includes descriptors of different kinds, each of which encodes a specific structural feature. For instance, ATS1m and ATS6m (Autocorrelation of a Topological Structure) are molecular descriptors defined by Broto and Moreau. They are calculated from the molecular graph by summing the products wi·wj (where w stands for the atomic weight) over all pairs of atoms i and j whose topological distance (lag) equals 1 and 6, respectively, using atomic masses (m) as the weighting scheme. MATS2v and MATS8p, defined by Moran, are calculated in an analogous way, but with path lengths of 2 and 8, respectively, and with van der Waals volumes (v) and atomic polarizabilities (p) as the weighting schemes.
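The autocorrelation sums described above can be made concrete with a short sketch. It is an illustration under assumptions, not the DRAGON implementation: the molecular graph is hydrogen-suppressed, the Broto–Moreau value is the plain sum of products (the software may apply additional scaling or a logarithm), and the Moran index uses the standard mean-centred, variance-normalized form, which is how it differs from the Broto–Moreau sum. The example molecule, N-nitrosodimethylamine, and all function names are chosen here for illustration only.

```python
from collections import deque

def topological_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (in bonds) by breadth-first search."""
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[-1] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[start][v] < 0:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist

def _pairs_at_lag(dist, lag):
    n = len(dist)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] == lag]

def broto_moreau(weights, dist, lag):
    """Sum of w_i * w_j over all atom pairs at topological distance `lag`."""
    return sum((weights[i] * weights[j] for i, j in _pairs_at_lag(dist, lag)), 0.0)

def moran(weights, dist, lag):
    """Moran index: mean centred cross-product at `lag` divided by the weight variance."""
    n = len(weights)
    mean = sum(weights) / n
    dev = [w - mean for w in weights]
    variance = sum(d * d for d in dev) / n
    pairs = _pairs_at_lag(dist, lag)
    if not pairs or variance == 0.0:
        return 0.0   # convention assumed here when no pair exists at this lag
    return (sum(dev[i] * dev[j] for i, j in pairs) / len(pairs)) / variance

# N-nitrosodimethylamine, heavy atoms only: C0, C1, N2, N3, O4.
bonds = [(0, 2), (1, 2), (2, 3), (3, 4)]
masses = [12.011, 12.011, 14.007, 14.007, 15.999]
dist = topological_distances(5, bonds)

ats1 = broto_moreau(masses, dist, 1)   # four bonded pairs contribute
ats6 = broto_moreau(masses, dist, 6)   # no pair is six bonds apart in this molecule
mats2 = moran(masses, dist, 2)
print(ats1, ats6, mats2)
```

A lag-6 term only becomes non-zero for molecules containing atoms at least six bonds apart, so ATS6m carries information about the longer chains and rings in the data set.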

The Mor28v, R+3u and RDF050m descriptors are also important because they take into account the 3D arrangement of the atoms without ambiguities (such as those that appear when using chemical graphs), and because they do not depend on the molecular size. They are therefore applicable to a large number of molecules with great structural variance, and this characteristic is common to all of them. The descriptor R+3u belongs to the GETAWAY (GEometry, Topology and Atom-Weights AssemblY) descriptors, specifically to the R-GETAWAY class. These descriptors are derived from the influence/distance matrix (R), in which the elements of the molecular influence matrix are combined with those of the geometry matrix [47]. R+3u represents an R index of maximal contribution to the autocorrelation at lag 3 (topological distance). This descriptor is expected to have a low dependence on conformational changes, since it encodes information on pairs of atoms that are very near each other. For this reason, the carcinogenic potency in this database can be said to depend only slightly on conformational changes.

Another descriptor that enters Eq. (5) is RDF050m, which belongs to the Radial Distribution Function (RDF) descriptors. Formally, the RDF of an ensemble of A atoms can be interpreted as the probability distribution of finding an atom in a spherical volume of radius R. For the RDF050m descriptor, the sphere radius is 5.0 Å and the atomic weights are atomic masses (m).

Mor28v is classified as a 3D-MoRSE descriptor (3D-MOlecule Representation of Structures based on Electron diffraction) and shares certain characteristics with the RDF descriptors. These descriptors are based on the idea of obtaining information from the 3D atomic coordinates by resorting to the transforms used in electron diffraction studies for preparing theoretical scattering curves. A generalized scattering function, called the molecular transform, can be used as the functional basis for deriving, from a known molecular structure, the specific analytic relationship of both X-ray and electron diffraction.
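The general form of the two 3D descriptors discussed above can be sketched in a few lines. This is a hedged illustration rather than the DRAGON code: the RDF is written as a sum of Gaussians centred on the interatomic distances, with an assumed smoothing parameter `beta` of 100 Å⁻² and a scale factor of 1, and the 3D-MoRSE signal as a sum of sin(s·r)/(s·r) terms. The mapping of "signal 28" to a particular scattering value s, the atom coordinates and the function names are all assumptions.

```python
import numpy as np

def pair_distances(coords):
    """Indices and Euclidean distances of all unordered atom pairs."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)
    return i, j, np.linalg.norm(coords[i] - coords[j], axis=1)

def rdf(coords, weights, R, beta=100.0):
    """Radial distribution function value at radius R (in angstrom)."""
    i, j, r = pair_distances(coords)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w[i] * w[j] * np.exp(-beta * (R - r) ** 2)))

def morse(coords, weights, s):
    """3D-MoRSE signal at scattering parameter s (in reciprocal angstrom)."""
    i, j, r = pair_distances(coords)
    w = np.asarray(weights, dtype=float)
    # np.sinc(x) = sin(pi x) / (pi x), so np.sinc(s r / pi) = sin(s r) / (s r)
    return float(np.sum(w[i] * w[j] * np.sinc(s * r / np.pi)))

# Hypothetical three-atom geometry (coordinates in angstrom) with atomic masses as weights.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 1.2, 0.0)]
masses = [12.011, 14.007, 15.999]

rdf_at_5 = rdf(coords, masses, R=5.0)      # dominated by the pair separated by 5.0 angstrom
morse_at_27 = morse(coords, masses, s=27.0)
print(rdf_at_5, morse_at_27)
```

Only atom pairs whose separation is close to R contribute noticeably to the RDF value, so a descriptor evaluated at 5.0 Å responds to the number and mass of atom pairs roughly that far apart.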

8. Conclusions

The carcinogenic potency of a set of 62 nitrosocompounds bioassayed in female rats was modeled by Multiple Regression Analysis, using three methods of variable selection, namely Forward Stepwise Regression, Genetic Algorithms and the Replacement Method.

The RM, an alternative of the Elimination Method, is introduced for the first time for predicting the carcinogenic potential. The QSAR model obtained by RM showed the best statistical parameters, thereby revealing its superiority over FS and GA for modeling the carcinogenicity of the present 62 nitrosocompounds. The final QSAR model was able to explain more than 84% of the experimental variance after the removal of 6 chemicals that were shown to be outliers.

Acknowledgements

The authors would like to express their sincere gratitude to the anonymous reviewers, whose comments helped improve the quality of the final manuscript, and to the editor, Professor R. Tauler, for useful comments and kind attention.

References

[1] R.D. Combes, Toxicol. In Vitro 14 (2000) 387–399.
[2] J. Yang, P. Duerksen-Hughes, Carcinogenesis 19 (1998) 1117–1125.
[3] H.L. Wong, S.E. Murphy, M. Wang, S.S. Hecht, Carcinogenesis 24 (2003) 291–300.
[4] H. Bartsch, B. Ohshima, B. Pignatelli, Mutat. Res. 202 (1998) 307–324.
[5] S. González-Mancebo, J. Gaspar, E. Calle, S. Pereira, A. Mariano, J. Rueff, J. Casado, Mutat. Res. 558 (2004) 45–51.
[6] J.C. Louis, Biosilico 1 (2003) 115–116.
[7] J.C. Dearden, M.D. Barratt, R. Benigni, D.W. Bristol, R.D. Combes, M.T.D. Cronin, P.N. Judson, M.P. Payne, A.M. Richard, M. Tich, A.P. Worth, J.J. Yourick, ATLA 25 (1997) 223–252.
[8] A.H. Morales, M.A. Cabrera, M.P. González, R.R. Molina, H.D. González, Bioorg. Med. Chem. 13 (2005) 2477–2488.
[9] R. Benigni, SAR and QSAR of mutagens and carcinogens: understanding actions mechanism and improving risk assessment, Ed. CRC LLC, 2003, Chapter 9, pp. 260–282.
[10] L.S. Gold, E. Zeiger, Handbook of Carcinogenic Potency and Genotoxicity Databases, CRC Press, Boca Raton, FL, 1997.
[11] L.S. Gold, N.B. Manley, T.H. Slone, L. Rohrbach, Environ. Health Perspect., Suppl. 107 (Suppl. 4) (1999) 527–602.
[12] E. Gottmann, S. Kramer, B. Pfahringer, Ch. Helma, Environ. Health Perspect. 109 (2001) 1–11.
[13] R. Benigni, A. Giuliani, Environ. Carcinog. Ecotoxicol. Rev. 17 (1999) 45–67.
[14] B.C. Allen, K.S. Crump, A.M. Shipp, Risk Anal. 8 (1988) 531–544.
[15] G. Goodman, R. Wilson, Regul. Toxicol. Pharmacol. 14 (1991) 118–146.
[16] R. Todeschini, V. Consonni, M. Pavan, DRAGON, Software version 2.1, 2002.
[17] J.J.P. Stewart, MOPAC Manual, 6th ed., Frank J. Seiler Research Laboratory, U.S. Air Force Academy, Colorado Springs, CO, 1990, p. 189.
[18] R. Todeschini, V. Consonni (Eds.), Handbook of Molecular Descriptors (Methods and Principles in Medicinal Chemistry), Wiley-VCH, Weinheim, 2000.
[19] StatSoft, Inc., STATISTICA 6.0, version 6.0, 2002.
[20] D.J. Livingstone, E. Rahr, Quant. Struct.-Act. Relatsh. 8 (1989) 103–108.
[21] J.W. McFarland, D.J. Gans, Quant. Struct.-Act. Relatsh. 13 (1994) 11–17.
[22] K. Héberger, R. Rajkó, SAR-QSAR Environ. Res. 13 (2002) 541–554.
[23] K. Héberger, R. Rajkó, J. Chemom. 16 (2002) 436–443.
[24] N.R. Draper, H. Smith, Applied Regression Analysis, Second edition, John Wiley and Sons, New York, 1981.
[25] H. Kubinyi, J. Chemom. 10 (1996) 119–133.
[26] H. Kubinyi, Quant. Struct.-Act. Relatsh. 13 (1994) 285–294.
[27] H. Kubinyi, Quant. Struct.-Act. Relatsh. 13 (1994) 393–401.
[28] P.R. Duchowicz, F.M. Fernández, E.A. Castro, J. Fluorine Chem. 125 (2004) 43–48.
[29] I.V. Nesterov, A.A. Toropov, P.R. Duchowicz, E.A. Castro, The Scientific World Journal (in press).
[30] P.R. Duchowicz, F.M. Fernández, E.A. Castro, MATCH Commun. Math. Comput. Chem. 55 (2006) 179–192.
[31] R. Garcia-Domenech, J.V. Julian-Ortiz, J. Chem. Inf. Comput. Sci. 38 (1998) 445–449.
[32] M. Randić, J. Chem. Inf. Comput. Sci. 31 (1991) 311–320.
[33] M. Randić, New J. Chem. 15 (1991) 517–525.
[34] M. Randić, J. Mol. Struct. (Theochem) 233 (1991) 45–59.
[35] A.K. Saxena, P. Prathipati, SAR QSAR Environ. Res. 14 (2003) 433–445.
[36] A.H. Morales, M.A. Cabrera, R.D. Combes, M.P. González, Curr. Comp. Aid. Drug Des. 3 (2005) 237–255.
[37] H. González-Díaz, R. Ramos de Armas, R. Molina, Bioinformatics 19 (2003) 2079–2087.
[38] H. González-Díaz, E. Uriarte, Biopolymers 77 (2005) 296–303.
[39] H. González-Díaz, A. Perez-Bello, E. Uriarte, Polymer 46 (2005) 6461–6473.
[40] H. González-Díaz, I. Bastida, N. Castañedo, O. Nasco, E. Olazabal, A. Morales, H.S. Serrano, R. Ramos de Armas, Bull. Math. Biol. 66 (2004) 1285–1311.
[41] M.P. González, C. Terán, Y. Fall, M. Teijeira, P. Besada, Bioorg. Med. Chem. 13 (2005) 601–608.
[42] M.P. González, P.L. Suarez, Y. Fall, G. Gomez, Bioorg. Med. Chem. Lett. 15 (2005) 5165–5169.
[43] M.P. González, A.H. Morales, R. Medina, R.R. Molina, Internet Electron. J. Mol. Des. 4 (2004) 200–208.
[44] A.H. Morales, M.P. González, J.B. Rieumont, Polymer 45 (2004) 2045–2050.
[45] M.P. González, C. Terán, M. Teijeira, M.J. González-Moa, Eur. J. Med. Chem. 40 (2005) 1080–1086.
[46] H. González-Díaz, Y. Marrero, I. Hernandez, I. Bastida, E. Tenorio, O. Nasco, E. Uriarte, N. Castañedo, M.A. Cabrera, E. Aguila, O. Marrero, A. Morales, M.P. González, Chem. Res. Toxicol. 16 (2003) 1318–1327.
[47] V. Consonni, R. Todeschini, M. Pavan, J. Chem. Inf. Comput. Sci. 42 (2002) 682–692.