

Ecotoxicology and Environmental Safety 112 (2015) 39–45

Contents lists available at ScienceDirect

Ecotoxicology and Environmental Safety


journal homepage: www.elsevier.com/locate/ecoenv

Optimal descriptor as a translator of eclectic data into prediction of cytotoxicity for metal oxide nanoparticles under different conditions

Alla P. Toropova a, Andrey A. Toropov a,*, Robert Rallo b, Danuta Leszczynska c, Jerzy Leszczynski d

a IRCCS, Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
b Departament d′Enginyeria Informatica i Matematiques, Universitat Rovira i Virgili, Av. Països Catalans, 26, 43007 Tarragona, Catalunya, Spain
c Interdisciplinary Nanotoxicity Center, Department of Civil and Environmental Engineering, Jackson State University, 1325 Lynch Street, Jackson, MS 39217-0510, USA
d Interdisciplinary Nanotoxicity Center, Department of Chemistry and Biochemistry, Jackson State University, 1400 JR Lynch Street, P.O. Box 17910, Jackson, MS 39217, USA

Article info

Article history:
Received 14 July 2014
Received in revised form 1 October 2014
Accepted 3 October 2014

Keywords:
QSAR
Quasi-SMILES
Quasi-QSAR, Nano-QSAR
Monte Carlo method
Cytotoxicity
Metal oxide nanoparticle

http://dx.doi.org/10.1016/j.ecoenv.2014.10.003
0147-6513/© Elsevier Inc. All rights reserved.

* Corresponding author. E-mail address: [email protected] (A.A. Toropov).

Abstract

The Monte Carlo technique has been used to build quantitative structure–activity relationships (QSARs) for the prediction of dark cytotoxicity and photo-induced cytotoxicity of metal oxide nanoparticles to the bacterium Escherichia coli. The endpoint is pLC50, the negative logarithm of the lethal concentration for 50% of the bacteria (LC50 in mol/L). The representation of nanoparticles includes (i) in the case of dark cytotoxicity, a simplified molecular input-line entry system (SMILES) string, and (ii) in the case of photo-induced cytotoxicity, a SMILES string plus the symbol ‘^’. The predictability of the approach is checked with six random distributions of the available data into the visible training and calibration sets and the invisible validation set. The statistical characteristics of these models are a correlation coefficient of 0.90–0.94 (training set) and 0.73–0.98 (validation set).

© Elsevier Inc. All rights reserved.

1. Introduction

Nanomaterials have become important components of modern everyday life. This requires studies that reveal their characteristics and provide guidelines to facilitate their safe application. Predictive models for nanomaterials can be useful for theoretical and practical reasons (Randic, 1991; Cosentino et al., 2000; Balaban et al., 2005; Ivanciuc et al., 2006; Tetko et al., 2008; Bhhatarai et al., 2010; Das and Trinajstic, 2010; Mitra et al., 2010; Duchowicz et al., 2011; Furtula and Gutman, 2011; Afantitis et al., 2011; Toropov et al., 2012b,c; Liu et al., 2013; Cohen et al., 2013; Toropova and Toropov, 2014) to the same extent as models for “classic” substances (organic, inorganic, organometallic) have been.

Many of the suggested approaches for building quantitative structure–property/activity relationships (QSPRs/QSARs) for nanomaterials were obtained with “classic” descriptors (Fourches et al., 2010; Petrova et al., 2011) that had been tested on “classic” substances. However, owing to the uncertainty of the molecular architecture of nanomaterials, the development of fresh “nanodescriptors” (Leszczynski, 2010; Toropova and Toropov, 2013) has become a necessary task for modern computational approaches to the problem. An attractive and innovative alternative to “classic” descriptors is the use of optimal descriptors calculated from available eclectic data (Toropova et al., 2013; Toropova and Toropov, 2013).

Optimal descriptors (Toropova et al., 2010, 2011, 2012a; Toropov et al., 2010a,b, 2013a,b) can be considered a transitional step between “classic” descriptors and “nanodescriptors”. On the one hand, these descriptors can be calculated from data on the molecular structure (i.e. just as “classic” descriptors are); on the other hand, they can be computed from eclectic information about a substance, even without detailed data on its molecular structure (Toropov et al., 2007; Toropova and Toropov, 2013).

Data on various nanoparticles can be represented by special strings that encode the physicochemical and biochemical conditions under which the nanoparticles act. These SMILES-like strings can be named “quasi-SMILES”, since they represent conditions, in contrast to traditional SMILES, which represent solely the molecular structure.

The paradigm for traditional QSPR/QSAR analyses could be expressed as:

Table 1. Upper triangle of percentages of identity for random splits.

| Split | Set | Split 1 | Split 2 | Split 3 | Split 4 | Split 5 | Split 6 |
|---|---|---|---|---|---|---|---|
| Split 1 | Training | 100.0 a | 72.3 | 72.7 | 65.1 | 57.8 | 69.8 |
| | Calibration | 100.0 | 16.7 | 0.0 | 33.3 | 0.0 | 16.7 |
| | Validation | 100.0 | 16.7 | 33.3 | 15.4 | 16.7 | 15.4 |
| Split 2 | Training | | 100.0 | 76.2 | 58.5 | 69.8 | 58.5 |
| | Calibration | | 100.0 | 42.9 | 0.0 | 30.8 | 28.6 |
| | Validation | | 100.0 | 16.7 | 30.8 | 33.3 | 30.8 |
| Split 3 | Training | | | 100.0 | 53.7 | 65.1 | 68.3 |
| | Calibration | | | 100.0 | 0.0 | 30.8 | 28.6 |
| | Validation | | | 100.0 | 30.8 | 0.0 | 15.4 |
| Split 4 | Training | | | | 100.0 | 52.4 | 70.0 |
| | Calibration | | | | 100.0 | 15.4 | 0.0 |
| | Validation | | | | 100.0 | 30.8 | 42.9 |
| Split 5 | Training | | | | | 100.0 | 61.9 |
| | Calibration | | | | | 100.0 | 15.4 |
| | Validation | | | | | 100.0 | 30.8 |
| Split 6 | Training | | | | | | 100.0 |
| | Calibration | | | | | | 100.0 |
| | Validation | | | | | | 100.0 |

a The percentage of identity is calculated as

$$\mathrm{Identity}(\%) = \frac{N_{i,j}}{0.5\,(N_i + N_j)} \times 100$$

where $N_{i,j}$ is the number of substances distributed into the same set for both the i-th and the j-th splits (set = training, calibration, or validation); $N_i$ is the number of substances distributed into the set for the i-th split; and $N_j$ is the number of substances distributed into the set for the j-th split.
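The identity measure from the Table 1 footnote can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `identity_percent` is hypothetical, and the two sets are the calibration sets of splits 1 and 2 as read from the split columns of Table 2.

```python
def identity_percent(set_i, set_j):
    """Identity(%) = N_ij / (0.5 * (N_i + N_j)) * 100 for one set type of two splits."""
    set_i, set_j = set(set_i), set(set_j)
    shared = len(set_i & set_j)  # N_ij: substances placed in this set by both splits
    return shared / (0.5 * (len(set_i) + len(set_j))) * 100.0


# Calibration sets of split 1 and split 2, read from the 'c' marks in Table 2.
calibration_split1 = {"O=[Y]O[Y]=O", "O=[Si]=O", "O=[V]O[V]=O^",
                      "O=[Bi]O[Bi]=O^", "O=[Si]=O^"}
calibration_split2 = {"[Cu]=O", "O=[Bi]O[Bi]=O", "O=[Sb]O[Sb]=O", "O=[Zr]=O",
                      "O=[Cr]O[Cr]=O", "O=[V]O[V]=O^", "O=[Zr]=O^"}

value = identity_percent(calibration_split1, calibration_split2)
print(round(value, 1))
```

With one shared quasi-SMILES and set sizes of 5 and 7, the result rounds to the 16.7 reported in Table 1 for the calibration sets of splits 1 and 2.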


Endpoint = F(Molecular Structure)

In the case of nanomaterials, the paradigm can be modified as follows:

Endpoint = F(Available Eclectic Data)

The available eclectic data can be (i) the molecular structure of the substances involved in the phenomenon under consideration; (ii) the presence or absence of photo-inducing; and (iii) any other circumstances able to influence the phenomenon under consideration (Toropova and Toropov, 2013; Toropov and Toropova, 2014).

Consequently, one can define the following hybrid paradigm:

Endpoint = F(Molecular Structure and Available Eclectic Data)

Since the above-mentioned quasi-SMILES are the basis for establishing correlations between the impacts that define the behavior of metal oxide nanoparticles (these are not only data on the molecular structure, but any available eclectic data with influence upon the nanoparticles), these correlations can be named “quasi-QSARs” or “nano-QSARs”. In the present work, the only eclectic factor is the presence or absence of photo-inducing; however, the number of eclectic components of a quasi-QSAR or nano-QSAR can be larger (Toropova and Toropov, 2013; Toropov and Toropova, 2014).

The aim of the present study is to build a unified QSAR model for dark cytotoxicity and photo-induced cytotoxicity of metal oxide nanoparticles to the bacterium Escherichia coli, using optimal descriptors that are a mathematical function of the atomic composition and the conditions (i.e. dark or photo-induced).

2. Method

2.1. Data

The numerical data on the cytotoxicity of metal oxide nanoparticles to the bacterium E. coli (LC50, the concentration of nanoparticles that proved fatal to 50% of the bacteria, in mol/L) have been taken from the literature (Pathakoti et al., 2014). The negative decimal logarithm of LC50 (pLC50) has been examined as the endpoint. Six random distributions of the available data into training and calibration sets (these metal oxide nanoparticles are used to build the model) and a validation set (these metal oxide nanoparticles are not involved in building the model; they are used to check its predictability) are examined. All these splits are prepared according to the following principles: (i) they are random; (ii) the range of the endpoint in each sub-set is similar to the ranges for the other sub-sets; and (iii) the splits are not identical (Table 1). The dark cytotoxicity and photo-induced cytotoxicity are examined as a unified endpoint, because the model is a mathematical function of the atomic composition and the conditions (presence or absence of photo-inducing).

2.2. Optimal descriptors

In order to take the photo-induction into account, the symbol ‘^’ is used. Thus, the SMILES used in this work are not equivalent to the traditionally used ones (Weininger, 1988, 1990; Weininger et al., 1989). Under such circumstances, the term ‘quasi-SMILES’ is used for the representation of metal oxide nanoparticles, because a quasi-SMILES represents data on the molecular structure together with a condition: the presence or absence of photo-inducing. The presence of photo-inducing is indicated by the symbol ‘^’, which is added at the end of the traditional SMILES (Table 2).

Thus the optimal descriptors have been calculated as follows:

$$\mathrm{DCW}(T, N) = \sum_{k} \mathrm{CW}(A_k) \qquad (1)$$

where $A_k$ is an attribute of the quasi-SMILES that comprises one symbol (e.g. ‘O’, ‘V’, etc.) or two symbols that should be examined as one (e.g. ‘Cu’, ‘Al’, etc.). In the case of dark cytotoxicity, nanoparticles are represented by the SMILES of the ACD/ChemSketch software (ACD/I-LAB, 2014); in the case of photo-induced cytotoxicity, nanoparticles are represented by the SMILES of the ACD/ChemSketch software (ACD/I-LAB, 2014) plus the symbol ‘^’ (Table 2).

CW(x) is the correlation weight for an attribute x that is extracted from a quasi-SMILES; T is the threshold that divides attributes into two categories, rare (noise) and not rare; N is the number of epochs of the Monte Carlo optimization. Correlation weights are calculated for the attributes that are not rare by the Monte Carlo optimization, which maximizes the determination coefficient between DCW(T,N) and pLC50 for the calibration set. The preferable values T* and N*, which provide the best statistics for the calibration set, should be defined in the preliminary phase of the QSAR analysis (Toropova et al., 2011). Having T*, N*, and the CW(x) that give the maximum of the determination coefficient for the calibration set, one can define (using data from the training set) the following model:

$$\mathrm{pLC50} = C_0 + C_1 \cdot \mathrm{DCW}(T^{*}, N^{*}) \qquad (2)$$

The predictability of the model should be checked with an external validation set.
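The idea of optimizing correlation weights by a Monte Carlo search can be sketched as follows. This is a minimal stochastic hill-climbing sketch under stated assumptions, not the CORAL algorithm: the function names, the step size, the starting weights of 1.0, and the acceptance rule are all hypothetical, and the closing bracket ‘]’ is folded into the attribute ‘[’ because Table 4 lists ‘[’ twice for [Cu]=O. The six data points are taken from Table 2; which set serves as the optimization target is left to the caller.

```python
import random
import re

TOKEN = re.compile(r"[A-Z][a-z]?|[\[\]=^]")


def attributes(quasi_smiles):
    # Assumption: ']' is counted as the attribute '[' (inferred from Table 4).
    return ["[" if t == "]" else t for t in TOKEN.findall(quasi_smiles)]


def dcw(quasi_smiles, weights):
    # Eq. (1): sum of the correlation weights of all attributes.
    return sum(weights.get(a, 0.0) for a in attributes(quasi_smiles))


def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def score(data, weights):
    return r_squared([dcw(s, weights) for s, _ in data], [y for _, y in data])


def optimise(data, n_epochs, step=0.5, seed=0):
    """Perturb one weight at a time; keep a change only if r^2 improves."""
    rng = random.Random(seed)
    names = sorted({a for s, _ in data for a in attributes(s)})
    weights = {a: 1.0 for a in names}
    best = score(data, weights)
    for _ in range(n_epochs):  # one epoch = one pass over all attributes
        for a in names:
            old = weights[a]
            weights[a] = old + rng.uniform(-step, step)
            trial = score(data, weights)
            if trial > best:
                best = trial
            else:
                weights[a] = old  # reject the move
    return weights, best


# Experimental pLC50 values from Table 2 (a small illustrative subset).
DATA = [("O=[Zn]", 5.80), ("[Cu]=O", 4.24), ("O=[Ti]=O", 2.14),
        ("O=[Zn]^", 6.23), ("[Cu]=O^", 5.71), ("O=[Ti]=O^", 4.68)]

start = score(DATA, {a: 1.0 for s, _ in DATA for a in attributes(s)})
weights, final = optimise(DATA, n_epochs=20)
print(start, final)
```

Because a move is accepted only when the score improves, the final determination coefficient can never be lower than the starting one; the number of epochs plays the role of N in DCW(T,N).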

Table 3 contains the numerical data on the correlation weights of the different attributes involved in the modeling process. These are various chemical elements, represented traditionally by one symbol (e.g. ‘O’, ‘V’) or by two symbols (e.g. ‘La’, ‘Ni’). The symbol ‘=’ represents double bonds. The symbol ‘^’ represents the photo-inducing. The symbols ‘[’ and ‘]’ are used in classic SMILES for encoding a special group or a metal (Weininger, 1988, 1990; Weininger et al., 1989). Thus, all attributes have a transparent interpretation. The correlation

Table 2. The quasi-SMILES of metal oxide nanoparticles; distribution of available data into the “visible” training (t) and calibration (c) sets and the “invisible” validation set (v) for splits 1–6; experimental pLC50 values (LC50 in mol/L; Pathakoti et al., 2014) and pLC50 values calculated with Eqs. (3)–(8).

| Split 1 | Split 2 | Split 3 | Split 4 | Split 5 | Split 6 | Quasi-SMILES | pLC50 (exp.) | Eq. (3) | Eq. (4) | Eq. (5) | Eq. (6) | Eq. (7) | Eq. (8) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| v | t | c | t | t | c | O=[Zn] | 5.80 | 4.8787 | 5.6397 | 5.3001 | 5.8000 | 5.5619 | 5.4862 |
| t | c | v | t | t | t | [Cu]=O | 4.24 | 4.5261 | 4.9409 | 4.7720 | 4.5033 | 4.5950 | 4.5975 |
| t | t | c | t | c | t | O=[V]O[V]=O | 3.48 | 3.4218 | 3.4806 | 2.8323 | 3.1451 | 3.0063 | 3.2528 |
| c | t | t | c | t | t | O=[Y]O[Y]=O | 5.79 | 4.8413 | 5.7780 | 5.8049 | 4.8582 | 5.6924 | 5.4403 |
| t | c | c | t | t | t | O=[Bi]O[Bi]=O | 3.55 | 3.5020 | 3.2568 | 3.0963 | 3.3190 | 3.3510 | 3.5654 |
| t | t | t | t | v | c | O=[In]O[In]=O | 2.83 | 2.7830 | 2.7641 | 2.8287 | 2.8271 | 2.7086 | 2.1465 |
| t | c | t | t | c | t | O=[Sb]O[Sb]=O | 3.12 | 3.0677 | 2.9087 | 2.9420 | 3.1404 | 2.8854 | 3.0264 |
| t | v | c | v | v | v | O=[Al]O[Al]=O | 2.42 | 2.0447 | 1.9188 | 1.8091 | 1.8644 | 2.0027 | 2.0101 |
| t | t | t | c | t | v | O=[Fe]O[Fe]=O | 2.40 | 1.9340 | 2.1031 | 2.4093 | 1.5757 | 2.0115 | 2.1465 |
| c | t | v | t | t | t | O=[Si]=O | 2.54 | 1.9472 | 2.5408 | 1.9869 | 2.5464 | 2.2699 | 2.5329 |
| v | c | v | v | t | c | O=[Zr]=O | 2.58 | 1.9472 | 2.1160 | 1.8349 | 2.0905 | 2.4823 | 2.3042 |
| t | t | t | v | v | c | O=[Sn]=O | 2.53 | 2.3730 | 2.4975 | 2.4194 | 2.3363 | 2.4409 | 2.3042 |
| t | t | t | t | t | t | O=[Ti]=O | 2.14 | 2.9359 | 3.0362 | 2.9432 | 2.9262 | 2.9419 | 3.0468 |
| t | t | t | c | t | t | [Co]=O | 3.13 | 2.7212 | 2.8531 | 2.7777 | 2.3693 | 2.8427 | 2.8603 |
| t | v | t | t | c | t | [Ni]=O | 3.79 | 4.0297 | 2.3187 | 3.8110 | 3.7899 | 3.6350 | 3.4581 |
| v | c | c | t | c | c | O=[Cr]O[Cr]=O | 2.06 | 1.2274 | 1.2908 | 1.1302 | 1.5730 | 1.2519 | 1.2961 |
| t | t | t | v | t | v | O=[La]O[La]=O | 4.96 | 4.7716 | 4.8804 | 4.7911 | 4.6073 | 4.9361 | 4.8107 |
| t | t | t | c | c | t | O=[Zn]^ | 6.23 | 5.8662 | 6.4002 | 6.2253 | 6.7572 | 6.3998 | 6.2302 |
| t | t | t | t | t | t | [Cu]=O^ | 5.71 | 5.5136 | 5.7014 | 5.6972 | 5.4606 | 5.4329 | 5.3415 |
| c | c | t | t | t | t | O=[V]O[V]=O^ | 3.78 | 4.4093 | 4.2411 | 3.7575 | 4.1023 | 3.8443 | 3.9968 |
| t | v | c | t | v | t | O=[Y]O[Y]=O^ | 5.84 | 5.8288 | 6.5385 | 6.7301 | 5.8155 | 6.5304 | 6.1844 |
| c | t | t | t | t | c | O=[Bi]O[Bi]=O^ | 4.02 | 4.4895 | 4.0172 | 4.0215 | 4.2762 | 4.1890 | 4.3094 |
| t | t | v | v | t | c | O=[In]O[In]=O^ | 3.48 | 3.7705 | 3.5246 | 3.7539 | 3.7843 | 3.5465 | 2.8906 |
| v | t | t | c | t | t | O=[Sb]O[Sb]=O^ | 3.66 | 4.0552 | 3.6692 | 3.8672 | 4.0976 | 3.7233 | 3.7704 |
| t | v | t | v | t | t | O=[Al]O[Al]=O^ | 2.75 | 3.0322 | 2.6793 | 2.7343 | 2.8217 | 2.8407 | 2.7541 |
| t | t | v | t | t | v | O=[Fe]O[Fe]=O^ | 2.54 | 2.9215 | 2.8636 | 3.3345 | 2.5329 | 2.8494 | 2.8906 |
| c | v | t | c | t | v | O=[Si]=O^ | 2.92 | 2.9347 | 3.3013 | 2.9121 | 3.5036 | 3.1079 | 3.2770 |
| v | c | c | t | v | v | O=[Zr]=O^ | 3.04 | 2.9347 | 2.8765 | 2.7601 | 3.0477 | 3.3202 | 3.0483 |
| t | t | t | v | t | v | O=[Sn]=O^ | 3.24 | 3.3605 | 3.2579 | 3.3446 | 3.2935 | 3.2789 | 3.0483 |
| t | t | t | t | t | t | O=[Ti]=O^ | 4.68 | 3.9234 | 3.7966 | 3.8684 | 3.8834 | 3.7798 | 3.7908 |
| t | t | t | t | v | t | [Co]=O^ | 3.33 | 3.7087 | 3.6136 | 3.7029 | 3.3265 | 3.6807 | 3.6043 |
| v | v | v | c | t | t | [Ni]=O^ | 3.87 | 5.0172 | 3.0792 | 4.7362 | 4.7471 | 4.4729 | 4.2021 |
| t | t | t | t | t | t | O=[Cr]O[Cr]=O^ | 2.06 | 2.2149 | 2.0513 | 2.0554 | 2.5302 | 2.0898 | 2.0401 |
| t | t | t | t | c | t | O=[La]O[La]=O^ | 5.56 | 5.7591 | 5.6409 | 5.7163 | 5.5645 | 5.7741 | 5.5547 |


weights of blocked attributes are equal to 0.0, i.e. these attributes have no influence on the model. Table 4 contains an example of the calculation of DCW(T*,N*) and pLC50.

3. Results and discussion

The search for T* and N* has been carried out in the ranges (i) T from 1 to 3; and (ii) N from 1 to 20 (Toropova et al., 2011). The developed models are the following:

Split 1

$$\mathrm{pLC50} = 2.4451(\pm 0.0246) + 0.7074(\pm 0.0089) \cdot \mathrm{DCW}(1, 7) \qquad (3)$$

Split 2

$$\mathrm{pLC50} = 2.5185(\pm 0.0230) + 0.8969(\pm 0.0071) \cdot \mathrm{DCW}(1, 12) \qquad (4)$$

Split 3

$$\mathrm{pLC50} = 2.4390(\pm 0.0204) + 0.7537(\pm 0.0050) \cdot \mathrm{DCW}(1, 12) \qquad (5)$$

Split 4

$$\mathrm{pLC50} = 3.0378(\pm 0.0246) + 0.7113(\pm 0.0101) \cdot \mathrm{DCW}(1, 11) \qquad (6)$$

Split 5

$$\mathrm{pLC50} = 1.5185(\pm 0.0334) + 0.8370(\pm 0.0110) \cdot \mathrm{DCW}(1, 9) \qquad (7)$$

Split 6

$$\mathrm{pLC50} = 2.9489(\pm 0.0252) + 0.7092(\pm 0.0095) \cdot \mathrm{DCW}(1, 11) \qquad (8)$$

Table 5 contains the preferable values of N* together with the statistical quality of the models for the six random splits. The preferable threshold for all models is T* = 1. For the six splits, the statistical quality of the models calculated with the described approach is as follows: the standard error of estimation for the training set is in the range 0.293–0.370, and the standard error of estimation for the validation set (n = 6 or 7; these metal oxide nanoparticles are not involved in building the model) is in the range 0.367–0.858. The models suggested in the literature (Pathakoti et al., 2014), built separately for dark cytotoxicity and photo-induced cytotoxicity, are characterized for the validation set (n = 4) by s = 0.52 and s = 0.88, respectively. Thus, the statistical quality of predictions with Eqs. (3)–(8) is comparable with that of the above-mentioned models, which are based on quantum mechanics descriptors (Pathakoti et al., 2014). Table 3 contains the correlation weights for calculations with Eqs. (3)–(8). Fig. 1 graphically represents the models calculated with Eqs. (3)–(8).

Having data on the correlation weights obtained in several runs of the Monte Carlo optimization with the preliminarily defined T* and N*, one can select four classes of attributes: (i) stable promoters of pLC50 increase, i.e. all runs give positive correlation weights; (ii) stable promoters of pLC50 decrease, i.e. all runs result in

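Table 3. Correlation weights for the calculation of DCW(T*,N*).

| Equation | Ak | CW(Ak) | Frequency in training set | Frequency in calibration set |
|---|---|---|---|---|
| Eq. (3) | = | −0.14674 | 23 | 5 |
| | Al | 0.17345 | 2 | 0 |
| | Bi | 1.20345 | 1 | 1 |
| | Co | 0.74595 | 2 | 0 |
| | Cr | −0.40424 | 1 | 0 |
| | Cu | 3.29735 | 2 | 0 |
| | Fe | 0.09521 | 2 | 0 |
| | O | −0.20138 | 23 | 5 |
| | In | 0.69530 | 2 | 0 |
| | La | 2.10083 | 2 | 0 |
| | Ni | 2.59570 | 1 | 0 |
| | V | 1.14679 | 1 | 1 |
| | Sb | 0.89652 | 1 | 0 |
| | Si | 0.0 | 0 | 2 |
| | Y | 2.15007 | 1 | 1 |
| | Sn | 0.60191 | 2 | 0 |
| | Ti | 1.39757 | 2 | 0 |
| | [ | −0.00385 | 23 | 5 |
| | ^ | 1.39594 | 11 | 3 |
| | Zn | 3.79580 | 1 | 0 |
| Eq. (4) | = | −0.00295 | 21 | 7 |
| | Bi | 0.74583 | 1 | 1 |
| | Co | 0.59579 | 2 | 0 |
| | Cr | −0.35010 | 1 | 1 |
| | Cu | 2.92349 | 1 | 1 |
| | Fe | 0.10270 | 2 | 0 |
| | O | −0.22305 | 21 | 7 |
| | In | 0.47120 | 2 | 0 |
| | La | 1.65091 | 2 | 0 |
| | V | 0.87061 | 1 | 1 |
| | Sb | 0.55181 | 1 | 1 |
| | Si | 0.47366 | 1 | 0 |
| | Y | 2.15128 | 1 | 0 |
| | Sn | 0.42528 | 2 | 0 |
| | Ti | 1.02591 | 2 | 0 |
| | [ | 0.00163 | 21 | 7 |
| | ^ | 0.84785 | 11 | 2 |
| | Zn | 3.70262 | 2 | 0 |
| | Zr | 0.0 | 0 | 2 |
| Eq. (5) | = | −0.14512 | 21 | 7 |
| | Al | 0.12461 | 1 | 1 |
| | Bi | 0.97850 | 1 | 1 |
| | Co | 0.87786 | 2 | 0 |
| | Cr | −0.32574 | 1 | 1 |
| | Cu | 3.52377 | 1 | 0 |
| | Fe | 0.52276 | 1 | 0 |
| | O | −0.22784 | 21 | 7 |
| | In | 0.80095 | 1 | 0 |
| | La | 2.10276 | 2 | 0 |
| | Ni | 2.24871 | 1 | 0 |
| | V | 0.80333 | 1 | 1 |
| | Sb | 0.87613 | 2 | 0 |
| | Si | 0.20168 | 1 | 0 |
| | Y | 2.77526 | 1 | 1 |
| | Sn | 0.77543 | 2 | 0 |
| | Ti | 1.47037 | 2 | 0 |
| | [ | −0.02781 | 21 | 7 |
| | ^ | 1.22745 | 12 | 2 |
| | Zn | 4.22435 | 1 | 1 |
| | Zr | 0.0 | 0 | 1 |
| Eq. (6) | = | −0.07510 | 20 | 7 |
| | Bi | 1.02239 | 2 | 0 |
| | Co | −0.20142 | 1 | 1 |
| | Cr | −0.20489 | 2 | 0 |
| | Cu | 2.79865 | 2 | 0 |
| | Fe | −0.20298 | 1 | 1 |
| | O | −0.17273 | 20 | 7 |
| | In | 0.67663 | 1 | 0 |
| | La | 1.92798 | 1 | 0 |
| | Ni | 1.79562 | 1 | 1 |
| | V | 0.90015 | 2 | 0 |
| | Sb | 0.89686 | 1 | 1 |
| | Si | 0.29539 | 1 | 1 |
| | Y | 2.10436 | 1 | 1 |
| | Ti | 0.82928 | 2 | 0 |
| | [ | −0.24530 | 20 | 7 |
| | ^ | 1.34570 | 10 | 4 |
| | Zn | 4.62152 | 1 | 1 |
| | Zr | −0.34559 | 1 | 0 |
| Eq. (7) | = | 0.10158 | 22 | 6 |
| | Al | 0.19940 | 1 | 0 |
| | Bi | 1.00478 | 2 | 0 |
| | Co | 1.35252 | 1 | 0 |
| | Cr | −0.24908 | 1 | 1 |
| | Cu | 3.44583 | 2 | 0 |
| | Fe | 0.20463 | 2 | 0 |
| | O | −0.27911 | 22 | 6 |
| | In | 0.62103 | 1 | 0 |
| | La | 1.95157 | 1 | 1 |
| | Ni | 2.29898 | 1 | 1 |
| | V | 0.79887 | 1 | 1 |
| | Sb | 0.72663 | 1 | 1 |
| | Si | 0.84577 | 2 | 0 |
| | Y | 2.40333 | 1 | 0 |
| | Sn | 1.05005 | 1 | 0 |
| | Ti | 1.64851 | 2 | 0 |
| | [ | 0.20343 | 22 | 6 |
| | ^ | 1.00105 | 12 | 2 |
| | Zn | 4.60092 | 1 | 1 |
| | Zr | 1.09949 | 1 | 0 |
| Eq. (8) | = | −0.27984 | 20 | 7 |
| | Al | −0.09617 | 1 | 0 |
| | Bi | 1.00027 | 1 | 1 |
| | Co | 0.37719 | 2 | 0 |
| | Cr | −0.59955 | 1 | 1 |
| | Cu | 2.82661 | 2 | 0 |
| | O | −0.12700 | 20 | 7 |
| | In | 0.0 | 0 | 2 |
| | La | 1.87818 | 1 | 0 |
| | Ni | 1.22003 | 2 | 0 |
| | V | 0.77992 | 2 | 0 |
| | Sb | 0.62029 | 2 | 0 |
| | Si | 0.32248 | 1 | 0 |
| | Y | 2.32210 | 2 | 0 |
| | Sn | 0.0 | 0 | 1 |
| | Ti | 1.04695 | 2 | 0 |
| | [ | −0.04768 | 20 | 7 |
| | ^ | 1.04907 | 11 | 2 |
| | Zn | 4.07959 | 1 | 1 |
| | Zr | 0.0 | 0 | 1 |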

Table 4. Example of the calculation of DCW(1,7) for Eq. (3). The representation of the metal oxide nanoparticle is [Cu]=O; DCW(1,7) = ΣCW(Ak) = 2.94152; pLC50 = 2.4451 + 0.7074 × 2.94152 = 4.5261.

| Ak | CW(Ak) |
|---|---|
| [ | −0.0039 |
| Cu | 3.2973 |
| [ | −0.0039 |
| = | −0.1467 |
| O | −0.2014 |
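The calculation in Table 4 can be reproduced with a short Python sketch. The correlation weights and the coefficients of Eq. (3) are taken from Table 3 and the text; the function names and the tokenizing regular expression are illustrative assumptions, as is the treatment of ‘]’ as the attribute ‘[’ (suggested by Table 4, which lists ‘[’ twice for [Cu]=O). Attributes without a listed weight are treated as blocked, i.e. CW = 0.0.

```python
import re

# Correlation weights for Eq. (3), split 1 (Table 3).
CW_EQ3 = {
    "=": -0.14674, "Al": 0.17345, "Bi": 1.20345, "Co": 0.74595, "Cr": -0.40424,
    "Cu": 3.29735, "Fe": 0.09521, "O": -0.20138, "In": 0.69530, "La": 2.10083,
    "Ni": 2.59570, "V": 1.14679, "Sb": 0.89652, "Si": 0.0, "Y": 2.15007,
    "Sn": 0.60191, "Ti": 1.39757, "[": -0.00385, "^": 1.39594, "Zn": 3.79580,
}
C0, C1 = 2.4451, 0.7074  # intercept and slope of Eq. (3)

# Two-letter element symbols are matched before single symbols.
TOKEN = re.compile(r"[A-Z][a-z]?|[\[\]=^]")


def attributes(quasi_smiles):
    # Assumption: ']' is counted as the attribute '[' (see Table 4).
    return ["[" if t == "]" else t for t in TOKEN.findall(quasi_smiles)]


def dcw(quasi_smiles, weights=CW_EQ3):
    # Eq. (1); attributes missing from the table are blocked (CW = 0.0).
    return sum(weights.get(a, 0.0) for a in attributes(quasi_smiles))


def predict_plc50(quasi_smiles):
    # Eq. (3): pLC50 = C0 + C1 * DCW(1, 7)
    return C0 + C1 * dcw(quasi_smiles)


for qs in ("[Cu]=O", "O=[Zn]", "O=[Zn]^"):
    print(qs, round(dcw(qs), 5), round(predict_plc50(qs), 4))
```

The printed values agree with the Eq. (3) column of Table 2 to about three decimal places; small differences in the last digit presumably come from the rounding of the published weights and coefficients.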


negative correlation weights; (iii) attributes with an unclear role, i.e. different runs give both positive and negative correlation weights; and (iv) blocked attributes.

Table 5. Statistical quality of models for pLC50 calculated with various distributions (1–6) of available data into the training, calibration, and validation sets. The threshold T* = 1 for all models.

| No. | N* | Training n a | Training r² | Training q² | Training s | Training F | Calibration n | Calibration r² | Calibration cRp² | Calibration k | Calibration k′ | Calibration s | Validation n | Validation r² | Validation s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 23 | 0.9250 | 0.9115 | 0.347 | 259 | 5 | 0.7279 | 0.62 | 0.96 | 1.01 | 0.683 | 6 | 0.7332 | 0.828 |
| 2 | 12 | 21 | 0.9464 | 0.9384 | 0.317 | 335 | 7 | 0.9737 | 0.89 | 0.99 | 0.98 | 0.527 | 6 | 0.7905 | 0.858 |
| 3 | 12 | 21 | 0.9469 | 0.9396 | 0.293 | 338 | 7 | 0.9527 | 0.84 | 0.95 | 1.03 | 0.705 | 6 | 0.8078 | 0.721 |
| 4 | 11 | 20 | 0.9276 | 0.9127 | 0.339 | 231 | 7 | 0.7921 | 0.74 | 0.97 | 1.00 | 0.786 | 7 | 0.8965 | 0.367 |
| 5 b | 9 | 22 | 0.9081 | 0.8925 | 0.354 | 198 | 6 | 0.9943 | 0.91 | 0.98 | 1.01 | 0.454 | 6 | 0.9835 | 0.418 |
| 6 | 11 | 20 | 0.9160 | 0.9006 | 0.370 | 196 | 7 | 0.9473 | 0.87 | 0.91 | 1.08 | 0.533 | 7 | 0.8961 | 0.300 |

a n is the number of metal oxide nanoparticles in the set (i.e. training, calibration, or validation); r² is the determination coefficient; q² is the cross-validated r²; s is the standard error of estimation; and F is the Fisher F-ratio. cRp² is the criterion of Y-randomization (Ojha and Roy, 2011; Veselinović et al., 2013a,b): a model is not a chance correlation if cRp² is larger than 0.5. k and k′ are criteria of the predictability of a model: both should be close to 1 (Golbraikh and Tropsha, 2002; Melagraki and Afantitis, 2013; Veselinović et al., 2013a,b).

b The best model (marked in bold in the original table).

Fig. 1. Graphical representation of models calculated with Eqs. (3)–(8).


The analysis of these data for the six splits has shown that the stable promoters of pLC50 increase are photo-inducing (^), vanadium (V), and yttrium (Y). The stable promoters of pLC50 decrease are oxygen (O) and the presence of a double bond (=). A more detailed assessment of the activities of the studied metals is impossible, since each split leads to a specific role for each metal that depends on the distribution of the other metals. In other words, in order to estimate the functionality of a metal for the endpoint, the metal should occur in a group of various nanoparticles. Hence, the data examined in this work do not allow such an estimation. However, the attributes that are common to all six splits allow a mechanistic interpretation of the results of the described approach.

The statistical quality of the attributes involved in building the model can be estimated as follows:

$$\mathrm{defect}(A_k) = \begin{cases} \dfrac{\left| P_{TRN}(A_k) - P_{CLB}(A_k) \right|}{N_{TRN}(A_k) + N_{CLB}(A_k)}, & \text{if } N_{CLB}(A_k) > 0 \\[2ex] 1, & \text{otherwise} \end{cases} \qquad (9)$$

where $P_{TRN}(A_k)$ is the probability of the presence of $A_k$ in the SMILES of the training set, i.e.

$$P_{TRN}(A_k) = N_{TRN}(A_k) / N_{TRN}$$

and $P_{CLB}(A_k)$ is the probability of the presence of $A_k$ in the SMILES of the calibration set, i.e.

$$P_{CLB}(A_k) = N_{CLB}(A_k) / N_{CLB}$$

$N_{TRN}(A_k)$ is the number (frequency) of SMILES in the training set that contain $A_k$; $N_{TRN}$ is the total number of SMILES in the training set; $N_{CLB}(A_k)$ is the number (frequency) of SMILES in the calibration set that contain $A_k$ (Table 3); and $N_{CLB}$ is the total number of SMILES in the calibration set.
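As an illustration of Eq. (9), the following Python sketch evaluates the defect for a few attributes of split 1, using the frequencies listed for Eq. (3) in Table 3 and the set sizes from Table 5 (23 training and 5 calibration quasi-SMILES). The function name and argument names are hypothetical.

```python
def defect(n_trn_a, n_clb_a, n_trn, n_clb):
    """Eq. (9): defect of an attribute from its training/calibration frequencies."""
    if n_clb_a <= 0:
        return 1.0  # attribute absent from the calibration set: maximal defect
    p_trn = n_trn_a / n_trn  # probability of the attribute in the training set
    p_clb = n_clb_a / n_clb  # probability of the attribute in the calibration set
    return abs(p_trn - p_clb) / (n_trn_a + n_clb_a)


N_TRN, N_CLB = 23, 5  # split 1 (Table 5)

# (frequency in training set, frequency in calibration set) from Table 3, Eq. (3)
frequencies = {"O": (23, 5), "^": (11, 3), "Bi": (1, 1), "Zn": (1, 0)}

for name, (f_trn, f_clb) in frequencies.items():
    print(name, round(defect(f_trn, f_clb, N_TRN, N_CLB), 4))
```

In this example, ‘O’ occurs in every quasi-SMILES of both sets and so has zero defect, whereas ‘Zn’ is absent from the calibration set and receives the maximal value of 1.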

3.1. The logic

If the probability of an attribute in the training set is equal to the probability of the attribute in the calibration set, the situation is ideal and the defect is zero. However, this situation is not typical, i.e. the difference between the probability of an attribute in the training set and its probability in the calibration set is usually not zero. Under such circumstances, the frequency of the attribute in the training set and in the calibration set should also be taken into account: if these frequencies are small, the defect of the attribute must be larger. Finally, if $A_k$ is absent from the calibration set, defect($A_k$) is maximal. Thus, the measure calculated with Eq. (9) can be used to estimate the statistical significance of the $A_k$ (Table 3) involved in building the model.

3.2. The criterion definition of domain of applicability for a quasi-SMILES

Having the numerical data on defect($A_k$), one can estimate the reliability of the model for a metal oxide nanoparticle represented by a quasi-SMILES (Table 2). The basic hypothesis is that “the probability of the quasi-SMILES being in the domain of applicability is inversely proportional to the sum of the $A_k$ defects”:

$$\text{Defect-quasi-SMILES} = \sum_{k} \mathrm{defect}(A_k) \qquad (10)$$

If the Defect-quasi-SMILES calculated with Eq. (10) is equal to zero, the situation is ideal. In practice, however, the ideal situation is rare. Consequently, one should define some limit for the Defect-quasi-SMILES value. A possible choice for the limit is the following:

$$\text{Defect-quasi-SMILES} < 2 \cdot \overline{\text{Defect-quasi-SMILES}} \qquad (11)$$

where $\overline{\text{Defect-quasi-SMILES}}$ is the average of the Defect-quasi-SMILES over the training set.
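A minimal sketch of the criterion in Eqs. (10) and (11) is given below. All names and numerical values are hypothetical and serve only to show the mechanics; in particular, summing the defect once per occurrence of an attribute and assigning the maximal defect of 1.0 to attributes without a tabulated value are assumptions, and the calculations reported in this work were carried out with the CORAL software rather than with this code.

```python
from statistics import mean


def defect_quasi_smiles(attrs, defects):
    # Eq. (10): sum of attribute defects; unknown attributes get 1.0 (assumption).
    return sum(defects.get(a, 1.0) for a in attrs)


def in_domain(value, training_values):
    # Eq. (11): the value must be below twice the training-set average.
    return value < 2.0 * mean(training_values)


# Hypothetical attribute defects and attribute lists of three training strings.
DEFECTS = {"A": 0.0, "B": 0.05, "C": 0.2}
TRAINING = [["A", "B"], ["A", "C"], ["A", "B", "B"]]

training_values = [defect_quasi_smiles(attrs, DEFECTS) for attrs in TRAINING]

for query in (["A", "C"], ["C", "C"], ["A", "X"]):
    value = defect_quasi_smiles(query, DEFECTS)
    print(query, value, in_domain(value, training_values))
```

Here the first query falls inside the domain of applicability, while the second (two high-defect attributes) and the third (an attribute never seen before) fall outside it.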

Inequality (11) should be classified as a semi-qualitative criterion, because a large value of the Defect-quasi-SMILES does not guarantee that the prediction for the substance represented by the quasi-SMILES will be poor and, vice versa, a small value of the Defect-quasi-SMILES does not guarantee that the prediction will be good. However, the “probabilistic” meaning of this criterion is quite transparent.

The calculations with Eq. (11) were carried out with the CORAL software (CORAL, 2014). The percentage of the domain of applicability, according to the analysis carried out by this software, is 100%, 76%, 76%, 71%, 71%, and 71% for splits 1, 2, 3, 4, 5, and 6, respectively. It is traditional to take 50% as the threshold for a quality that can reach 100%. Consequently, a split that is characterized by a domain of applicability of more than 50% should be considered satisfactory. Thus, all six examined splits can be regarded as “satisfactory splits”.

This work is a theoretical one: we try to answer the question of whether it is possible to prepare this kind of model or not. Table 5 indicates that all models are more or less satisfactory. The next question is how to extract the most reliable prediction for an external, unknown metal oxide nanoparticle. On the one hand, we believe that the statistical characteristics of the external validation set are a more important criterion for assessing a model than the statistical characteristics of the “visible” training set. This view leads to the selection of the model calculated with Eq. (7) for split 5 (Table 5). On the other hand, we believe that for the practical prediction of the endpoint for an external, unknown metal oxide nanoparticle, the preferable estimate is the average over all six predictions made with Eqs. (3)–(8).
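The averaging over the six models can be illustrated with the calculated values that Table 2 lists for O=[Zn]. This is only a sketch of the arithmetic: the names are hypothetical, and this particular nanoparticle was part of the training or calibration set in most splits, so it is not an external substance.

```python
from statistics import mean

# pLC50 values calculated with Eqs. (3)-(8) for O=[Zn] (Table 2).
PREDICTIONS = {
    "O=[Zn]": [4.8787, 5.6397, 5.3001, 5.8000, 5.5619, 5.4862],
}


def consensus_plc50(quasi_smiles, predictions=PREDICTIONS):
    # Average over the predictions of the six split-specific models.
    return mean(predictions[quasi_smiles])


print(round(consensus_plc50("O=[Zn]"), 4))
```

For comparison, the experimental pLC50 reported for this nanoparticle in Table 2 is 5.80.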

4. Conclusions

The quasi-QSAR approach was used in this work. Since these quasi-QSARs are oriented to metal oxide nanoparticles, they can be named nano-QSARs. Optimal descriptors calculated with the eclectic data represented by quasi-SMILES (i.e. atomic composition and presence or absence of photo-inducing) give statistically robust models for the cytotoxicity of metal oxide nanoparticles (Table 5). However, the distribution of the data into the training, calibration, and validation sets has a significant influence upon the predictive potential of these models. The development of the described models was carried out in accordance with the OECD principles (OECD, 2007). A probabilistic approach is suggested for defining the domain of applicability of the nano-QSAR in accordance with the distribution of the available data into the “visible” training set and the external “invisible” validation set.

Acknowledgments

We thank the EC project PreNanoTox (Contract 309666), the EC project NanoPUZZLES (Project Reference: 309837), the EU project PROSIL funded under the LIFE program (project LIFE12 ENV/IT/000154), the National Science Foundation (NSF/CREST HRD-0833178), and EPSCoR (Award #: 362492-190200-01/NSFEPS-090378) for financial support. We also express our gratitude to Dr. L. Cappellini, Dr. G. Bianchi and Dr. R. Bagnati for valuable consultations on computer science.


References

ACD/I-LAB, ⟨http://www.acdlabs.com⟩, 2014.Afantitis, A., Melagraki, G., Koutentis, P.A., Srimveis, H., Kollias, G., 2011. Ligand-

based virtual screening procedure for the prediction and the identification ofnovel b-amyloid aggregation inhibitors using Kohonen maps and Counter-propagation Artificial Networks. Eur. J. Med. Chem. 46, 497–508.

Balaban, A.T., Khadikar, P.V., Supuran, G.T., Thakur, A., Thakur, M., 2005. Study onsupramolecular complexing ability vis-à-vis estimation of pKa of substitutedsulfonamides: dominating role of Balaban index. Bioorg. Med. Chem. 15,3966–3973.

Bhhatarai, B., Gang, R., Gramatica, P., 2010. Are mechanistic and statistical QSAR approaches really different? MLR studies on 158 cycloalkyl-pyranones. Mol. Inf. 29, 511–522.

Cohen, Y., Rallo, R., Liu, R., Liu, H.H., 2013. In silico analysis of nanomaterials hazard and risk. Acc. Chem. Res. 46, 802–812.

CORAL, ⟨http://www.insilico.eu/coral⟩, 2014.

Cosentino, U., Moro, G., Bonalumi, D., Bonati, L., Lasagni, M., Todeschini, R., Pitea, D., 2000. A combined use of global and local approaches in 3D-QSAR. Chemom. Intell. Lab. Syst. 52, 183–194.

Das, K.Ch., Trinajstic, N., 2010. Comparison between first geometric-arithmetic index and atom-bond connectivity index. Chem. Phys. Lett. 497, 140–151.

Duchowicz, P.R., Mirifico, M.V., Rozas, M.F., Caram, J.A., Fernandes, F.M., Castro, E.A., 2011. Quantitative structure–spectral property relationships for functional groups of novel 1,2,5-thiadiazole compounds. Chemom. Intell. Lab. Syst. 105, 27–37.

Fourches, D., Pu, D., Tassa, C., Weissleder, R., Shaw, S.Y., Mumper, R.J., Tropsha, A., 2010. Quantitative nanostructure–activity relationship modelling. ACS Nano 4, 5703–5712.

Furtula, B., Gutman, I., 2011. Relation between second and third geometric-arithmetic indices of trees. J. Chemom. 25, 87–91.

Golbraikh, A., Tropsha, A., 2002. Beware of q2! J. Mol. Graph. Model. 20, 269–276.

Ivanciuc, T., Ivanciuc, O., Klein, D.J., 2006. Modeling the bioconcentration factors and bioaccumulation factors of polychlorinated biphenyls with posetic quantitative super-structure/activity relationships (QSSAR). Mol. Divers. 10, 133–145.

Leszczynski, J., 2010. Nano meets bio at the interface. Nat. Nanotechnol. 5, 633–634.

Liu, R., Rallo, R., Weissleder, R., Tassa, C., Shaw, S., Cohen, Y., 2013. Nano-SAR development for bioactivity of nanoparticles with considerations of decision boundaries. Small 9, 1842–1852.

Melagraki, G., Afantitis, A., 2013. Enalos KNIME nodes: exploring corrosion inhibition of steel in acidic medium. Chemom. Intell. Lab. Syst. 123, 9–14.

Mitra, I., Saha, A., Roy, K., 2010. Exploring quantitative structure–activity relationship studies of antioxidant phenolic compounds obtained from traditional Chinese medicinal plants. Mol. Simul. 13, 1067–1079.

OECD, 2007. Guidance Document on the Validation of (Quantitative) Structure–Activity Relationships [(Q)SARs] Models, ENV/JM/MONO(2007)2. ⟨http://www.oecd.org/dataoecd/55/35/38130292.pdf⟩.

Ojha, P.K., Roy, K., 2011. Comparative QSARs for antimalarial endochins: Importance of descriptor-thinning and noise reduction prior to feature selection. Chemom. Intell. Lab. Syst. 109, 146–161.

Pathakoti, K., Huang, M.-J., Watts, J.D., He, X., Hwang, H.-M., 2014. Using experimental data of Escherichia coli to develop a QSAR model for predicting the photo-induced cytotoxicity of metal oxide nanoparticles. J. Photochem. Photobiol. A 130, 234–240.

Petrova, T., Rasulev, B.F., Toropov, A.A., Leszczynska, D., Leszczynski, J., 2011. Improved model for fullerene C60 solubility in organic solvents based on quantum-chemical and topological descriptors. J. Nanopart. Res. 13, 3235–3247.

Randic, M., 1991. Novel graph theoretical approach to heteroatoms in quantitative structure–activity relationships. Chemom. Intell. Lab. Syst. 10, 213–227.

Tetko, I.V., Jaroszewicz, I., Platts, J.A., Kuduk-Jaworska, J., 2008. Calculation of lipophilicity for Pt(II) complexes: experimental comparison of several methods. J. Inorg. Biochem. 102, 1224–1237.

Toropov, A.A., Leszczynska, D., Leszczynski, J., 2007. Predicting thermal conductivity of nanomaterials by correlation weighting technological attributes codes. Mater. Lett. 61, 4777–4780.

Toropov, A.A., Toropova, A.P., Benfenati, E., Leszczynska, D., Leszczynski, J., 2010a. SMILES-based optimal descriptors: QSAR analysis of fullerene-based HIV-1PR inhibitors by means of balance of correlations. J. Comput. Chem. 31, 381–392.

Toropov, A.A., Toropova, A.P., Benfenati, E., Leszczynska, D., Leszczynski, J., 2010b. InChI-based optimal descriptors: QSAR analysis of fullerene[C60]-based HIV-1PR inhibitors by correlation balance. Eur. J. Med. Chem. 45, 1387–1394.

Toropov, A.A., Toropova, A.P., Benfenati, E., Gini, G., Puzyn, T., Leszczynska, D., Leszczynski, J., 2012c. Novel application of the CORAL software to model cytotoxicity of metal oxide nanoparticles to bacteria Escherichia coli. Chemosphere 89, 1098–1102.

Toropov, A.A., Toropova, A.P., Puzyn, T., Benfenati, E., Gini, G., Leszczynska, D., Leszczynski, J., 2013a. QSAR as a random event: modeling of nanoparticles uptake in PaCa2 cancer cells. Chemosphere 92, 31–37.

Toropov, A.A., Toropova, A.P., Benfenati, E., Gini, G., Leszczynska, D., Leszczynski, J., 2013b. CORAL: QSPR model of water solubility based on local and global SMILES attributes. Chemosphere 90, 877–880.

Toropov, A.A., Toropova, A.P., 2014. Optimal descriptor as a translator of eclectic data into endpoint prediction: mutagenicity of fullerene as a mathematical function of conditions. Chemosphere 104, 262–264.

Toropova, A.P., Toropov, A.A., Benfenati, E., Leszczynska, D., Leszczynski, J., 2010. QSAR modeling of measured binding affinity for fullerene-based HIV-1PR inhibitors by CORAL. J. Math. Chem. 47, 959–987.

Toropova, A.P., Toropov, A.A., Benfenati, E., Gini, G., Leszczynska, D., Leszczynski, J., 2011. CORAL: quantitative structure–activity relationship models for estimating toxicity of organic compounds in rats. J. Comput. Chem. 32, 2727–2733.

Toropova, A.P., Toropov, A.A., Rasulev, B.F., Benfenati, E., Gini, G., Leszczynska, D., Leszczynski, J., 2012a. QSAR models for ACE-inhibitor activity of tri-peptides based on representation of the molecular structure by graph of atomic orbitals and SMILES. Struct. Chem. 23, 1873–1878.

Toropova, A.P., Toropov, A.A., Martyanov, S.E., Benfenati, E., Gini, G., Leszczynska, D., Leszczynski, J., 2012b. CORAL: QSAR modeling of toxicity of organic chemicals towards Daphnia magna. Chemom. Intell. Lab. Syst. 110, 177–181.

Toropova, A.P., Toropov, A.A., Puzyn, T., Benfenati, E., Leszczynska, D., Leszczynski, J., 2013. Optimal descriptor as a translator of eclectic information into the prediction of thermal conductivity of micro-electro-mechanical systems. J. Math. Chem. 51, 2230–2237.

Toropova, A.P., Toropov, A.A., 2013. Optimal descriptor as a translator of eclectic information into the prediction of membrane damage by means of various TiO2 nanoparticles. Chemosphere 93, 2650–2655.

Toropova, A.P., Toropov, A.A., 2014. CORAL software: prediction of carcinogenicity of drugs by means of the Monte Carlo method. Eur. J. Pharm. Sci. 52, 21–25.

Veselinović, A.M., Milosavljević, J.B., Toropov, A.A., Nikolić, G.M., 2013a. SMILES-based QSAR model for arylpiperazines as high-affinity 5-HT1A receptor ligands using CORAL. Eur. J. Pharm. Sci. 48, 532–541.

Veselinović, A.M., Milosavljević, J.B., Toropov, A.A., Nikolić, G.M., 2013b. SMILES-based QSAR models for the calcium channel-antagonistic effect of 1,4-dihydropyridines. Arch. Pharm. 346, 134–139.

Weininger, D., 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36.

Weininger, D., Weininger, A., Weininger, J.L., 1989. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101.

Weininger, D., 1990. SMILES. 3. DEPICT. Graphical depiction of chemical structures. J. Chem. Inf. Comput. Sci. 30, 237–243.