Radial basis function neural network-based QSPR for the prediction of critical temperature



Computers and Chemistry 26 (2002) 159–169

Radial basis function neural network based QSPR for the prediction of critical pressures of substituted benzenes

Xiaojun Yao a, Xiaoyun Zhang a, Ruisheng Zhang a, Mancang Liu a, Zhide Hu a,*, Botao Fan b

a Department of Chemistry, Lanzhou University, Lanzhou 730000, China
b Université Paris 7-Denis Diderot, ITODYS 1, Rue Guy de la Brosse, 75005 Paris, France

Received 6 April 2001; received in revised form 2 May 2001; accepted 5 June 2001

Abstract

The Quantitative Structure–Property Relationship (QSPR) method is used to develop the correlation between structures of a great number of substituted benzenes and their critical pressure. Molecular descriptors calculated from structure alone were used to represent molecular structures. A subset of the calculated descriptors selected using forward stepwise regression was used in the QSPR model development. Multiple Linear Regression and Radial Basis Function Neural Networks are utilized to construct the linear and non-linear prediction model, respectively. To obtain good prediction ability, both topological structure and training parameters of radial basis function neural networks are optimized. The prediction result agrees well with the experimental value of these properties. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Radial basis function neural networks; QSPR; Molecular descriptor; Critical pressure

www.elsevier.com/locate/compchem

1. Introduction

Critical pressure is an important fundamental physical property of organic compounds. It is of great importance in chemical engineering, especially in high-pressure conditions and supercritical extraction. This property is also needed in the equation of state for the calculation of thermodynamic properties, which is of special value in the engineering design process. Considering the importance of this property, it would be useful to develop relationships between the critical pressure and the structural characteristics of a great number of compounds in order to predict the critical property of new compounds without experimentation. There are many previous studies that aimed at predicting critical properties. Of them, the most promising method is to use QSPR, which uses descriptors derived from molecular structure alone to represent the character of the molecule. The advantage of this approach over other methods lies in the fact that the descriptors used can be calculated from structure alone and are not dependent on any experimental properties. Once the structure of a compound is known, any descriptor can be calculated, no matter whether the compound has been synthesized or not. So once a reliable model is established, we can use this method to predict the property of compounds that we are going to synthesize.

Quantitative Structure–Property Relationships (QSPR) quantify the correlation between the structure of a compound and its physico-chemical property of interest. They have been widely used to correlate the physico-chemical property of a compound with its structure. The main steps involved in QSPR include: data collection, molecular geometry optimization,

* Corresponding author. Tel.: +86-931-891-2578; fax: +86-931-891-2582.

E-mail address: [email protected] (Z. Hu).

0097-8485/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.

PII: S0097-8485(01)00093-6


molecular descriptor generation, descriptor selection, model development and finally model performance evaluation. This study can tell us which of the structural factors may play an important role in the determination of a property and it can also develop a method for the prediction of the property of new compounds that have not yet been synthesized.

To develop a QSPR, molecules must be described using molecular structural descriptors that retain as much structural information as possible. Various empirical physicochemical parameters or non-empirical structural descriptors can be used for this purpose. In this paper, we use descriptors that can be derived from molecular structure alone. The molecular descriptors used include constitutional, topological, empirical and quantum chemical descriptors.

After the calculation of molecular descriptors, traditional QSPR often uses a linear method, such as multiple linear regression (MLR), principal component regression (PCR) and partial least squares (PLS), in the development of a mathematical relationship between the structural descriptors and the property. New reports have demonstrated the ability of neural networks to correlate various physico-chemical properties with theoretical descriptors, such as: boiling points (Egolf et al., 1994; Hall and Story, 1996), LogP (Breindl et al., 1997; Beck et al., 2000), gas chromatography retention indices (Sutter et al., 1997; Pompe et al., 1997), vapor pressure (Beck et al., 2000) and critical property (Egolf et al., 1994; Hall and Story, 1996). Neural networks are particularly useful in cases where it is difficult to specify an exact mathematical model that describes a specific structure–property relationship (Zupan and Gasteiger, 1991; Gasteiger and Zupan, 1993). There exist many models of neural networks, which have different approaches both in architecture and in learning algorithms. The most often used multi-layered feedforward network trained by the back-propagation learning algorithm has some disadvantages (Walczak and Massart, 2000), such as local minima, slow convergence, time-consuming non-linear iterative optimization, and difficulty in determining the optimum network configuration explicitly. In contrast, RBFNNs allow modeling of nonlinear data using a linear approach, which guarantees an optimal (unique) solution. Their parameters can be adjusted by fast linear methods. They have the advantage of short training times and are guaranteed to reach the global minimum of the error surface during training. The optimization of their topology and learning parameters is easy to implement. Many problems in chemistry and chemical engineering have been successfully solved by the use of RBFNNs: multivariate calibration (Fischbacher et al., 1997; Li et al., 2000), QSPR (Lohniger, 1993; Tetteh et al., 1996; Yao et al., 2001), classification (Stubbings and Hutter, 1999; Pulido et al., 1999), etc.

The goal of the present work was to establish a QSPR model that could be used for the prediction of critical properties of substituted benzenes from their molecular structures and to show the flexible modeling ability of the RBF neural network. MLR and the radial basis function neural network method are utilized to establish quantitative linear and non-linear relationships between critical pressure and molecular descriptors, respectively.

2. Methodology

2.1. Data set

All critical pressure data in the present investigation were obtained from the literature (Yaws, 1999). The compounds include a diverse set of substituted benzenes. The functional groups contained among the data set include alkyl, halogen, alcoholic, phenolic, hydroxyl, amino, cyano, nitro, thio, ester, ether, ketones, carboxylic acids and aldehyde. A complete list of the compound names and corresponding experimental critical pressures is shown in Table 1 (interested readers can obtain the data sets on request by email: [email protected]). The compounds range in size from benzene to dodecylbenzene. Molecular weight ranges from 78 to 446.57 and critical pressure ranges from 12.87 to 88.10 bar. The data set was divided into two subsets: a training set of 148 compounds and a test set of 25 compounds. The training set was used to adjust the parameters of the RBFNNs and the test set was used to evaluate its prediction ability. Leave-one-out cross-validation was used to avoid overfitting of the network.

2.2. Molecular descriptor generation

In the present work, four types of descriptors were used: constitutional, topological, empirical and quantum chemical descriptors. Constitutional descriptors are basically related to the number of atoms and bonds in each molecule. Topological descriptors include valence and non-valence molecular connectivity indices calculated from the hydrogen-suppressed formula of the molecule, encoding information about the size, composition and the degree of branching of a molecule. Empirical descriptors are descriptors that account for a particular aspect of the molecule. Two of these descriptors, the unsaturation index (UI) and the hydrophilicity factor (Hy), were calculated using the method proposed by Todeschini and colleagues (Todeschini, 1996). Quantum chemical descriptors include information about binding and formation energies, partial atomic charge, dipole moment, energy levels in the molecule, vibration energies, and inertia moments of molecules. The calcu-


Table 1. The compounds and the predicted results of critical pressure (bar)

MLR RBFNNsCompounds Critical pressureNo.

1 25.50n-Butyl benzoate 26.2225.9027.10 27.2126.04n-Pentylbenzene2*29.80 29.203 p-tert-Amylphenol 29.8022.40 22.4923.304 Diethyl phthalate

5 22.30m-Diisopopylbenzene 23.2924.5022.90 23.8124.506 p-Diisopropylbenzene24.90 24.837 n-Hexylbenzene 23.8022.70 23.2223.368 1,2,3-Triethylbenzene23.30 23.599* 1,2,4-Triethylbenzene 23.3622.90 23.4923.361,3,5-Triethylbenzene1021.50 21.8811 Hexamethylbenzene 22.3820.70 20.8620.2012 n-Octylbenzene

13 18.901,2,3,4-Tetraethylbenzene 19.2819.3019.60 19.3819.301,2,3,5-Tetraethylbenzene1418.90 19.3415 1,2,4,5-Tetraethylbenzene 19.3021.80 21.5322.8016* p-tert-Octylphenol19.00 19.2317 n-Nonylbenzene 18.9521.60 21.1020.70n-Nonylphenol1818.50 18.4019 Dibutylphthlate 17.5017.60 17.8217.7020 n-Decylbenzene15.80 16.2521 Pentaethylbenzene 16.2216.00 16.6216.7222 n-Undecylbenzene15.00 15.5623* n-Dodecylbenzene 15.7914.10 14.6615.0024 n-Tridecylbenzene13.70 13.8425 Phenyltetradecane 14.1912.90 12.9012.87Phenylhexadecane2629.00 29.0427 Hexachlorobenzene 28.5035.80 34.8534.9028 Chloro-2,4-dinitrobenzene

1,2-Dichloro-4-nitrobenzene 36.00 37.20 37.152938.40 38.3137.2030* 1,2,4-Trichlorobenzene33.50 32.7531 1,3,5-Trinitrobenzene 33.9041.60 42.3146.6032 m-Dibromobenzene40.50 40.5233 m-Chloronitrobenzene 39.8040.20 40.2039.80o-Chloronitrobenzene3440.30 40.8135 p-Chloronitrobenzene 39.8041.80 41.4940.7036 m-Dichlorobenzene42.00 41.2437* o-Dichlorobenzene 40.7042.50 41.8140.7038 p-Dichlorobenzene41.70 43.9639 m-Difluorobenzene 40.6741.00 41.8740.6740 o-Difluorobenzene41.90 43.2741 p-Difluorobenzene 44.0037.60 38.3638.50m-Dinitrobenzene4237.00 37.7043 o-Dinitrobenzene 38.5038.50 39.2238.5044* p-Dinitrobenzene

45 45.19 45.30 45.00Bromobenzene45.80 44.7645.19Chlorobenzene4652.10 52.2947 m-Chlorophenol 53.2052.50 52.1950.0048 o-Chlorophenol53.00 52.9049 p-Chlorophenol 53.2039.20 41.0241.103,4-Dichloroaniline5045.30 45.8551* Fluorobenzene 45.5147.70 46.5745.1952 Iodobenzene

53 44.00 43.00 43.80Nitrobenzene48.40 47.4448.9854 Benzene44.40 46.2755 m-Chloroaniline 45.9044.70 45.9345.9056 o-Chloroaniline

57 45.90 45.00 46.14p-Chloroaniline41.10 41.8844.2058* m-Nitroaniline


Table 1 (Continued)

Critical pressure MLRCompounds RBFNNsNo.

44.20 43.1059 43.66o-Nitroaniline44.20 42.00p-Nitroaniline 43.596061.30 58.0061 57.99Phenol74.90 70.701,2-Benzenediol 72.586274.90 71.2063 76.871,3-Benzenediol88.10 85.401,2,3-Benzenetriol 88.256447.40 47.7065* 48.54Phenyl mercaptan53.09 50.20Aniline 51.396651.80 54.2067 53.69m-Phenylenediamine51.80 52.60o-Phenylenediamine 50.296851.80 53.8069 50.75p-Phenylenediamine49.10 51.10Phenylhydrazine 50.347027.40 28.8071 26.654-Chloro-3-nitrobenzotrifluoride28.10 28.302,4-Dichlorobenzotrifluoride 28.3772*33.30 33.9073 33.753,4-Dichlorophenyl isocyanate36.80 35.00m-Chlorobenzoyl chloride 34.827428.00 30.5075 30.82Nitrobenzotrifluoride40.60 36.50Benzoyl chloride 38.257640.30 40.9077 40.39o-Chlorobenzoic acid33.40 32.60Benzotrichloride 33.687833.90 34.6079* 36.44Benzotrifluoride42.15 43.50Benzonitrile 43.768040.60 40.9081 40.57Phenyl isocyanate30.40 33.102,4,6-Trinitrotoluene 31.148236.50 36.3083 36.54Benzyl dichloride35.90 36.502,4-Dichlorotoluene 36.308434.00 35.1085 35.252,4-Nitrotoluene34.00 35.502,5-Dinitrotoluene 35.4986*36.00 36.1087 35.982,6-Dinitrotoluene34.00 33.303,4-Dinitrotoluene 32.728834.00 34.4089 34.693,5-Dinitrotoluene44.70 44.80Benzoic acid 43.629049.90 46.8091 47.70p-Hydroxybenzaldehyde49.90 46.50Salicylaldehyde 47.219251.80 56.9093* 60.20Salicylic acid43.70 39.70p-Bromotoluene 39.769439.10 39.9095 39.37Benzyl chloride39.10 39.60o-Chlorotoluene 38.979639.10 40.1097 39.41p-Chlorotoluene38.15 39.90o-Fluorotoluene 40.139838.00 38.6099 38.98o-Nitrotoluene38.00 38.30p-Nitrotoluene 39.25100*37.60 38.60101 37.61o-Nitroanisole41.09 42.30Toluene 41.7110241.75 42.40103 41.16Anisole45.50 49.40Benzyl alcohol 46.3110445.60 50.00105 49.90m-Cresol50.06 50.10o-Cresol 49.2610651.50 50.60107* 49.93p-Cresol49.70 48.40p-Methoxyphenol 47.2410843.20 39.90109 42.13Benzylamine41.54 42.10m-Toluidine 43.2111040.00 42.50111 42.93p-Toluidine43.80 45.40Toluenediamine 43.7911244.03 42.70113 42.42Ethynylbenzene40.00 37.80Styrene 37.06114*38.40 35.60115 36.84Acetophenone36.70 
35.50p-Tolualdehyde 36.3511635.90 34.40117 35.17Methyl benzoate


Table 1 (Continued)

MLR RBFNNsCritical pressureNo. Compounds

39.50118 38.44o-Toluic acid 38.6039.80 38.9038.60p-Toluic acid11939.40 38.80120 Methyl salicylate 40.9038.30 36.9640.10121* Vanillin37.30122 36.98Ethylbenzene 36.0936.90 36.6435.41123 m-Xylene36.80 36.42124 o-Xylene 37.3437.40 36.8635.11125 p-Xylene44.50 43.46126 p-Ethylphenol 42.9043.40 40.6039.20Phenyl ethanol12743.80 42.84128* 2,4-Xyenol 44.0043.30 42.1043.00129 2,6-Xylenol

130 36.48 43.30 43.193,5-Xylenol39.80 36.1236.27n,n-Dimethylaniline13136.10 36.42132 o-Ethylaniline 37.4034.20 34.2435.70133 p-Phenetidine33.00 32.56134 m-Methylstyrene 32.9032.60 32.3734.70o-Methylstyrene135*33.50 32.91136 p-Methylstyrene 33.6032.50 32.2931.80137 Benzyl acetate

31.80Ethyl benzoate 31.00 31.7413834.40 33.0332.70139 Ethyl vanillin32.30 32.53140 Cumene 32.0932.60 32.4030.40141 o-Ethyl toluene

31.27Mesitylene 31.90 32.14142*33.40 33.2032.00n-Propylbenzne143

34.541,2,3-Trimethylbenzene 31.80 31.8614432.40 32.18145 1,2,4-Trimethylbenzene 32.3232.70 32.1131.10146 Benzyl ethyl ether37.40 35.96147 2-Phenyl-2-propanol 34.9029.30 28.9731.20148 m-Divinylbenzene34.20 34.38149* Methylindene 34.6034.70 34.4534.60Methylindene15026.10 26.63151 Dimethyl phthalate 27.8026.60 27.3127.80152 Dimethyl terephthlate31.20 29.16153 Anethole 29.0030.40 30.0628.87154 n-Butylbenzene28.90 29.12155 sec-Butylbenzene 29.5127.80 28.5529.70156* tert-Butylbenzene28.20 28.14157 1,2,3,4-Tetramethylbenzene 28.4028.50 28.8329.30m-Cymene15828.40 28.60159 o-Cymene 29.3029.00 29.1628.37160 p-Cymene

161 29.20m-Diethylbenzene 29.2928.8029.70 29.7128.03p-Diethylbenzene16228.30 28.46163* 2-Ethyl-m-xylene 30.2028.90 28.7428.80164 2-Ethyl-p-xylene28.40 28.57165 Ethyl-o-xylene 28.8028.90 28.8728.80Ethyl-m-xylene16629.00 28.93167 Ethyl-o-xylene 28.8028.50 28.7927.50168 Ethyl-m-xylene

169 29.20Isobutylbenzene 29.4530.4028.40 27.9929.70170* 1,2,3,5-Tetramethylbenzene28.50 28.30171 1,2,4,5-Tetramethylbenzene 29.3833.50 33.0533.40172 p-tert-Butylphenol

173 31.80n,n-Diethylaniline 28.7028.50

RMS: 1.767 (MLR), 1.547 (RBFNNs)

* Test set.


Fig. 1. The typical architecture of the RBFNNs.

regression was employed to develop the linear model of the property of interest, which takes the form:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$

In this equation, Y is the property, that is, the dependent variable, X1 to Xn represent the specific descriptors, while b1 to bn represent the coefficients of those descriptors; b0 is the intercept of this equation.
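The linear form above can be fitted by ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the function names `fit_mlr` and `predict_mlr` and the toy data are illustrative assumptions and are not taken from the paper, whose programs were written in MATLAB.

```python
import numpy as np

def fit_mlr(X, y):
    """Return [b0, b1, ..., bn] for the least-squares fit of Y = b0 + sum(bi * Xi)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets the solver estimate the intercept b0.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Evaluate Y = b0 + b1*X1 + ... + bn*Xn for each row of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]

# Toy data generated from Y = 2 + 3*X1 - X2 (hypothetical, not from Table 1).
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0], [4.0, 5.0]])
y_toy = 2.0 + 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
b = fit_mlr(X_toy, y_toy)
```

On this noise-free toy set the recovered coefficients reproduce the generating intercept and slopes, which is a quick sanity check before applying the same routine to real descriptor matrices.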

2.5. Radial basis function neural networks theory

RBFNNs can be described as a three-layer feedforward structure. As presented schematically in Fig. 1, the RBFNNs consist of three layers: input layer, hidden layer and output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. The hidden layer of RBFNNs consists of a number of RBF units (nh) and bias (bk). Each hidden layer unit represents a single radial basis function, with associated center position and width. Each neuron on the hidden layer employs a radial basis function as a non-linear transfer function to operate on the input data. The most often used RBF is the Gaussian function, which is characterized by a center (cj) and width (rj). In this paper, considerations were limited to Gaussian functions with a constant width, which was the same for all dimensions. The RBF functions by measuring the Euclidean distance between the input vector (x) and the radial basis function center (cj) and performs the non-linear transformation in the hidden layer as given in Eq. (1):

$$h_j(x) = \exp\left(-\frac{\lVert x - c_j \rVert^2}{r_j^2}\right) \quad (1)$$

in which hj is the notation for the output of the jth RBF unit. For the jth RBF, cj and rj are the center and width, respectively. The operation of the output layer is linear, as given in Eq. (2):

$$y_k(x) = \sum_{j=1}^{n_h} w_{kj}\, h_j(x) + b_k \quad (2)$$

where yk is the kth output unit for the input vector x, wkj is the weight connection between the kth output unit and the jth hidden layer unit, and bk is the bias.
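Eqs. (1) and (2) amount to a Gaussian transformation followed by a linear combination. A minimal sketch is given below, assuming NumPy, a single output unit and one width shared by all hidden units (as the paper restricts itself to a constant width); the function names and the demo values are illustrative assumptions.

```python
import numpy as np

def rbf_hidden(X, centers, width):
    """Eq. (1): h_j(x) = exp(-||x - c_j||^2 / r^2), with one shared width r."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    centers = np.atleast_2d(np.asarray(centers, dtype=float))
    # Squared Euclidean distance between every input row and every center.
    diff = X[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-sq_dist / width ** 2)

def rbf_output(X, centers, width, weights, bias):
    """Eq. (2) for a single output unit: y(x) = sum_j w_j * h_j(x) + b."""
    hidden = rbf_hidden(X, centers, width)
    return hidden @ np.asarray(weights, dtype=float) + bias

# Hypothetical two-unit network evaluated at the first center.
centers_demo = np.array([[0.0, 0.0], [1.0, 1.0]])
h_demo = rbf_hidden([[0.0, 0.0]], centers_demo, width=1.0)
y_demo = rbf_output([[0.0, 0.0]], centers_demo, 1.0, weights=[2.0, 0.0], bias=0.5)
```

An input that coincides with a center gives that unit an activation of exactly 1, and the activation decays with squared distance scaled by the width.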

The training procedure when using RBF consists of calculating the centers, width, and weights. There are various ways for selecting the centers, such as random subset selection, K-means clustering, orthogonal least squares learning algorithm, RBF-PLS, etc. In this paper, the forward subset selection routine (Orr, 1995) was used to select the centers from training set samples. The adjustment of the connection weight between hidden layer and output layer is performed using a least-squares solution after the selection of centers and width of radial basis functions.

The overall performance of RBFNNs is evaluated in terms of root mean squared error (RMS) according to the equation:

lation of quantum chemistry descriptors was as follows: the molecules were drawn in HyperChem (1994) and pre-optimized using the MM+ molecular mechanics force field; a more precise optimization was then done with the semi-empirical AM1 Hamiltonian, and thus quantum chemical descriptors were obtained. All calculations were carried out at the restricted Hartree–Fock level with no configuration interaction. The molecular structures were optimized using the Polak–Ribiere algorithm until the root mean square gradient was 0.01. The resulting geometry was transferred into the software 3DQSAR/WHIM (Todeschini, 1996), which can calculate constitutional, topological, and empirical descriptors. A full list of the 35 descriptors calculated is given in Table 4.

2.3. Feature selection

Once descriptors were generated, descriptor-screening methods were used to select the most relevant descriptors to establish the models to predict the molecular property. Here, the forward stepwise regression method was used to choose the subset of molecular descriptors. Forward stepwise regression starts with no model terms and at each step adds the most statistically significant term (the one with the highest F-statistic or lowest P-value) until there are none left. The best model was taken to be the one at which adding a further descriptor no longer improved the RMS error of the leave-one-out cross-validation model.
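The selection loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' routine: candidates are ranked directly by leave-one-out RMS error rather than by F-statistic, the LOO residuals are obtained from the hat-matrix identity e_i / (1 - h_ii) for least squares, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def loo_rms(X, y):
    """LOO RMS error of a least-squares fit with intercept, via e_i / (1 - h_ii)."""
    design = np.column_stack([np.ones(len(y)), X])
    hat = design @ np.linalg.pinv(design)      # projection onto the column space
    resid = y - hat @ y
    loo_resid = resid / (1.0 - np.diag(hat))
    return float(np.sqrt(np.mean(loo_resid ** 2)))

def forward_select(X, y):
    """Add one descriptor at a time until the LOO RMS error stops improving."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []
    best = loo_rms(np.empty((len(y), 0)), y)   # intercept-only baseline
    while len(selected) < X.shape[1]:
        scores = {j: loo_rms(X[:, selected + [j]], y)
                  for j in range(X.shape[1]) if j not in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:             # no improvement: stop
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# Synthetic example: only columns 1 and 3 carry signal (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 3] + 0.1 * rng.normal(size=40)
selected_demo, rms_demo = forward_select(X_demo, y_demo)
```

Ranking by cross-validated error keeps the sketch short; a faithful reimplementation would rank candidates by partial F-test and use the LOO error only as the stopping rule.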

2.4. Regression analysis

After the descriptor was selected, multiple linear


$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n_s} (y_i - \hat{y}_i)^2}{n_s}} \quad (3)$$

where $y_i$ is the desired output and $\hat{y}_i$ is the actual output of the network for the ith compound, and ns is the number of compounds in the analyzed set.
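As a small worked illustration of Eq. (3), the sketch below computes the RMS error for two short lists; the function name `rms_error` and the numbers are hypothetical and are not values from Table 1.

```python
import math

def rms_error(desired, predicted):
    """Eq. (3): square root of the mean squared difference between two sequences."""
    if len(desired) != len(predicted):
        raise ValueError("desired and predicted must have the same length")
    squared = sum((d - p) ** 2 for d, p in zip(desired, predicted))
    return math.sqrt(squared / len(desired))

# One of three predictions is off by 2, so RMS = sqrt(4 / 3).
rms_demo = rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```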

2.6. Neural networks implementation and computation environment

All calculation programs were written as MATLAB (The Mathworks, 1996) M-files and compiled using the MATCOM compiler running under the Redhat Linux 6.0 operating system on a Pentium 266 PC with 128 MB of RAM.

3. Results and discussion

First, a stepwise regression routine was used to develop the linear model for the prediction of the critical pressure of substituted benzenes using calculated structural descriptors. The best linear model contains nine molecular descriptors. The regression coefficients of the descriptors and their physical–chemical meaning are listed in Table 2. This model produced an RMS error of 1.76 and a correlation coefficient of 0.983 for the training set compounds. The external test set had an RMS error of 1.90 using leave-one-out cross-validation.

The critical pressure of a compound is determined by different interactions between molecules. These interactions include dispersion interaction, dipole–dipole interaction, dipole-induced interaction and hydrogen bonding interaction. The descriptors involved in the present equation can represent these interactions. The dispersion interaction is mainly determined by the molecular size, as can be described by three topological indices: mean information on magnitude of distance, total path count, and Randic index of 0 order (chi0). The dipole–dipole interaction was described by the local dipole moment and total charge of the molecule. The hydrogen bonding information was contained in two constitutional descriptors (the number of OH groups and the number of hydrogen bonding donors) and one quantum chemical descriptor (LUMO energy level). The empirical descriptor Hy is the hydrophilicity index based on atom/group counting. It is an information index calculated by the following expression:

where nHy, nC and A are the number of hydrophilic groups (–OH, –NH, –SH, etc.), of carbon atoms, and the total number of non-hydrogen atoms, respectively. The minimum Hy value tends to −1 for large aliphatic compounds and has maximum values for H2O2 (3.64) and H2O (3.44). This descriptor can be interpreted as both a hydrogen bonding descriptor and a size descriptor.

After the establishment of a linear model, a radial basis function network was used to develop a non-linear model based on the same subset of descriptors. The RBFNNs has nine inputs (a set of nine molecular descriptors), one output layer unit (critical pressure) and one hidden layer of nh units. Such a RBFNNs can be designated as a 9–nh–1 net to indicate the number of units in the input, hidden and output layers, respectively. A RBFNNs is completely specified by choosing the following parameters:

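$$\mathrm{Hy} = \frac{(1 + n_{Hy}) \cdot \log_2 (1 + n_{Hy}) + n_C \cdot \left(\frac{1}{A} \cdot \log_2 \frac{1}{A}\right) + \sqrt{\frac{n_{Hy}}{A^2}}}{\log_2 (1 + A)}$$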

Table 2. Descriptors, coefficients, standard errors, and t-values for the linear model

Descriptor | Chemical meaning | Coefficient | S.E. | t-value
Intercept | Constant | 151.637 | 6.744 | 22.483
Hy | Hydrophilic index | 37.970 | 1.934 | 19.628
IDM | Mean information on magnitude of distance | −12.124 | 1.669 | −7.264
NHD | Number of hydrogen bonding donors | −28.184 | 1.885 | −14.953
Qtot | Total charge | 3.890 | 0.426 | 9.134
HOMO | HOMO energy level | 1.854 | 0.425 | 4.360
NOH | Number of –OH groups | 10.051 | 0.583 | 17.253
TPC | Total path count | 0.0736 | 0.015 | 4.939
Chi0 | Randic index of 0 order | −2.756 | 0.799 | −3.449
Ldip | Local dipole moment | −2.850 | 1.090 | −2.614

R (correlation coefficient): 0.983
RMS (root mean square error): 1.767


Fig. 2. Predicted versus experimental critical pressure (MLR).

Fig. 3. The width of RBFNNs versus RMS error on LOO cross-validation.

• The number nh of radial basis functions;
• The center cj and width rj of each radial basis function;
• The connection weights wkj between the jth hidden layer unit and kth output unit.

The number of radial basis functions (the hidden layer units) nh greatly influences the performance of a RBFNNs. If the number is too low, the network may not calculate a proper estimation of the data. On the other hand, if too many hidden layer units are used, the network tends to overfit the training data. In this paper, the radial basis functions were added one by one and the process was terminated when adding a new basis function no longer improved the performance of the network. The centers


Table 3. A full list of centers selected for RBFNNs

No. | Compounds
133 | p-Phenetidine
28 | 1-Chloro-2,4-dinitrobenzene
26 | 1-Phenylhexadecane
38 | p-Dichlorobenzene
140 | Cumene
4 | Diethyl phthalate
105 | m-Cresol
13 | 1,2,3,4-Tetraethylbenzene
109 | Benzylamine
108 | p-Methoxyphenol
88 | 3,4-Dinitrotoluene
33 | m-Chloronitrobenzene
85 | 2,4-Nitrotoluene
131 | n,n-Dimethylaniline
103 | Anisole
72 | 2,4-Dichlorobenzotrifluoride
29 | 1,2-Dichloro-4-nitrobenzene
53 | Nitrobenzene
11 | Hexamethylbenzene

set samples). The model starts empty, and the radial basis function added at each step is the one that reduces the sum of squared errors most. This process of adding hidden units and increasing the model complexity is continued until some criterion such as GCV stops decreasing. The criterion of selection used here is an approximation of the leave-one-out (LOO) cross-validation method, according to the equation below:

$$\sigma_{\mathrm{LOO}}^{2} = \frac{y\,P\,(\mathrm{diag}(P))^{-2}\,P\,y}{p} \quad (4)$$

where y is the output of the network, P is the projection matrix, which can be computed by P = Ip − ZZt from the output matrix Z of the hidden layer units and the unit matrix I with dimension p; p is the number of patterns in the training set. The LOO cross-validation method was used to prevent the network from overfitting.
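The forward selection of centers with a leave-one-out stopping rule can be sketched as below. This is an illustration under several assumptions rather than Orr's routine or the authors' MATLAB code: the design matrix H is stored as samples by centers (the transpose of the Z used in the text), the projection matrix is taken as the standard least-squares form P = I - H H^+ (with H^+ the pseudo-inverse), no bias unit is included, and all names and the demo data are hypothetical.

```python
import numpy as np

def gaussian_design(X, centers, width):
    """Hidden-layer outputs, one row per sample and one column per center."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / width ** 2)

def loo_variance(H, y):
    """LOO error estimate: sum_i ((P y)_i / P_ii)^2 / p, with P = I - H H^+."""
    p = len(y)
    P = np.eye(p) - H @ np.linalg.pinv(H)
    return float(np.sum((P @ y / np.diag(P)) ** 2) / p)

def forward_select_centers(X, y, width, max_centers=None):
    """Greedily add training samples as centers while the LOO estimate decreases."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    F = gaussian_design(X, X, width)          # candidate j is centered on sample j
    chosen = []
    best = float(np.mean(y ** 2))             # empty model: P is the identity
    limit = max_centers or len(y) - 1
    while len(chosen) < limit:
        sse = {}
        for j in range(len(y)):
            if j in chosen:
                continue
            H = F[:, chosen + [j]]
            resid = y - H @ (np.linalg.pinv(H) @ y)
            sse[j] = float(resid @ resid)
        j_best = min(sse, key=sse.get)        # largest drop in squared error
        score = loo_variance(F[:, chosen + [j_best]], y)
        if not score < best:                  # stop when LOO no longer improves
            break
        chosen.append(j_best)
        best = score
    return chosen, best

# Target is a single Gaussian bump centered on sample 10 (x = 0.5).
x_demo = np.linspace(0.0, 1.0, 21)[:, None]
y_demo = 2.0 * np.exp(-np.sum((x_demo - 0.5) ** 2, axis=1) / 0.3 ** 2)
chosen_demo, score_demo = forward_select_centers(x_demo, y_demo, width=0.3)
```

Because the demo target is itself one of the candidate basis functions, the first selected center is the sample at x = 0.5. The number of hidden units falls out of the stopping rule, which is the property the text highlights for this selection method.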

After the selection of the centers and number of hidden layer units, the connection weights can be easily calculated by linear least squares methods:

$$w = y\,Z'\,(Z Z')^{-1} \quad (5)$$

where y is the matrix of training example targets, Z is the matrix of hidden layer unit outputs, Z′ is the transpose of matrix Z and w is the weight matrix connecting the hidden layer and the output layer.
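Eq. (5) can be evaluated without forming the inverse explicitly. A minimal sketch follows, assuming NumPy, a single output, and Z stored with one row per hidden unit and one column per training sample (the orientation implied by Eq. (5)); the function name and the demo matrix are hypothetical.

```python
import numpy as np

def output_weights(Z, y):
    """Eq. (5): w = y Z' (Z Z')^{-1}, with Z of shape (n_hidden, n_samples)."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving (Z Z') w = Z y gives the same w and is numerically safer than inv().
    return np.linalg.solve(Z @ Z.T, Z @ y)

# Targets built from known weights so the recovery can be checked.
Z_demo = np.array([[1.0, 0.5, 0.1, 0.0],
                   [0.0, 0.2, 0.7, 1.0]])
w_true = np.array([3.0, -2.0])
t_demo = w_true @ Z_demo
w_demo = output_weights(Z_demo, t_demo)
```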

The optimal width was selected by experimenting with a number of trials and selecting the one most favored by the model selection criterion: a width smaller than 1 gives poor prediction ability, while varying the width shows that it has little effect on the performance of RBFNNs once it exceeds 10. So we select the

of RBFNNs are determined with a forward subset selection method proposed by Orr (1995). The advantage of this method over other center selection methods is that it can determine the number of hidden layer units simultaneously, so there is no need to fix the number of hidden layer units in advance. This method also has a tractable model order selection. This method goes through a process of selecting a subset of radial basis functions from a larger set of candidates (training

Fig. 4. Predicted versus experimental critical pressure (RBFNNs).


Table 4. A full list of descriptors used

Descriptor | Chemical meaning
LUMO | LUMO energy level
HOMO | HOMO energy level
Xdip | Dipole moment in the x axis
Ydip | Dipole moment in the y axis
Zdip | Dipole moment in the z axis
Dipole | Dipole moment
HOF | Heat of formation
ET | Total energy
Electron | Electronic energy
Core | Core–core interaction energy
Qtot | Total charge
Ldip | Local dipole moment
UI | Unsaturation index
Hy | Hydrophilic index
NC | Number of C atoms
NO | Number of O atoms
NOH | Number of OH groups
TPC | Total path count
NH | Number of H atoms
NHA | Number of hydrogen bonding acceptors
NHD | Number of hydrogen bonding donors
MW | Molecular weight
TPC | Total path count
Chi0 | Randic index of 0 order
Chi1 | Randic index of 1 order
Chi2 | Randic index of 2 order
IDM | Mean information on magnitude of distance
ICEN | Centric information index
IDE | Mean information on distance equality
IDDE | Information on equality of distance degrees
IDDM | Information on equality of distance magnitude
IDMT | Total information on magnitude of distance
IDET | Total information on distance equality
ZM1 | Zagreb index 1
ZM2 | Zagreb index 2

network gives an RMS of 1.547 bar for the whole set, 1.385 bar for the training set, and 2.372 bar for the prediction set. The performance of RBFNN is better than that obtained by multiple linear regression (Fig. 2). Analysis of the results obtained indicates that the model we proposed correctly represents the structure–property relationships of these compounds and that molecular descriptors calculated solely from structures can represent the structural features of the compounds responsible for their critical properties.

4. Conclusion

A new method for the prediction of critical pressure of substituted benzenes using multiple linear regression and RBFNNs based on descriptors calculated from molecular structure alone (Table 4) has been developed. Very satisfactory results were obtained with the proposed method. The models proposed could also provide some insight into what structural features are related to the critical pressure of these compounds. Additionally, nonlinear models using RBFNNs based on these same sets of descriptors produced even better models with good predictive ability. RBFNNs proved to be a useful tool in the prediction of critical pressure and exhibited a high speed of learning when compared with multi-layered feedforward neural networks trained with the BP algorithm. The training procedure is also simpler when using RBFNNs because there are fewer parameters to be optimized: the width of the radial basis function and the number of units in the hidden layer. Furthermore, the proposed approach can also be extended to other QSPR investigations.

Acknowledgements

The authors thank the Association Franco-Chinoise pour la Recherche Scientifique et Technique (AFCRST) for supporting this study (Programme PRA SI 00-05). Special thanks are given to Professor Todeschini and other members of the Milano Chemometrics and QSAR Research Group for providing the WHIM-3D/QSAR software package for use in this research.

References

Beck, B., Breindl, A., Clark, T., 2000. J. Chem. Inf. Comput. Sci. 40, 1046.

Breindl, A., Beck, B., Clark, T., Glen, R.C., 1997. J. Mol. Model. 3, 142.

Egolf, L.M., Wessel, M.D., Jurs, P.C., 1994. J. Chem. Inf. Comput. Sci. 34, 947.

optimal width from the range 1 to 10 in steps of 0.1. The minimum LOO cross-validation error obtained at each width was plotted versus the width (Fig. 3), and the width giving the overall minimum was chosen as the optimal condition. In this case: r=3.5 and nh=19.
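The width scan can be sketched as a simple grid search. The example below is a simplification under stated assumptions: the centers are fixed in advance instead of being re-selected by forward selection at every width, the LOO error is computed with the hat-matrix identity for linear least squares, and the one-dimensional data and all names are hypothetical.

```python
import numpy as np

def loo_rms_for_width(X, y, centers, width):
    """LOO RMS error of a bias-plus-Gaussian-units model at a given width."""
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    H = np.column_stack([np.ones(len(y)), np.exp(-d2 / width ** 2)])
    hat = H @ np.linalg.pinv(H)
    loo_resid = (y - hat @ y) / (1.0 - np.diag(hat))
    return float(np.sqrt(np.mean(loo_resid ** 2)))

def scan_widths(X, y, centers, widths):
    """Return the width with the lowest LOO RMS error and the full error curve."""
    errors = np.array([loo_rms_for_width(X, y, centers, r) for r in widths])
    return float(widths[int(np.argmin(errors))]), errors

# Grid from 1 to 10 in steps of 0.1, as in the text (91 values).
widths_demo = np.linspace(1.0, 10.0, 91)
X_demo = np.linspace(0.0, 30.0, 40)[:, None]
y_demo = np.sin(X_demo[:, 0] / 5.0)
centers_demo = X_demo[::5]                   # every fifth sample as a fixed center
best_width, errors_demo = scan_widths(X_demo, y_demo, centers_demo, widths_demo)
```

Plotting `errors_demo` against `widths_demo` gives the kind of curve shown in Fig. 3; the position of its minimum depends on the scale of the descriptors, so the value found on toy data says nothing about the r=3.5 reported here.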

Through the above process, the best number of hidden layer units and the optimum width were selected as 19 and 3.5, respectively. The selected centers and their distribution among the training samples are listed in Table 3. As can be seen, the centers selected correspond very well with the distribution of training set samples. The inputs of the test set were then presented to the best network, and the RBFNNs predictions were obtained. They are shown in Table 1 and Fig. 4. The


Fischbacher, C., Jageman, K.U., Danzer, K., Muller, U.A., Papenkordt, L., Schuler, J., 1997. Fresenius J. Anal. Chem. 359, 78.

Gasteiger, J., Zupan, J., 1993. Angew. Chem. Int. Ed. Engl. 32, 503.

Hall, L.H., Story, C.T., 1996. J. Chem. Inf. Comput. Sci. 36, 1004.

HyperChem 4.0, Hypercube, Inc, 1994.

Li, Q.F., Yao, X.J., Chen, X.G., Liu, M.C., Zhang, R.S., Zhang, X.Y., Hu, Z.D., 2000. Analyst 125, 2049.

Lohniger, H., 1993. J. Chem. Inf. Comput. Sci. 33, 736.

Orr, M.J.L., 1995. Neural Comput. 7, 606.

Pompe, M., Razinger, M., Novic, M., Veber, M., 1997. Anal. Chim. Acta 348, 215.

Pulido, A., Ruisanchez, I., Rius, F.X., 1999. Anal. Chim. Acta 388, 273.

Stubbings, T., Hutter, H., 1999. Chemometr. Intell. Lab. Syst. 49, 163.

Sutter, J.M., Peterson, T.A., Jurs, P.C., 1997. Anal. Chim. Acta 342, 113.

Tetteh, J., Metcalfe, E., Howells, S.L., 1996. Chemometr. Intell. Lab. Syst. 32, 177.

Todeschini, R., 1996. WHIM-3D/QSAR Software for the Calculation of the WHIM Descriptors, Release 2.1 for Windows. Talete, Milan.

Walczak, B., Massart, D.L., 2000. Chemometr. Intell. Lab. Syst. 50, 179.

Yao, X.J., Zhang, X.Y., Zhang, R.S., Liu, M.C., Hu, Z.D., Fan, B.T., 2001. Comput. Chem. 25, 475.

Yaws, C.L., 1999. Chemical Properties Handbook. McGraw-Hill, New York.

Zupan, J., Gasteiger, J., 1991. Anal. Chim. Acta 248, 1.