Prediction of GFP spectral properties using artificial neural network



CHANIN NANTASENAMAT,¹ CHARTCHALERM ISARANKURA-NA-AYUDHYA,¹ NATTA TANSILA,¹ THANAKORN NAENNA,² VIRAPONG PRACHAYASITTIKUL¹

¹Department of Clinical Microbiology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand

²Department of Industrial Engineering, Faculty of Engineering, Mahidol University, Nakhon Pathom 73170, Thailand

Received 12 September 2005; Accepted 22 February 2006

DOI 10.1002/jcc.20656

Published online 13 February 2007 in Wiley InterScience (www.interscience.wiley.com).

Abstract: The prediction of the excitation and emission maxima of green fluorescent protein (GFP) chromophores was investigated by a quantitative structure-property relationship study. A data set of 19 GFP color variants and an additional data set of 29 synthetic GFP chromophores were collected from the literature. An artificial neural network implementing the back-propagation algorithm was employed. The proposed computational approach reliably predicted the excitation and emission maxima of GFP chromophores with correlation coefficients exceeding 0.9. The usefulness of quantum chemical descriptors was revealed by a comparative study with other molecular descriptors. Assignment of the appropriate protonation state of the chromophore for the GFP color variants data set was shown to be necessary for good predictive performance. Results suggest that the confinement of the GFP chromophore has no significant influence on the predictive performance for the data set used. A comparative investigation with traditional modeling methods, particularly multiple linear regression and partial least squares, reveals that the artificial neural network is the most suitable modeling approach for the GFP spectral properties. It is anticipated that this methodology has great potential for accelerating the design and engineering of novel GFP color variants of scientific or industrial interest.

© 2007 Wiley Periodicals, Inc. J Comput Chem 28: 1275–1289, 2007

Key words: green fluorescent protein; chromophore; absorbance; fluorescence; neural network; partial least squares; quantitative structure-property relationship

Introduction

Green fluorescent protein (GFP) is an autofluorescent protein of 238 amino acid residues that is isolated from the outer dermal layer of the Pacific Northwest jellyfish, Aequorea victoria. GFP takes the shape of a β-barrel made of 11 β-strands and is 24 and 42 Å in diameter and length, respectively. The chromophore resides in the central α-helix and is protected within the confinement of the β-barrel.¹ No cofactors or substrates are needed for GFP to mature and fluoresce; however, the chromophore undergoes a series of post-translational modifications. Once mature, GFP is resistant to a variety of destructive forces such as temperature and pH.² The popularity of GFP began in 1992 when Prasher et al. cloned and sequenced the cDNA of Aequorea GFP.³ Shortly after, Chalfie et al. demonstrated that GFP could be expressed in a variety of cell types.⁴ Many experiments have shown that GFP is amenable to fusion at either the N- or C-terminus without interfering with the function of either GFP or its fusion partner. Owing to its high stability and flexibility, GFP has been employed as a reporter for gene expression,⁵ protein localization,⁶ protein–protein interaction,⁷ protein–lipid interaction,⁸,⁹ and structural and behavioral determination of macromolecules,¹⁰ and as an analytical sensor.¹¹,¹²

The GFP chromophore, p-hydroxybenzylideneimidazolinone, is formed by a cyclization step involving the nucleophilic attack of the amide of Gly67 on the carbonyl of Ser65 to form an imidazolinone ring.² This is followed by an oxygenation step, which is time-dependent and takes approximately 2–4 h for the 1,2-dehydrogenation of Tyr66. The native GFP possesses two excitation bands, a major peak at 395 nm and a minor peak at 475 nm. The presence of the two absorption peaks can be attributed to the different protonation states of the GFP chromophore: the protonated (neutral) and deprotonated (anionic) forms of the chromophore absorb at 395 and 475 nm, respectively.²

Contract/grant sponsor: Royal Golden Jubilee Ph.D. Scholarship, Thailand Research Fund

Contract/grant sponsor: Thailand Toray Science Foundation

Contract/grant sponsor: Mahidol University; contract/grant number: 02012053-0003

Correspondence to: V. Prachayasittikul; e-mail: [email protected]

Much interest has been directed toward the engineering of novel color variants¹³⁻¹⁵ of GFP in light of its wide applicability in the life sciences. These color variants are made possible by mutations in the tripeptide chromophore. On the other hand, modifications made to the amino acid residues in the immediate vicinity of the chromophore control its protonation state, which directly influences the magnitude of the major and minor absorbance peaks.

Several theoretical studies have been reported on various aspects of the GFP chromophore, particularly the protonation state,¹⁶⁻¹⁹ cyclization,²⁰ solvent effects,²¹ and fluorescence mechanism,²² as well as prediction of the absorbance spectra²³,²⁴ and excitation maxima of some GFP mutants.²⁵ However, there has been neither a report of a computational approach to predicting the emission maxima nor a comprehensive investigation into calculating the spectral properties of a series of GFP color variants and synthetic GFP chromophores.

We report herein a quantitative structure-property relationship (QSPR) study of the quantum chemical descriptors calculated from GFP chromophores and the quantitative prediction of their excitation and emission maxima via artificial neural networks (ANN) (see Fig. 1 for an overview), multiple linear regression (MLR), and partial least squares (PLS). The data sets were derived from two sources: chromophores from GFP color variants and synthetic GFP chromophores. The descriptors were obtained from a single-point calculation (B3LYP/6-31G*) on the geometry-optimized structure (HF/3-21G) of the lowest-energy conformer derived from a Monte Carlo or systematic conformational search. Predictions of the excitation and emission maxima were obtained from back-propagation neural network calculations using optimal parameters acquired from an exhaustive empirical search. All the predictive models were validated by leave-one-out cross-validation (LOO-CV). We investigated the importance of quantum chemical descriptors, the significance of the chromophore’s protonation state, and the influence of the protective β-barrel, which encapsulates the chromophore, on the predictive performance for spectral properties. To our knowledge, this is the first report on the use of QSPR in predicting spectral properties of the GFP color variants and the synthetic GFP chromophores.

Computational Details

Data Collection

The excitation and emission maxima of 19 GFP mutants were collected from the literature (see Table 1 for individual references). Spectral data for 29 synthetic GFP chromophores were taken from the work of Follenius-Wund et al.³⁵ The initial geometries of the protein-confined chromophores (Fig. 2) and synthetic GFP chromophores (Table 2) were constructed using the molecule-building module of Spartan’04.³⁶ The three-dimensional structures of the synthetic GFP chromophores were built according to the work of Follenius-Wund et al. Likewise, the molecular structures of the protein-confined chromophores were drawn according to their respective references (Table 1), but with the backbone carbons replaced by hydrogen atoms. The protonation states of the chromophores were taken into consideration in the calculation of the quantum chemical descriptors (see Table 3 for the descriptors calculated without taking the protonation state into account, and Table 4 for the descriptors calculated with the protonation state taken into account). The calculated descriptors of the synthetic GFP chromophores are shown in Table 5.

Molecular Descriptor Calculation

The three-dimensional molecular structure of each GFP chromophore served as input for the generation of molecular descriptors by RECON³⁷ and E-DRAGON.³⁸

RECON is based on the construction of a library of precomputed atomic fragments by ab initio calculations. The RECON algorithm³⁹ is based on the theory of atoms in molecules (AIM) developed by Bader et al.,⁴⁰ in which the property of a molecule can be described by its atomic constituents. The generated descriptors, known as transferable atom equivalents (TAE), describe the molecules in terms of electron densities, energies, and properties. Molecular descriptors for molecules of interest are obtained by searching the database of precomputed properties of atomic fragments. Fragments of the molecule are matched to similar fragments in the database, which are then used to reconstruct the molecular property. In spite of the high-throughput nature of the TAE descriptors, their quality was shown to match that of descriptors derived from ab initio calculations, at minimal computational cost.⁴¹ Descriptors derived from RECON have been successfully applied in various QSPR studies.³⁹,⁴²,⁴³

E-DRAGON produced over 1600 molecular descriptors⁴⁴ comprising 20 types, with the number of descriptors of each type shown in parentheses: constitutional descriptors (48), topological descriptors (119), walk and path counts (47), connectivity indices (33), information indices (47), 2D autocorrelations (96), edge adjacency indices (107), BCUT descriptors (64), topological charge indices (21), eigenvalue-based indices (44), Randic molecular profiles (41), geometrical descriptors (74), RDF descriptors (150), 3D-MoRSE descriptors (160), WHIM descriptors (99), GETAWAY descriptors (197), functional group counts (154), atom-centered fragments (120), charge descriptors (14), and molecular properties (31).

Figure 1. Strategy for prediction of the excitation and the emission maxima of GFP.

1276 Nantasenamat et al. • Vol. 28, No. 7 • Journal of Computational Chemistry

Journal of Computational Chemistry DOI 10.1002/jcc

Quantum Chemical Descriptor Calculation

The molecular structure of each GFP chromophore was subjected to a Monte Carlo or systematic conformational search for the lowest-energy geometry using the Merck Molecular Force Field (MMFF), followed by geometry optimization at the Hartree-Fock (HF) level of theory using the 3-21G split-valence basis set (HF/3-21G). Quantum chemical descriptors were calculated at the density functional theory (DFT) level using Becke’s three-parameter Lee-Yang-Parr (B3LYP) functional and the 6-31G* basis set (B3LYP/6-31G*).

The quantum chemical descriptors were calculated using Spartan’04 and include the following: molecular weight (MW), surface area of a space-filling model (CPKArea), total energy (ETotal), energy of the highest occupied molecular orbital (EHOMO), energy of the lowest unoccupied molecular orbital (ELUMO), and dipole moment (μ). In addition, the quantum chemical indices of hardness (η) and electrophilicity (ω) were calculated according to the method summarized by Thanikaivelan et al.⁴⁵ as follows:

$$\eta = \frac{E_{\mathrm{LUMO}} - E_{\mathrm{HOMO}}}{2} \tag{1}$$

$$\omega = \frac{\left[(E_{\mathrm{HOMO}} + E_{\mathrm{LUMO}})/2\right]^{2}}{2\eta} \tag{2}$$

Moreover, the mean absolute atomic charge (Qm) was calculated from the Mulliken population analysis according to the method reviewed by Karelson et al.⁴⁶ as follows:

$$Q_m = \frac{1}{N}\sum_{a=1}^{N} |Q_a| \tag{3}$$

where $|Q_a|$ is the absolute value of the charge on atom $a$ and $N$ is the total number of atoms in the chromophore molecule.
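As a worked check of eqs. (1)–(3), the short Python sketch below recomputes the hardness and electrophilicity from a pair of orbital energies. The function names are illustrative and are not part of Spartan’04. The example energies are the EHOMO and ELUMO values listed for the AWG chromophore in Table 4, and the charges passed to mean_absolute_charge are made-up numbers.

```python
def hardness(e_homo, e_lumo):
    """Eq. (1): eta = (E_LUMO - E_HOMO) / 2."""
    return (e_lumo - e_homo) / 2.0


def electrophilicity(e_homo, e_lumo):
    """Eq. (2): omega = ((E_HOMO + E_LUMO) / 2)**2 / (2 * eta)."""
    midpoint = (e_homo + e_lumo) / 2.0
    return midpoint ** 2 / (2.0 * hardness(e_homo, e_lumo))


def mean_absolute_charge(charges):
    """Eq. (3): average of the absolute atomic charges."""
    return sum(abs(q) for q in charges) / len(charges)


# Orbital energies of the AWG chromophore as listed in Table 4.
e_homo, e_lumo = -4.710, -1.210
print(round(hardness(e_homo, e_lumo), 3))          # 1.75
print(round(electrophilicity(e_homo, e_lumo), 3))  # 2.503

# Hypothetical Mulliken charges, for illustration only.
print(mean_absolute_charge([0.5, -0.5, 0.2, -0.2]))
```

Both printed values agree with the η and ω entries tabulated for AWG in Table 4.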

The calculated descriptors were standardized in order to bring all descriptors to approximately the same scale, with a mean of zero and a standard deviation of one. Standardization was performed according to the following equation:

$$x_{ij}^{\mathrm{stn}} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2 / N}} \tag{4}$$

where $x_{ij}^{\mathrm{stn}}$ is the standardized value, $x_{ij}$ is the value of descriptor $j$ for sample $i$, $\bar{x}_j$ is the mean of descriptor $j$, and $N$ is the sample size of the data set.
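Equation (4) can be sketched in a few lines of Python. This is an illustrative helper, not the software used in the study. It assumes the denominator of eq. (4) is the population standard deviation (dividing by N rather than N − 1), which is what yields a standard deviation of exactly one. The three sample values are the first three EHOMO entries of Table 3.

```python
from math import sqrt


def standardize(column):
    """Scale one descriptor column to zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation: the sum of squares is divided by N.
    sd = sqrt(sum((x - mean) ** 2 for x in column) / n)
    if sd == 0.0:
        raise ValueError("descriptor is constant and cannot be standardized")
    return [(x - mean) / sd for x in column]


e_homo = [-4.710, -5.530, -5.650]  # first three EHOMO values of Table 3
z = standardize(e_homo)
print(z)
print(sum(z) / len(z))                  # mean, zero up to rounding error
print(sum(v * v for v in z) / len(z))   # variance, one up to rounding error
```

Each descriptor column is standardized separately, so descriptors with large magnitudes such as ETotal do not dominate those with small magnitudes such as Qm.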

Generation of Training and Testing Sets

The data set was divided into training and testing sets by the LOO-CV approach, whereby one sample of the data set was left out as the testing set while the remaining samples were used as the training set. The training set was used to construct a predictive model, and a prediction was made on the testing set. This process was repeated until every sample of the data set had been used once as the testing set. Network training was performed with LOO-CV using Weka, version 3.4.5.⁴⁷

Table 1. Data Set of the Chromophores in GFP Color Variants.

No.  Chromophore      Mutant                                                           λexcᵃ  λemᵃ  Ref.
1    AWG              S65A, Y66W, S72A, N146I, M153T, V163A                            434    477   26
2    AYG              S65A                                                             471    504   27
3    CYG              S65C                                                             479    507   27
4    GYG              S65G                                                             487    509   28
5    LYG              S65L                                                             484    510   27
6    S(p-amino-F)G    Y66(p-amino-F), F99S, M153T, V163A                               435    498   29
7    S(p-bromo-F)G    Y66(p-bromo-F), F99S, M153T, V163A                               375    428   29
8    S(p-iodo-F)G     Y66(p-iodo-F), F99S, M153T, V163A                                381    438   29
9    S(p-methoxy-F)G  Y66(p-methoxy-F), F99S, M153T, V163A                             394    460   29
10   SFG              Y66F                                                             360    442   27
11   SHG              Y66H                                                             382    448   27
12   SWG              Y66W                                                             436    485   27
13   SYA              G67A                                                             410    454   30
14   SYG              None or Q80R                                                     395    508   31
15   T(3-fluoro-Y)G   F64L, S65T, Y66(3-fluoro-Y)                                      484    514   13
16   T(4-amino-W)G    W57(4-amino-W), F64L, S65T, Y66(4-amino-W), N146I, M153T, V163A  466    574   32
17   THG              F64L, S65T, Y66H, Y145F                                          380    440   33
18   TWG              F64L, S65T, Y66W                                                 450    494   34
19   TYG              S65T                                                             488    511   27

ᵃ Maxima of the excitation and emission wavelengths (nm).

Figure 2. Chemical structures of the chromophores of the GFP color variants data set (refer to Table 1 for the names of the chromophores).
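The leave-one-out procedure described in this section can be sketched as follows. The study ran LOO-CV inside Weka; the generic fit and predict callables and the toy mean-predictor below are hypothetical stand-ins that only illustrate how each sample takes its turn as the testing set.

```python
def leave_one_out(samples):
    """Yield (training_set, held_out_sample) once for every sample."""
    for i in range(len(samples)):
        yield samples[:i] + samples[i + 1:], samples[i]


def loo_predictions(samples, fit, predict):
    """Return one out-of-sample prediction per sample.

    samples: list of (descriptors, target) pairs
    fit(training_set) -> model; predict(model, descriptors) -> prediction
    """
    predictions = []
    for training_set, (descriptors, _target) in leave_one_out(samples):
        model = fit(training_set)  # the held-out sample is never seen here
        predictions.append(predict(model, descriptors))
    return predictions


# Toy "model": always predict the mean target of the training set.
def fit_mean(training_set):
    return sum(target for _, target in training_set) / len(training_set)


def predict_mean(model, descriptors):
    return model


data = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
print(loo_predictions(data, fit_mean, predict_mean))  # [2.5, 2.0, 1.5]
```

With 19 or 29 samples, as in the two data sets, this amounts to 19 or 29 separate model fits per evaluation.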

Methods of Neural Network Training

Data from the data sets were submitted for network training using the back-propagation neural network implementation of Weka, version 3.4.5. The molecular descriptors were presented to the input layer, from which they were relayed to the nodes of the hidden layer for processing and finally on to the output nodes. Network training using the empirically determined network parameters began with random initialization of the weights between the network nodes. As training proceeded, the weights were adjusted according to the back-propagation of error approach. Briefly, information on the prediction error is propagated backwards from the output layer to the hidden and input layers, and the weights are adjusted according to the prediction error.⁴⁸ Because the initial weights differed from run to run, the predicted outputs varied slightly between runs. The predicted outputs were therefore taken as the average of 10 runs of network training.
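The training scheme described above can be illustrated with a minimal back-propagation network written from scratch. This is a sketch under stated assumptions, not the Weka implementation: it uses one hidden layer of sigmoid nodes, a single linear output node, per-sample weight updates with a momentum term, and arbitrary hyperparameter values. The toy data and all names are hypothetical.

```python
import math
import random


def train_network(data, n_hidden, epochs, lr, momentum, seed):
    """Train a one-hidden-layer network by back-propagation; return a predictor."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Random initial weights; the last weight of every node is its bias.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    # Previous weight changes, needed for the momentum term.
    dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_o = [0.0] * (n_hidden + 1)

    def forward(x):
        xb = list(x) + [1.0]
        h = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, xb)))) for row in w_h]
        hb = h + [1.0]
        return xb, hb, sum(w * v for w, v in zip(w_o, hb))

    for _ in range(epochs):
        for x, target in data:
            xb, hb, out = forward(x)
            err = target - out  # error at the linear output node
            # Propagate the error back to the hidden nodes before changing w_o.
            delta_h = [err * w_o[j] * hb[j] * (1.0 - hb[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                dw_o[j] = lr * err * hb[j] + momentum * dw_o[j]
                w_o[j] += dw_o[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_h[j][i] = lr * delta_h[j] * xb[i] + momentum * dw_h[j][i]
                    w_h[j][i] += dw_h[j][i]

    return lambda x: forward(x)[2]


def averaged_prediction(data, x, runs=10, **params):
    """Average the predictions of several networks that differ only in their random start."""
    predictions = [train_network(data, seed=s, **params)(x) for s in range(runs)]
    return sum(predictions) / runs


# Toy data: one standardized descriptor and one scaled target, for illustration only.
toy = [([-1.0], -1.0), ([0.0], 0.0), ([1.0], 1.0)]
params = dict(n_hidden=3, epochs=1000, lr=0.1, momentum=0.2)
print(averaged_prediction(toy, [0.5], runs=10, **params))
```

Averaging over several seeds mirrors the 10-run averaging described above: each seed gives a different random start and therefore a slightly different prediction.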

Search for Optimal Network Parameters

The optimal parameters for the neural network calculations were found by an empirical trial-and-error approach in which the parameter under investigation was increased incrementally, using the root mean square error (RMS) as a measure of predictive error. The RMS is calculated with the following equation:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n}} \tag{5}$$

where $p_i$ is the predicted output, $a_i$ is the actual output, and $n$ is the number of chromophore molecules in the data set. The parameters searched were the number of nodes in the hidden layer, the number of learning epochs, the learning rate, and the momentum constant.
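A minimal sketch of this incremental search is given below. The rms_error function implements eq. (5). The evaluate callable stands for "train with this parameter value and return the LOO-CV RMS"; here it is replaced by a made-up function (fake_evaluate), so the numbers carry no chemical meaning.

```python
from math import sqrt


def rms_error(predicted, actual):
    """Eq. (5): root mean square error between predictions and actual values."""
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def incremental_search(candidates, evaluate):
    """Evaluate each candidate value in turn and return the one with the lowest RMS."""
    history = [(value, evaluate(value)) for value in candidates]
    best_value, best_rms = min(history, key=lambda item: item[1])
    return best_value, best_rms, history


# Hypothetical stand-in for a full train-and-validate cycle at a given
# number of hidden nodes; a real run would train the network instead.
def fake_evaluate(n_hidden):
    return (n_hidden - 5) ** 2 + 10.0


best, rms, history = incremental_search(range(1, 11), fake_evaluate)
print(best, rms)  # 5 10.0
```

Optimizing one parameter at a time while holding the others fixed keeps the number of training runs manageable, at the cost of ignoring interactions between parameters.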

Multiple Linear Regression

The MLR models were generated with The Unscrambler 9.5⁴⁹ software package to obtain equations of the form:

$$Y = B_0 + \sum_{n} B_n X_n \tag{6}$$

in which $Y$ is the GFP spectral property under investigation (excitation or emission maximum), $B_0$ is the intercept, and $B_n$ are the regression coefficients of the descriptors $X_n$.
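For illustration, the coefficients of eq. (6) can be obtained by ordinary least squares. The sketch below solves the normal equations with a small Gaussian-elimination routine so that it needs no external library. It is not the procedure used by The Unscrambler, and the function names and toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Return least-squares coefficients [B0, B1, ..., Bk] of eq. (6)."""
    design = [[1.0] + list(r) for r in rows]  # the leading 1 gives the intercept B0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coeffs, row):
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


# Toy data generated from Y = 2 + 3*X1 - X2, for illustration only.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [2.0, 5.0, 1.0, 4.0, 7.0]
coeffs = fit_mlr(rows, y)
print([round(c, 6) for c in coeffs])
```

The explicit singularity check matters for descriptor sets like the one here, where some descriptors are computed from others and may be strongly correlated.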

Partial Least Squares Regression

PLS⁵⁰,⁵¹ analysis was performed with The Unscrambler 9.5 software package using the PLS1 algorithm. Each chromophore was described by nine quantum chemical descriptors, and predictions were made on the response variable, which in our case is a GFP spectral property. The descriptors were preprocessed by mean-centering and autoscaling to zero mean and unit variance according to eq. (4). The variables in the descriptor matrix are then reduced to a small number of latent variables known as PLS components (PC), which still retain the main information from the original data set. The PCs are few, orthogonal, and serve as predictors of the response variable.⁵¹ The optimal number of PLS components was determined according to the method of Haaland and Thomas⁵² from a plot of the number of PCs against the mean squared error (MSE) using LOO-CV. The MSE is calculated according to the following equation:

$$\mathrm{MSE} = \frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n} \tag{7}$$

where $p_i$ is the predicted output, $a_i$ is the actual output, and $n$ is the number of chromophore molecules in the data set.

Table 2. Chemical Structures of the Synthetic GFP Chromophore Data Set.

Chromophore  R1         R2           R3        λexcᵃ  λemᵃ
I-1          OH         Me           Me        370    440
I-2          OH         Me           (CH2)3Me  372    436
I-3          OH         Ph           H         399    467
I-4          OH         3,4-diMeOPh  Me        398    476
I-5          H          Ph           H         384    444
I-6          H          Ph           OH        385    443
I-7          H          Ph           Ph        380    453
I-8          H          4-MeOPh      H         387    452
I-9          H          3,4-diMeOPh  H         390    457
I-10         H          3,4-diMeOPh  Me        386    470
I-11         MeO        Ph           H         398    464
I-12         MeO        Ph           Me        394    471
I-13         MeO        Ph           CH2COOEt  389    465
I-14         MeO        Ph           CH2Ph     393    468
I-15         MeO        4-NO2Ph      H         428    573
I-16         OCH2COOEt  Ph           H         397    461
I-17         H, 2-MeO   Ph           H         400    465
I-18         N(Me)2     Me           H         415    483
I-19         N(Me)2     Ph           H         454    520
I-20         N(Me)2     4-MeOPh      H         450    507
I-21         N(Me)2     3,4-diMeOPh  H         455    515
I-22         CF3        Ph           H         383    451
I-23         CN         Ph           H         393    464
I-24         COOMe      Ph           H         394    461
I-25         CN         4-MeOPh      H         401    482
I-26         CN         3,4-diMeOPh  H         405    482
I-27         COOMe      3,4-diMeOPh  H         403    478
I-28         CN         4-NO2Ph      H         406    508
I-29         COOMe      4-NO2Ph      H         405    510

Data from ref. 35.

ᵃ Maxima of the excitation and emission wavelengths (nm).
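The PLS1 procedure and the LOO-CV choice of the number of components described in this section can be sketched as follows. This is a plain NIPALS-style PLS1 written for illustration, not the implementation in The Unscrambler. It only mean-centers the data, so descriptors would need to be autoscaled beforehand as in eq. (4). All names and the toy data are hypothetical.

```python
from math import sqrt


def fit_pls1(X, y, n_components):
    """Fit PLS1 by successive deflation; returns (x_mean, y_mean, components)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered descriptors
    f = [v - y_mean for v in y]                                # centered response
    components = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in the response to explain
        w = [v / norm for v in w]                                             # weight vector
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]         # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]  # X loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt                            # y loading
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]   # deflate X
        f = [f[i] - q * t[i] for i in range(n)]                                # deflate y
        components.append((w, load, q))
    return x_mean, y_mean, components


def predict_pls1(model, x):
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


def loo_mse(X, y, n_components):
    """Eq. (7) evaluated by leave-one-out cross-validation."""
    errors = []
    for i in range(len(X)):
        model = fit_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        errors.append((predict_pls1(model, X[i]) - y[i]) ** 2)
    return sum(errors) / len(errors)


# Toy data generated from y = 1 + 2*x1 - x2, for illustration only.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0, 4.0]
for a in (1, 2):
    print(a, loo_mse(X, y, a))
```

Printing the LOO-CV MSE for each candidate number of components gives the data behind the kind of plot from which the optimal number of components is read.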

Evaluation of Prediction Accuracy

To evaluate the ability of the learning methods to predict the excitation and emission maxima, the root mean square error (RMS) was used as a measure of prediction error. After predictions of the excitation and emission maxima were made, the correlation coefficient r was used to assess the correlation between the predicted and experimental values.
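The correlation coefficient can be computed as in the sketch below, which assumes that r denotes the Pearson correlation coefficient (the text does not name the variant). The experimental values are the excitation maxima of entries 1–4 in Table 1; the predicted values are invented for illustration and are not results from the study.

```python
from math import sqrt


def pearson_r(predicted, actual):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / sqrt(var_p * var_a)


experimental = [434.0, 471.0, 479.0, 487.0]  # entries 1-4 of Table 1 (nm)
predicted = [440.0, 468.0, 475.0, 490.0]     # made-up predictions, for illustration only
print(pearson_r(predicted, experimental))
```

Reporting r together with RMS is useful because r is insensitive to a constant offset or scale error in the predictions, whereas RMS is not.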

Results and Discussion

The data set comprises the chromophores of various GFP color variants and synthetic GFP chromophores. The spectral data of GFP mutants were drawn from the literature by focusing on color variants of GFP. The presence of mutations in the protein environment was assumed to have no significant effect on the spectral properties of the GFP. Therefore, a total of 19 GFP mutants with unique chromophore tripeptides were obtained. The chromophores of these GFP mutants were truncated from the protein backbone by cleavage at the peptide bond connecting to the chromophore, followed by addition of a hydrogen atom to the chromophore to fill the valences. The spectral information of the synthetic GFP chromophores was taken from the work of Follenius-Wund et al.³⁵

Table 3. Calculated Quantum Chemical Descriptors of the Chromophores of GFP Color Variants Without Taking into Account the Protonation State.

Chromophore           MW  CPKArea     ETotal   EHOMO   ELUMO       μ         η      ω     Qm
AWG              298.346  323.050   -989.923  -4.710  -1.210   4.300  -656.487  2.503  0.252
AYG              273.292  296.010   -932.429  -5.530  -1.730   4.690  -614.219  3.468  0.293
CYG              305.358  316.440  -1330.610  -5.650  -1.860   2.680  -823.526  3.720  0.289
GYG              259.265  276.890   -893.112  -5.530  -1.730   4.740  -585.001  3.468  0.302
LYG              315.373  351.710  -1050.370  -5.520  -1.810   2.350  -701.038  3.621  0.268
S(p-amino-F)G    288.307  306.190   -987.778  -5.190  -1.640   3.290  -646.984  3.285  0.305
S(p-bromo-F)G    352.188  312.800   -944.991  -6.080  -2.240   3.110  -628.896  4.507  0.272
S(p-iodo-F)G     399.188  319.950   -943.206  -6.080  -2.260   3.150  -631.578  4.552  0.271
S(p-methoxy-F)G  303.318  322.960  -1046.950  -5.610  -1.860   2.690  -684.953  3.720  0.285
SFG              273.292  292.720   -932.422  -5.990  -2.050   1.690  -612.571  4.102  0.274
SHG              263.257  273.230   -926.395  -5.760  -1.800   4.430  -599.812  3.608  0.319
SWG              314.345  328.870  -1065.140  -4.870  -1.380   3.330  -697.003  2.798  0.261
SYA              303.318  321.440  -1046.950  -5.850  -2.100   6.010  -684.197  4.214  0.293
SYG              289.291  311.280  -1007.630  -5.710  -2.050   2.610  -659.455  4.113  0.296
T(3-fluoro-Y)G   321.308  325.740  -1146.190  -5.820  -2.040   2.130  -735.965  4.086  0.312
T(4-amino-W)G    343.387  356.950  -1159.810  -4.640  -1.400   1.450  -758.382  2.815  0.292
THG              277.284  292.490   -965.713  -5.750  -1.790   4.520  -629.101  3.589  0.316
TWG              328.372  348.100  -1104.450  -4.860  -1.370   3.350  -726.277  2.780  0.263
TYG              303.318  321.110  -1046.960  -5.680  -1.890   2.730  -684.034  3.780  0.301

Table 4. Calculated Quantum Chemical Descriptors of the Chromophores of GFP Color Variants Taking into Account the Protonation State.

Chromophore           MW  CPKArea     ETotal   EHOMO   ELUMO       μ      η      ω     Qm
AWG              298.346  323.050   -989.923  -4.710  -1.210   4.300  1.750  2.503  0.252
AYG              272.284  293.340   -931.891  -1.090   1.920  12.080  1.505  0.057  0.279
CYG              304.350  313.720  -1330.080  -1.230   1.770  10.380  1.500  0.024  0.277
GYG              258.257  274.180   -892.574  -1.080   1.920  11.280  1.500  0.059  0.286
LYG              314.365  350.280  -1049.830  -1.140   1.860  14.190  1.500  0.043  0.258
S(p-amino-F)G    288.307  306.190   -987.778  -5.190  -1.640   3.290  1.775  3.285  0.305
S(p-bromo-F)G    352.188  312.800   -944.991  -6.080  -2.240   3.110  1.920  4.507  0.272
S(p-iodo-F)G     399.188  319.950   -943.206  -6.080  -2.260   3.150  1.910  4.552  0.271
S(p-methoxy-F)G  303.318  322.960  -1046.950  -5.610  -1.860   2.690  1.875  3.720  0.285
SFG              273.292  292.720   -932.422  -5.990  -2.050   1.690  1.970  4.102  0.274
SHG              263.257  273.230   -926.395  -5.760  -1.800   4.430  1.980  3.608  0.319
SWG              314.345  328.870  -1065.140  -4.870  -1.380   3.330  1.745  2.798  0.261
SYA              303.318  321.440  -1046.950  -5.850  -2.100   6.010  1.875  4.214  0.293
SYG              289.291  311.280  -1007.630  -5.710  -2.050   2.610  1.830  4.113  0.296
T(3-fluoro-Y)G   320.300  326.130  -1145.650  -1.280   1.670  11.540  1.475  0.013  0.298
T(4-amino-W)G    343.387  356.950  -1159.810  -4.640  -1.400   1.450  1.620  2.815  0.292
THG              277.284  292.490   -965.713  -5.750  -1.790   4.520  1.980  3.589  0.316
TWG              328.372  348.100  -1104.450  -4.860  -1.370   3.350  1.745  2.780  0.263
TYG              302.310  318.290  -1046.420  -1.260   1.740  12.550  1.500  0.019  0.289

A variety of molecular descriptors are capable of describing molecules of interest in a quantitative structure-property manner. However, the selection of an appropriate type of descriptor depends on the property under investigation. In this study, we are interested in predicting spectral properties of the GFP chromophore. Therefore, the most suitable descriptors should be able to account for the molecular mechanism governing the excitation and emission phenomena. Voityuk et al. explained that the chromophore absorption peak arises from the HOMO-LUMO electronic transition,²⁵ which warrants investigation into the effectiveness of the HOMO and LUMO energies in the prediction of GFP spectral properties. Furthermore, the electrons of the HOMO and LUMO of the chromophore are delocalized over the entire molecule through π orbitals, as shown in Figure 3. Moreover, the HOMO-LUMO electronic transition is accompanied by a charge transfer in the chromophore, in which electron density is withdrawn from the phenol ring and transmitted to the heterocyclic imidazole ring. Thus, modifications made to the chromophore change the distribution of its electron density, thereby directly influencing the spectral properties.
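Because the absorption peak is attributed to the HOMO-LUMO transition, a back-of-the-envelope conversion of the orbital energy gap into a wavelength (λ = hc/ΔE) shows the scale involved. The sketch below assumes that the tabulated orbital energies are in eV, which the tables do not state, and the function name is illustrative. It is a rough consistency check rather than part of the reported method.

```python
HC_EV_NM = 1239.84  # Planck constant times speed of light, in eV·nm


def gap_to_wavelength_nm(e_homo_ev, e_lumo_ev):
    """Wavelength of a photon whose energy equals the HOMO-LUMO gap."""
    gap = e_lumo_ev - e_homo_ev
    if gap <= 0.0:
        raise ValueError("E_LUMO must lie above E_HOMO")
    return HC_EV_NM / gap


# AWG chromophore, orbital energies as listed in Table 4 (assumed to be in eV).
print(round(gap_to_wavelength_nm(-4.710, -1.210), 1))  # 354.2
```

For AWG the raw gap corresponds to about 354 nm, whereas Table 1 lists an excitation maximum of 434 nm. The orbital energy gap by itself is therefore only a rough proxy for the excitation maximum, which is consistent with treating EHOMO and ELUMO as inputs to a fitted model rather than converting them directly.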

To test whether the energies of the highest occupied molecu-

lar orbital (HOMO) and the lowest unoccupied molecular orbital

(LUMO) are relevant and crucial towards the prediction of spec-

tral properties of the GFP, we compared the efficiency of molec-

ular descriptors generated by three softwares, namely RECON,

E-DRAGON, and Spartan’04. The RECON software produces

descriptors accounting for the electron densities, energies, and

properties as described previously. The E-DRAGON software

generates 1600 molecular descriptors consisting of over 20 dif-

ferent types of descriptors as described in Methods. For this

investigation, Spartan’04 was used to generate six molecular

descriptors comprising of three quantum chemical descriptors

(ETotal, EHOMO, and ELUMO), one constitutional descriptor

(MW), one charge descriptor (�), and one geometrical descriptor

(CPK area). Furthermore, three additional molecular descriptors

consisting of two quantum chemical descriptors (� and !) and

one charge descriptor (Qm) were derived from the equations

described in Methods. Table 6 summarizes the performance of

the four data sets using the different molecular descriptors as

input variables. The molecular descriptors generated by RECON

Table 5. Calculated Quantum Chemical Descriptors of the Synthetic GFP Chromophore Data Set.

Chromophore  MW       CPKArea  ETotal     EHOMO   ELUMO   μ      η      ω      Qm
i-1          216.240  244.210  -724.460   -5.490  -1.670  4.110  1.910  3.355  0.291
i-2          258.321  304.900  -842.400   -5.470  -1.650  4.010  1.910  3.318  0.264
i-3          264.284  284.740  -876.880   -5.490  -2.020  4.690  1.735  4.063  0.262
i-4          338.363  362.270  -1145.230  -5.290  -1.780  6.780  1.755  3.560  0.266
i-5          248.285  275.580  -801.670   -5.760  -2.140  3.520  1.810  4.310  0.229
i-6          264.284  283.180  -876.820   -5.840  -2.320  2.180  1.760  4.729  0.232
i-7          324.383  351.690  -1032.710  -5.730  -2.070  3.120  1.830  4.156  0.199
i-8          278.311  305.870  -916.190   -5.550  -1.980  5.240  1.785  3.971  0.247
i-9          308.337  335.640  -1030.700  -5.530  -1.960  6.000  1.785  3.929  0.254
i-10         322.364  353.100  -1070.010  -5.500  -1.900  5.570  1.800  3.803  0.242
i-11         278.311  305.870  -916.190   -5.420  -1.980  4.760  1.720  3.980  0.246
i-12         292.338  323.630  -955.500   -5.380  -1.880  4.480  1.750  3.765  0.232
i-13         366.373  381.630  -1258.590  -5.490  -1.910  3.850  1.790  3.824  0.261
i-14         368.436  403.130  -1186.550  -5.420  -1.860  4.420  1.780  3.722  0.211
i-15         323.308  331.340  -1120.690  -5.770  -3.000  3.430  1.385  6.942  0.271
i-16         352.346  370.790  -1219.270  -5.530  -2.060  7.170  1.735  4.150  0.271
i-17         280.327  308.130  -917.350   -5.640  -2.100  3.190  1.770  4.231  0.235
i-18         229.283  267.160  -743.900   -4.900  -1.410  4.590  1.745  2.852  0.277
i-19         291.354  326.880  -935.640   -4.900  -1.750  5.090  1.575  3.510  0.243
i-20         321.380  357.200  -1050.160  -4.790  -1.600  6.600  1.595  3.200  0.255
i-21         351.406  386.960  -1164.670  -4.780  -1.590  7.290  1.595  3.180  0.261
i-22         316.282  310.640  -1138.700  -6.070  -2.450  5.090  1.810  5.013  0.253
i-23         273.295  296.170  -893.910   -6.180  -2.640  6.800  1.770  5.494  0.245
i-24         306.321  328.350  -1029.550  -5.950  -2.420  5.720  1.765  4.962  0.254
i-25         303.321  326.440  -1008.430  -5.930  -2.490  7.490  1.720  5.152  0.261
i-26         333.347  356.210  -1122.950  -5.900  -2.480  7.960  1.710  5.133  0.267
i-27         364.401  397.860  -1222.630  -5.370  -1.990  6.130  1.690  4.007  0.261
i-28         318.292  321.650  -1098.410  -6.570  -3.380  4.320  1.595  7.759  0.272
i-29         351.318  353.840  -1234.040  -6.340  -3.230  0.240  1.555  7.362  0.277
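The seventh and eighth numeric columns of Table 5 can be cross-checked against the frontier orbital energies. The sketch below assumes that these columns are the chemical hardness and the electrophilicity index in their common finite-difference forms, η = (ELUMO − EHOMO)/2 and ω = χ²/(2η) with χ = −(EHOMO + ELUMO)/2. These definitions and all function names are assumptions made for illustration; they are not quoted from this section. Three rows of Table 5 serve as reference values.

```python
def hardness(e_homo, e_lumo):
    """Chemical hardness as half the HOMO-LUMO gap (assumed definition)."""
    return (e_lumo - e_homo) / 2.0


def electrophilicity(e_homo, e_lumo):
    """Electrophilicity index chi^2 / (2 * eta) (assumed definition)."""
    chi = -(e_homo + e_lumo) / 2.0  # electronegativity, assumed form
    return chi ** 2 / (2.0 * hardness(e_homo, e_lumo))


# (EHOMO, ELUMO, tabulated 7th column, tabulated 8th column) from Table 5
ROWS = {
    "i-1": (-5.490, -1.670, 1.910, 3.355),
    "i-15": (-5.770, -3.000, 1.385, 6.942),
    "i-28": (-6.570, -3.380, 1.595, 7.759),
}


def max_deviation(rows):
    """Largest absolute difference between recomputed and tabulated values."""
    worst = 0.0
    for e_homo, e_lumo, eta_tab, omega_tab in rows.values():
        worst = max(worst, abs(hardness(e_homo, e_lumo) - eta_tab))
        worst = max(worst, abs(electrophilicity(e_homo, e_lumo) - omega_tab))
    return worst


print("largest deviation from Table 5:", max_deviation(ROWS))
```

Under these assumed definitions, the tabulated values for i-1, i-15, and i-28 are reproduced to within rounding of the third decimal place.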


and E-DRAGON were subjected to descriptor reduction according to the unsupervised forward selection (UFS)53 algorithm as described previously.54 Next, the performance of the three descriptor sets was evaluated by ANN using the default network parameters. Of the three, the descriptors from Spartan’04 gave predictive performance superior to that of RECON and E-DRAGON. Therefore, we concluded that the quantum chemical descriptors derived from Spartan’04 were the most suitable for training the artificial neural network.

In artificial neural networks, optimal parameters are not universal for all types of data; rather, they must be determined empirically for each data set. To find the optimal parameters for the neural network calculations, an empirical trial-and-error search was performed by incrementally increasing the value of the parameter under investigation, using RMS as the measure of prediction error. In this study, the first parameter to be optimized was the number of nodes in the hidden layer (Figs. 4a, 4b, 5a, and 5b), followed by the number of learning epochs (Figs. 4c, 4d, 5c, and 5d), and finally the learning rate and momentum constants (Figs. 4e, 4f, 5e, and 5f). In each pair of panels, the first refers to the excitation maxima data set and the second to the emission maxima data set.

The network architecture used in this study comprises three layers: the input layer, the hidden layer, and the output layer (Fig. 6). The nine molecular descriptors (MW, CPKArea, ETotal, EHOMO, ELUMO, μ, η, ω, and Qm) served as inputs, with the excitation and the emission maxima as the outputs. Thus, there are nine nodes in the input layer and two nodes in the output layer. For the hidden layer, the optimal number of nodes was determined by an empirical trial-and-error search over the range of 1–24 nodes. In silico prediction of the excitation and the emission maxima was performed using the empirically determined network parameters, and the average of 10 runs was used for the output variables.
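The procedure described above can be sketched with a small back-propagation network written in NumPy. This is a minimal illustration under stated assumptions, not the software used in the study: the data are synthetic stand-ins (19 samples, nine descriptors), a single output node is used for simplicity, and the function names, learning rate, momentum, and epoch count are hypothetical choices. The sketch scans a few hidden-layer sizes and, for each, averages the predictions of 10 independently initialized runs before computing the RMS.

```python
import numpy as np


def train_mlp(X, y, n_hidden, epochs=1000, lr=0.05, momentum=0.5, seed=0):
    """Train a one-hidden-layer network by back-propagation with momentum.

    Returns a function that maps a descriptor matrix to predictions.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
    b2 = np.zeros(1)
    params = [W1, b1, W2, b2]
    velocity = [np.zeros_like(p) for p in params]
    target = y.reshape(-1, 1)
    for _ in range(epochs):
        hidden = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))  # sigmoid hidden layer
        error = hidden @ W2 + b2 - target              # linear output layer
        delta_h = (error @ W2.T) * hidden * (1.0 - hidden)
        grads = [X.T @ delta_h / len(X), delta_h.mean(axis=0),
                 hidden.T @ error / len(X), error.mean(axis=0)]
        for p, v, g in zip(params, velocity, grads):
            v *= momentum   # keep a fraction of the previous update
            v -= lr * g     # add the new gradient step
            p += v          # in-place weight update

    def predict(X_new):
        hidden = 1.0 / (1.0 + np.exp(-(X_new @ W1 + b1)))
        return (hidden @ W2 + b2).ravel()

    return predict


def rms(pred, obs):
    """Root mean square error between predictions and observations."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))


def scan_hidden_nodes(X, y, candidates, n_runs=10):
    """RMS of the run-averaged prediction for each hidden-layer size."""
    results = {}
    for n_hidden in candidates:
        runs = [train_mlp(X, y, n_hidden, seed=s)(X) for s in range(n_runs)]
        results[n_hidden] = rms(np.mean(runs, axis=0), y)
    return results


# Synthetic stand-in data: 19 samples x 9 descriptors (not the paper's data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(19, 9))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]
y_demo = (y_demo - y_demo.mean()) / y_demo.std()

scan = scan_hidden_nodes(X_demo, y_demo, candidates=[1, 4, 8])
best_hidden = min(scan, key=scan.get)
print("training RMS by hidden-layer size:", scan, "best:", best_hidden)
```

The scan here uses the training RMS only to keep the example short; in practice the cross-validated RMS would be the quantity to minimize.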

Figure 3. Calculated HOMO (a) and LUMO (b) of the neutral form of the chromophore. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

Table 6. Summary of the Predictive Performance as a Function of the Descriptor Used.

Model  Descriptor set  n   rTR    rCV    RMSTR  RMSCV
1      RECON           15  1.000  0.431  0.013  49.104
1      E-DRAGON        17  0.999  0.276  2.023  58.464
1      Spartan’04      9   0.997  0.970  3.620  10.707
2      RECON           15  1.000  0.111  0.247  59.935
2      E-DRAGON        17  0.995  0.392  3.446  38.166
2      Spartan’04      9   0.998  0.922  2.410  14.516
3      RECON           18  0.996  0.803  1.913  13.194
3      E-DRAGON        28  1.000  0.570  0.601  19.411
3      Spartan’04      9   0.993  0.935  2.784  8.049
4      RECON           18  0.997  0.617  2.718  27.223
4      E-DRAGON        28  1.000  0.434  0.115  30.937
4      Spartan’04      9   0.991  0.939  4.111  10.392

n, number of chromophores in the data set; rTR, training correlation coefficient; rCV, cross-validated correlation coefficient; RMSTR, root mean square error for training; RMSCV, root mean square error for leave-one-out cross-validation. Model 1, GFP color variants (excitation maxima) data set; model 2, GFP color variants (emission maxima) data set; model 3, synthetic GFP chromophore (excitation maxima) data set; model 4, synthetic GFP chromophore (emission maxima) data set.

Figure 4. Optimization of neural network parameters for the GFP color variants data set. (a, b) Root mean square error (RMS) as a function of the number of nodes in the hidden layer for the excitation (a) and the emission (b) maxima data sets. (c, d) RMS as a function of the number of learning epochs for the excitation (c) and the emission (d) maxima data sets. (e, f) Contour plots of RMS versus the learning rate and the momentum constants for the excitation (e) and the emission (f) maxima data sets. The red lines represent constant values of the RMS, while the plotted points represent RMS values obtained from the learning procedure, which are fitted onto the same surface model as the contour plot. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

In this study, LOO-CV was used for model validation, in which repeated resampling of the data was carried out as follows. One sample of the data set was left out as the testing set and training was performed on the remaining samples. The predictive model obtained from training was then tested on the sample that was left out. This was repeated until every sample had been left out once for prediction. When the data set is small, partitioning it into training and testing sets is not feasible, as it would leave insufficient training data for the construction of the predictive model. Thus, LOO-CV is appropriate for small data sets as it allows economical use of the data.47 However, there are caveats to consider when using the LOO approach. Firstly, the computational demand of LOO-CV is high since training is performed n times, where n is the sample size, which may not be suitable for large data sets. It has also been noted that LOO does not perturb the data set sufficiently and may result in high-variance estimates and overfitting of the model.55 Furthermore, stratification of the data set is not possible since only one sample at a time is used as the test set. Nevertheless, LOO has been shown to perform, in the worst case, at a level similar to that of the training error estimate,56 and it is known to provide a good estimate of the generalization error,57,58 which in our case is the RMS.
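The leave-one-out loop described above can be written in a few lines. The sketch below is illustrative only: it uses an ordinary least-squares model as a stand-in for the neural network, synthetic data in place of the chromophore descriptors, and hypothetical function names. It returns the cross-validated correlation coefficient and RMS computed from the n held-out predictions.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept (stand-in for any learner)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def loo_cv(X, y, fit, predict):
    """Leave-one-out cross-validation: train on n-1 samples, test on one."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # every sample except the i-th
        model = fit(X[keep], y[keep])
        preds[i] = predict(model, X[i:i + 1])[0]
    r = float(np.corrcoef(preds, y)[0, 1])
    rms = float(np.sqrt(np.mean((preds - y) ** 2)))
    return preds, r, rms


# Synthetic stand-in data: 19 samples, 3 descriptors (not the paper's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(19, 3))
y_demo = (2.0 * X_demo[:, 0] - X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(scale=0.05, size=19))

preds, r_cv, rms_cv = loo_cv(X_demo, y_demo, fit_linear, predict_linear)
print("cross-validated r:", r_cv, "RMS:", rms_cv)
```

Passing `fit` and `predict` as arguments keeps the resampling loop independent of the learner, so the same loop can wrap a neural network, MLR, or PLS model.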

Figure 5. Optimization of neural network parameters for the synthetic GFP chromophore data set. (a, b) Root mean square error (RMS) as a function of the number of nodes in the hidden layer for the excitation (a) and the emission (b) maxima data sets. (c, d) RMS as a function of the number of learning epochs for the excitation (c) and the emission (d) maxima data sets. (e, f) Contour plots of RMS versus the learning rate and the momentum constants for the excitation (e) and the emission (f) maxima data sets. The red lines represent constant values of the RMS, while the plotted points represent RMS values obtained from the learning procedure, which are fitted onto the same surface model as the contour plot. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

To assign the correct protonation states to the chromophores, the absorbance spectra reported in the literature were examined. On the basis of the original articles, the chromophores were categorized as either the protonated or the deprotonated form. Since the major peak at 395 nm of the absorbance spectra is associated with the protonated form of the chromophore, the wild-type GFP with the chromophore tripeptide SYG was assigned the protonated form. Likewise, GFP mutants with minor peaks were associated with the anionic chromophore and so were categorized as deprotonated. Assignment of the protonation state was performed on chromophores harboring the p-hydroxybenzylidene group. Without consideration of the protonation state, prediction of the excitation and the emission maxima gave correlation coefficients and root mean square errors of r = 0.3272, RMS = 57.7310 and r = 0.7209, RMS = 32.1526, respectively (Figs. 7a and 7b). In contrast, accounting for the protonation state gave r = 0.9795, RMS = 8.8237 and r = 0.9067, RMS = 15.7614, respectively (Figs. 7c and 7d). On the basis of these results, the correct protonation state of the chromophore is crucial for accurate prediction of the excitation and the emission maxima of the GFP chromophores studied. Chromophores drawn without taking the protonation states into consideration gave poor prediction accuracy because the descriptors derived from these structures were irrelevant to the spectral properties under investigation. When the chromophore’s protonation state is not taken into account, the predictive model cannot distinguish between the major and minor peaks, because it is given descriptors derived from the same protonation state as inputs.

The evidence suggests that the synthetic GFP chromophores had only one protonation form, as observed in their absorbance spectra. The correlation coefficients and root mean square errors for the excitation and the emission maxima were r = 0.9335, RMS = 9.9095 and r = 0.9626, RMS = 9.7508, respectively (Figs. 7e and 7f).

To determine whether the presence of the β-barrel has any influence on the predictive performance for the spectral properties, we compared the prediction accuracy of the data set derived from the chromophores of GFP color variants with that of the data set comprising the synthetic GFP chromophores. The GFP color variants data set comprises chromophores encapsulated within the protective boundaries of the protein’s β-barrel, which shields them from exposure to neighboring solvent molecules. On the other hand, the synthetic GFP chromophore data set consists of bare chromophores exposed to the quenching effect of solvent molecules. We confirmed the importance of quantum chemical descriptors in calculating the spectral properties regardless of the chromophores’ confinement within the β-barrel or their interaction with neighboring side chains. This is not to say that the hydrogen-bonding network between the chromophore and its immediate vicinity plays no crucial role in the spectral properties; rather, these interactions are important in governing the protonation states of the chromophore, the correct account of which facilitates accurate calculation of the excitation and the emission maxima, as previously mentioned.

For comparison, the traditional modeling approaches in QSPR studies, namely MLR and PLS regression analysis, were used to predict the GFP spectral properties. All descriptors used in the ANN analysis were employed in these calculations. The regression coefficients derived from MLR and PLS are shown in Tables 7 and 8, respectively. Comparisons of the three learning algorithms are shown in Tables 9–11. Although all learning algorithms were capable of predicting the spectral properties with accuracy in the range of 0.8399 ≤ r ≤ 0.9795, ANN was the most consistent, surpassing both MLR and PLS in three of the four models. All learning approaches performed well on model 1, with accuracy in the range of 0.9498 ≤ r ≤ 0.9795. For model 2, PLS performed slightly better than ANN, and both surpassed MLR. ANN outperformed both MLR and PLS for models 3 and 4, suggesting that both models are of a non-linear nature. As shown in Table 10, it is also interesting to note that the performance of PLS, as measured by the LOO-CV r, is positively correlated with the total percent of explained variance s²CV. This indicates that the total variability accounted for by the PLS model is crucial to its predictive performance.
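The comparison above can be checked directly from the cross-validated values reported in Tables 9–11. The short script below, with variable and function names chosen here for illustration, picks the method with the highest cross-validated r for each model and computes the Pearson correlation between the PLS cross-validated r and the cross-validated percent explained variance from Table 10.

```python
import math

# Cross-validated correlation coefficients for models 1-4 (Tables 9-11).
R_CV = {
    "MLR": [0.9498, 0.8412, 0.8399, 0.9430],
    "PLS": [0.9496, 0.9237, 0.8892, 0.9363],
    "ANN": [0.9795, 0.9067, 0.9335, 0.9626],
}

# Total % explained variance for LOO-CV of the PLS models (Table 10).
S2_CV_PLS = [90.331, 86.455, 79.741, 88.493]


def best_method_per_model(r_cv):
    """Name of the method with the highest cross-validated r for each model."""
    n_models = len(next(iter(r_cv.values())))
    return [max(r_cv, key=lambda m: r_cv[m][i]) for i in range(n_models)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


winners = best_method_per_model(R_CV)
pls_r_vs_variance = pearson(R_CV["PLS"], S2_CV_PLS)
print("best method for models 1-4:", winners)
print("PLS r versus explained variance:", pls_r_vs_variance)
```

With the tabulated values, ANN ranks first for models 1, 3, and 4 and PLS for model 2, and the correlation between the PLS cross-validated r and the explained variance is strongly positive.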

For the GFP color variants, the excitation maxima (model 1) were predicted better than the emission maxima (model 2) by all of the learning approaches. This is to be expected, as the excitation property of a chromophore is well correlated with its molecular structure.59 However, the emission phenomenon is more complex, since some of the energy is dissipated as heat to a degree that depends on the molecular structure. This loss of energy before emission causes a red-shift of the wavelength because of the inverse relationship between energy and wavelength described by Planck’s equation:

$$E = h\nu = \frac{hc}{\lambda} \tag{8}$$

where $E$ is the energy, $h$ is Planck’s constant, $\nu$ is the frequency, $c$ is the speed of light, and $\lambda$ is the wavelength.
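As a worked illustration of eq. (8), the sketch below converts the 395 nm absorbance peak mentioned earlier into a photon energy and shows how a loss of energy before emission lengthens the emitted wavelength. The 0.5 eV loss is a hypothetical value chosen only to make the red-shift visible, and the function names are illustrative.

```python
PLANCK_J_S = 6.62607015e-34   # Planck's constant h, in J s
LIGHT_M_S = 299_792_458.0     # speed of light c, in m/s
EV_IN_J = 1.602176634e-19     # one electronvolt in joules


def photon_energy_ev(wavelength_nm):
    """Photon energy E = hc / lambda, returned in electronvolts."""
    return PLANCK_J_S * LIGHT_M_S / (wavelength_nm * 1e-9) / EV_IN_J


def photon_wavelength_nm(energy_ev):
    """Wavelength lambda = hc / E, returned in nanometres."""
    return PLANCK_J_S * LIGHT_M_S / (energy_ev * EV_IN_J) * 1e9


excitation_nm = 395.0
heat_loss_ev = 0.5            # hypothetical energy dissipated as heat

excitation_ev = photon_energy_ev(excitation_nm)
emission_nm = photon_wavelength_nm(excitation_ev - heat_loss_ev)
print("excitation energy (eV):", excitation_ev)
print("emission wavelength after the assumed loss (nm):", emission_nm)
```

Because energy and wavelength are inversely related, any positive energy loss yields an emission wavelength longer than the excitation wavelength.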

Figure 6. Scheme of artificial neural network used in this study. The

network is comprised of three layers: input layer, hidden layer, and

output layer. Signals are propagated in a feed-forward manner from

the input layer through the hidden layer and onto the output layer fol-

lowed by adjustment of the weights according to the prediction error

(see ref. 48 for more detail). Nodes are represented by squares while

weights are represented by arrows. [Color figure can be viewed in the

online issue, which is available at www.interscience.wiley.com.]


Figure 7. Predicting the excitation and the emission maxima of GFP color variants. (a–d) Plots of the predicted versus experimental values of the excitation (a, c) and the emission (b, d) maxima for calculations based on chromophores that do not take the protonation state into account (a, b) and for calculations based on chromophores that take the protonation state into consideration (c, d). (e, f) Plots of the predicted versus experimental values of the excitation (e) and the emission (f) maxima for synthetic GFP chromophores. Samples and regression lines of the leave-one-out cross-validated test set are drawn with one marker type and solid lines; those of the training set are drawn with a second marker type and dotted lines. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]


The performance of ANN on models 3 and 4 is superior to that of MLR and PLS. These results indicate the possible influence of the microenvironment of the chromophores on the performance of the various learning approaches. As discussed previously, the chromophores for models 1 and 2 were derived from various GFP color variants, where they are shielded from solvent molecules within the protective confinement of the β-barrel. On the other hand, the chromophores for models 3 and 4 are solvent exposed and are prone to the collisional quenching effect of oxygen and water molecules.60 Furthermore, alteration in the electron flux2,61 of the microenvironment due to the absence of the hydrogen-bonding network1,62 may additionally influence the spectral properties of these isolated chromophores. For example, the lack of the intricate hydrogen-bonding network may give the chromophore greater conformational freedom, resulting in accelerated internal conversion,35,61 which distorts the coplanarity of the benzyl and imidazole rings as suggested by quantum chemical simulations.63,64 It should be noted that the quantum chemical calculations were performed in vacuo, so possible solvent effects on the chromophore were not accounted for. It is generally known that the native GFP chromophore resides in a fairly protective microenvironment, and the interior dielectric constant is assumed to be low (ε = 2–4).65–68 Since the difference between the dielectric constant of the chromophore microenvironment and that of vacuum (ε = 1) is small, we assume that it did not affect the predictive performance. This is supported by the strong performance of all learning methods on model 1, where r is in excess of 0.94. However, the opposite holds for the isolated chromophores, since they are solvent exposed and the dielectric constant (ε = 80) is significantly different from that of vacuum. This is not to say that the descriptors are inadequate for describing the spectral phenomenon; rather, they may not be suitable for linear regression analysis. Since ANN is known to be suitable in cases where the mechanistic understanding is limited,69 it is capable of modeling the spectral properties without knowledge of the solvent effect. Moreover, the predictive performance for the excitation maxima of the synthetic chromophores (model 3) is lower than that for the emission maxima (model 4), as opposed to what was observed for models 1 and 2. This contradicts the assumption that the phenomenon of excitation is straightforward and should therefore be predicted more accurately. Perhaps the excitation phenomenon of isolated chromophores is less straightforward than previously thought, as their microenvironments greatly influence their spectral properties.

Table 7. Regression Coefficients of MLR Models.

Descriptor  Model 1     Model 2     Model 3     Model 4
MW          9.240E-01   -1.093E+01  4.344E+01   3.858E+01
CPKArea     -3.820E-01  5.067E+00   3.165E+01   2.266E+01
ETotal      -1.169E+06  1.295E+06   3.850E+05   -3.469E+05
EHOMO       9.719E+05   -1.077E+06  -4.302E+05  3.877E+05
ELUMO       1.868E+01   -3.060E+01  2.736E+00   4.754E+00
μ           -2.811E+00  1.712E+01   -1.408E+01  -1.558E+01
η           -2.113E+05  2.340E+05   1.986E+05   -1.790E+05
ω           -6.160E+01  2.340E+01   -2.138E+01  7.946E+01
Qm          1.355E+00   1.247E+01   2.926E+00   -4.780E-01

Model 1, GFP color variants (excitation maxima) data set, Y-intercept is 431.104; model 2, GFP color variants (emission maxima) data set, Y-intercept is 484.264; model 3, synthetic GFP chromophore (excitation maxima) data set, Y-intercept is 400.458; model 4, synthetic GFP chromophore (emission maxima) data set, Y-intercept is 474.712.

Table 8. Regression Coefficients of PLS Models.

Descriptor  Model 1     Model 2     Model 3     Model 4
MW          -2.241E+00  -7.401E+00  -3.700E-01  2.880E-01
CPKArea     3.638E+00   7.964E+00   1.854E+00   6.530E-01
ETotal      6.257E+00   7.515E+00   9.421E+00   8.755E+00
EHOMO       6.790E-01   -1.600E-01  1.570E-01   -2.770E+00
ELUMO       -8.815E+00  -3.058E+01  3.537E+00   -7.600E-01
μ           1.107E+01   1.991E+01   7.010E-01   9.890E-01
η           -3.151E+01  -4.233E+01  -1.792E+01  -2.297E+01
ω           -1.315E+01  -5.987E+00  7.050E-01   5.192E+00
Qm          5.271E+00   1.510E+01   -1.293E+00  3.547E+00

Model 1, GFP color variants (excitation maxima) data set, Y-intercept is 431.105; model 2, GFP color variants (emission maxima) data set, Y-intercept is 484.263; model 3, synthetic GFP chromophore (excitation maxima) data set, Y-intercept is 400.483; model 4, synthetic GFP chromophore (emission maxima) data set, Y-intercept is 474.690.

Table 9. Performance of MLR Models.

Model  n   rTR     rCV     RMSTR   RMSCV
1      19  0.9910  0.9498  5.8537  13.9642
2      19  0.9770  0.8412  7.6775  21.0530
3      29  0.9480  0.8399  6.7854  13.0558
4      29  0.9775  0.9430  6.1122  9.7858

n, number of chromophores in the data set; rTR, training correlation coefficient; rCV, cross-validated correlation coefficient; RMSTR, root mean square error for training; RMSCV, root mean square error for leave-one-out cross-validation. Model 1, GFP color variants (excitation maxima) data set; model 2, GFP color variants (emission maxima) data set; model 3, synthetic GFP chromophore (excitation maxima) data set; model 4, synthetic GFP chromophore (emission maxima) data set.

Table 10. Performance of PLS Models.

Model  n   rTR     rCV     NPC  s²TR    s²CV    RMSTR   RMSCV
1      19  0.9780  0.9496  5    95.655  90.331  9.1145  14.3518
2      19  0.9696  0.9237  5    94.016  86.455  8.8030  13.9798
3      29  0.9425  0.8892  4    88.825  79.741  7.1232  9.9334
4      29  0.9603  0.9363  3    92.216  88.493  8.0876  10.1841

n, number of chromophores in the data set; rTR, training correlation coefficient; rCV, cross-validated correlation coefficient; NPC, number of PLS components; s²TR, total % explained variance for training; s²CV, total % explained variance for leave-one-out cross-validation; RMSTR, root mean square error for training; RMSCV, root mean square error for leave-one-out cross-validation. Model 1, GFP color variants (excitation maxima) data set; model 2, GFP color variants (emission maxima) data set; model 3, synthetic GFP chromophore (excitation maxima) data set; model 4, synthetic GFP chromophore (emission maxima) data set.

Conclusion

We have demonstrated a novel computational method that supports the rational design of GFP color variants by allowing the effects of chromophore mutations on the spectral properties to be studied. The excitation and emission maxima of the GFP chromophores predicted by the back-propagation neural network were found to be in good agreement with the experimental values. Our results indicate that accounting for the protonation state of the chromophores used to generate the quantum chemical descriptors is necessary to obtain spectral predictions that are representative of the experimentally determined values. It was also shown that the proposed methodology performed well on the spectral predictions regardless of the confinement of the chromophore. Of the three learning methods used in this study, ANN was found to outperform the traditional linear regression methods, namely MLR and PLS. Overall, the proposed strategy facilitates an in silico approach to the design of novel GFP color variants as well as synthetic GFP chromophores, which can then be validated by experimental studies. Furthermore, the approach has broad implications, as it could be applied to the prediction of the spectral properties of other fluorescent compounds or proteins.

References

1. Ormo, M.; Cubitt, A. B.; Kallio, K.; Gross, L. A.; Tsien, R. Y.; Remington, S. J. Science 1996, 273, 1392.

2. Tsien, R. Y. Annu Rev Biochem 1998, 67, 509.

3. Prasher, D. C.; Eckenrode, V. K.; Ward, W. W.; Prendergast, F. G.;

Cormier, M. J. Gene 1992, 111, 229.

4. Chalfie, M.; Tu, Y.; Euskirchen, G.; Ward, W. W.; Prasher, D. C.

Science 1994, 263, 802.

5. Sun, Y.; Wong, M. D.; Rosen, B. P. J Biol Chem 2001, 276, 14955.

6. Lippincott-Schwartz, J.; Snapp, E.; Kenworthy, A. Nat Rev Mol Cell

Biol 2001, 2, 444.

7. Hink, M. A.; Bisselin, T.; Visser, A. J. Plant Mol Biol 2002, 50, 871.

8. Prachayasittikul, V.; Isarankura Na Ayudhya, C.; Boonpangrak, S.;

Galla, H.-J. J Membr Biol 2004, 200, 47.

9. Isarankura Na Ayudhya, C.; Prachayasittikul, V.; Galla, H.-J. Eur

Biophys J 2004, 33, 522.

10. Prachayasittikul, V.; Isarankura Na Ayudhya, C.; Tantimongcolwat,

T.; Galla, H.-J. Biochem Biophys Res Commun 2005, 326, 298.

11. Prachayasittikul, V.; Isarankura Na Ayudhya, C.; Bulow, L. Biotech-

nol Lett 2001, 23, 1285.

12. Kostov, Y.; Albano, C. R.; Rao, G. Biotechnol Bioeng 2000, 70, 473.

13. Bae, J. H.; Paramita Pal, P.; Moroder, L.; Huber, R.; Budisa, N.

Chembiochem 2004, 5, 720.

14. Bae, J. H.; Rubini, M.; Jung, G.; Wiegand, G.; Seifert, M. H. J.;

Azim, M. K.; Kim, J.-S.; Zumbusch, A.; Holak, T. A.; Moroder, L.;

Huber, R.; Budisa, N. J Mol Biol 2003, 328, 1071.

15. Heim, R.; Prasher, D. C.; Tsien, R. Y. Proc Natl Acad Sci USA

1994, 91, 12501.

16. Das, A. K.; Hasegawa, J. Y.; Miyahara, T.; Ehara, M.; Nakatsuji, H.

J Comput Chem 2003, 24, 1421.

17. Patnaik, S. S.; Trohalaki, S.; Pachter, R. Biopolymers 2004, 75, 441.

18. Tozzini, V.; Nifosi, R. J Phys Chem B 2001, 105, 5797.

19. Yoo, H.-Y.; Boatz, J. A.; Helms, V.; McCammon, J. A.; Langhoff,

P. W. J Phys Chem B 2001, 105, 2850.

20. Donnelly, M.; Fedeles, F.; Wirstam, M.; Siegbahn, P. E.; Zimmer,

M. J Am Chem Soc 2001, 123, 4679.

21. Altoe’, P.; Bernardi, F.; Garavelli, M.; Orlandi, G.; Negri, F. J Am

Chem Soc 2005, 127, 3952.

22. Martin, M. E.; Negri, F.; Olivucci, M. J Am Chem Soc 2004, 126, 5452.

23. Voityuk, A. A.; Kummer, A. D.; Michel-Beyerle, M.-E.; Rosch, N.

Chem Phys 2001, 269, 83.

24. Laino, T.; Nifosi, R.; Tozzini, V. Chem Phys 2004, 298, 17.

25. Voityuk, A. A.; Michel-Beyerle, M.-E.; Rosch, N. Chem Phys 1998,

231, 13.

26. Patterson, G.; Day, R. N.; Piston, D. J Cell Sci 2001, 114, 837.

27. Cubitt, A. B.; Heim, R.; Adams, S. R.; Boyd, A. E.; Gross, L. A.;

Tsien, R. Y. Trends Biochem Sci 1995, 20, 448.

28. Jung, G.; Wiehler, J.; Zumbusch, A. Biophys J 2005, 88, 1932.

29. Wang, L.; Xie, J.; Deniz, A. A.; Schultz, P. G. J Org Chem 2003,

68, 174.

30. Sniegowski, J. A.; Phail, M. E.; Wachter, R. M. Biochem Biophys

Res Commun 2005, 332, 657.

31. Elsliger, M. A.; Wachter, R. M.; Hanson, G. T.; Kallio, K.; Remington,

S. J. Biochemistry 1999, 38, 5296.

32. Gery, S.; Koeffler, H. P. J Mol Biol 2003, 328, 977.

33. Yang, T. T.; Sina, P.; Green, G.; Kittis, P. A.; Chen, Y. T.;

Lybarger, L.; Chervenak, R.; Patterson, G. H.; Piston, D. W.; Kain,

S. R. J Biol Chem 1998, 273, 8212.

34. Sawano, A.; Miyawaki, A. Nucleic Acids Res 2000, 28, e78.

35. Follenius-Wund, A.; Bourotte, M.; Schmitt, M.; Iyice, F.; Lami, H.;

Bourguignon, J.-J.; Haiech, J.; Pigault, C. Biophys J 2003, 85, 1839.

36. Spartan’04. Wavefunction: Irvine, CA.

37. RECON, Version 5.5. Rensselaer Polytechnic Institute: Troy, New

York. Available at http://www.chem.rpi.edu/chemweb/recondoc.

38. E-DRAGON, Version 1.0. Virtual Computational Chemistry Labora-

tory. Available at http://www.vcclab.org.

39. Breneman, C. M.; Rhem, M. J Comput Chem 1997, 18, 182.

Table 11. Performance of ANN Models.

Model  n   rTR     rCV     NANN    RMSTR   RMSCV
1      19  0.9953  0.9795  9-1-1   4.3639  8.8237
2      19  0.9919  0.9067  9-1-1   4.5785  15.7614
3      29  0.9924  0.9335  9-8-1   2.7500  9.9095
4      29  0.9840  0.9626  9-32-1  6.1641  9.7508

n, number of chromophores in the data set; rTR, training correlation coefficient; rCV, cross-validated correlation coefficient; NANN, number of nodes in the input, hidden, and output layers of the ANN; RMSTR, root mean square error for training; RMSCV, root mean square error for leave-one-out cross-validation. Model 1, GFP color variants (excitation maxima) data set; model 2, GFP color variants (emission maxima) data set; model 3, synthetic GFP chromophore (excitation maxima) data set; model 4, synthetic GFP chromophore (emission maxima) data set.


40. Bader, R. F. W.; Anderson, S. G.; Duke, A. J. J Am Chem Soc

1979, 101, 1389.

41. Whitehead, C. E.; Breneman, C. M.; Sukumar, N.; Ryan, M. D. J

Comput Chem 2003, 24, 512.

42. Tugcu, N.; Song, M.; Breneman, C. M.; Sukumar, N.; Bennett, K.

P.; Cramer, S. M. Anal Chem 2003, 75, 3563.

43. Nantasenamat, C.; Naenna, T.; Isarankura Na Ayudhya, C.; Pra-

chayasittikul, V. J Comput Aided Mol Des 2005, 19, 509.

44. Todeschini, R.; Consonni, V.; Mannhold, R.; Kubinyi, H.; Timmer-

man, H. Handbook of Molecular Descriptors; Wiley-VCH: Wein-

heim, 2000.

45. Thanikaivelan, P.; Subramanian, V.; Raghava Rao, J.; Unni Nair, B.

Chem Phys Lett 2000, 323, 59.

46. Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Chem Rev 1996, 96, 1027.

47. Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools

and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, 2005.

48. Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug

Design, 2nd ed.; Wiley-VCH: Weinheim, 1999.

49. The Unscrambler, version 9.5. Camo Process AS: Norway.

50. Geladi, P.; Kowalski, B. R. Anal Chim Acta 1986, 185, 1.

51. Wold, S.; Sjostrom, M.; Eriksson, L. Chemometr Intell Lab 2001,

58, 109.

52. Haaland, D. M.; Thomas, E. V. Anal Chem 1988, 60, 1193.

53. UFS, version 1.8. University of Portsmouth, UK. Available at http://

www.port.ac.uk/research/cmd/software.

54. Whitley, D. C.; Ford, M. G.; Livingstone, D. J. J Chem Inf Comput

Sci 2000, 40, 1160.

55. Tibshirani, R. J.; Efron, B. Stat Appl Genet Mol Biol 2002, 1, 1.

56. Kearns, M.; Ron, D. Neural Comput 1999, 11, 1427.

57. Mason, L.; Bartlett, P.; Baxter, J. Technical Report; Department of

Systems Engineering, Australian National University, 1998. Avail-

able at http://citeseer.ist.psu.edu/mason98direct.html.

58. Mason, L.; Bartlett, P. L.; Baxter, J. Mach Learn 1999, 38, 243.

59. Valeur, B. Molecular Fluorescence: Principles and Applications;

Wiley-VCH: Weinheim, 2001.

60. Prendergast, F. G. Methods Cell Biol 1999, 58, 1.

61. Kummer, A.; Kompa, C.; Lossau, H.; Pollinger-Dammer, F.;

Michel-Beyerle, M. E.; Silva, C. M.; Bylina, E.; Coleman, W.;

Yang, M.; Youvan, D. Chem Phys 1998, 237, 183.

62. Zimmer, M. Chem Rev 2002, 102, 759.

63. Voityuk, A. A.; Michel-Beyerle, M. E.; Rosch, N. Chem Phys Lett

1998, 296, 269.

64. Weber, W.; Helms, V.; McCammon, J. A.; Langhoff, P. W. Proc

Natl Acad Sci USA 1999, 96, 6177.

65. Takashima, S.; Schwan, H. J Phys Chem 1965, 69, 4176.

66. Gilson, M. K.; Honig, B. H. Biopolymers 1986, 25, 2097.

67. Nakamura, H.; Sakamoto, T.; Wada, A. Protein Eng 1988, 2,

177.

68. Simonson, T.; Brooks, C. L. J Am Chem Soc 1996, 118, 8452.

69. Almeida, J. S. Curr Opin Biotechnol 2002, 13, 72.
