Post on 10-May-2023
Prediction of GFP Spectral Properties Using Artificial
Neural Network
CHANIN NANTASENAMAT,1 CHARTCHALERM ISARANKURA-NA-AYUDHYA,1 NATTA TANSILA,1
THANAKORN NAENNA,2 VIRAPONG PRACHAYASITTIKUL1
1Department of Clinical Microbiology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
2Department of Industrial Engineering, Faculty of Engineering, Mahidol University, Nakhon Pathom 73170, Thailand
Received 12 September 2005; Accepted 22 February 2006
DOI 10.1002/jcc.20656
Published online 13 February 2007 in Wiley InterScience (www.interscience.wiley.com).
Abstract: The prediction of the excitation and the emission maxima of green fluorescent protein (GFP)
chromophores was investigated by a quantitative structure-property relationship study. A data set of
19 GFP color variants and an additional data set consisting of 29 synthetic GFP chromophores were
collected from the literature. An artificial neural network implementing the back-propagation
algorithm was employed. The proposed computational approach reliably predicted the excitation and
the emission maxima of GFP chromophores with correlation coefficients exceeding 0.9. The usefulness
of quantum chemical descriptors was revealed by a comparative study with other molecular descriptors.
Assignment of the appropriate protonation state of the chromophore for the GFP color variants data
set was shown to be necessary for good predictive performance. Results suggest that the confinement
of the GFP chromophore has no significant influence on the predictive performance for the data set
used. A comparative investigation with traditional modeling methods, particularly multiple linear
regression and partial least squares, reveals that the artificial neural network is the most suitable
modeling approach for the GFP spectral properties. It is anticipated that this methodology has great
potential in accelerating the design and engineering of novel GFP color variants of scientific or
industrial interest.
© 2007 Wiley Periodicals, Inc. J Comput Chem 28: 1275–1289, 2007
Key words: green fluorescent protein; chromophore; absorbance; fluorescence; neural network; partial least squares;
quantitative structure-property relationship
Introduction
Green fluorescent protein (GFP) is an autofluorescent protein of
238 amino acid residues that is isolated from the outer dermal
layer of the Pacific Northwest jellyfish, Aequorea victoria. GFP takes the shape of a β-barrel
made of 11 β-sheets and is 24 and 42 Å in diameter and length, respectively. The chromophore
resides in the central α-helix and is protected within the confinement of the β-barrel.1 No
cofactors or substrates are needed for GFP to mature and fluoresce; however, the chromophore
undergoes a series of post-translational modifications. Once mature,
GFP is resistant to a variety of destructive forces such as tem-
perature and pH.2 The popularity of GFP began in 1992 when
Prasher et al. cloned and sequenced the cDNA of Aequorea
GFP.3 Shortly after, Chalfie et al. demonstrated that it was possi-
ble for GFP to be expressed in a variety of cell types.4 Many
experiments have shown that GFP is amenable to fusion at either the N- or C-terminus without
interfering with the function of either GFP or its fusion partner. Owing to its high stability
and flexibility, GFP has been employed as a reporter for gene expression,5 protein localization,6
protein–protein interaction,7 protein–lipid interaction,8,9 structural and behavioral
determination of macromolecules,10 and as an analytical sensor.11,12
The GFP chromophore, p-hydroxybenzylideneimidazolinone,
is formed by the cyclization step involving the nucleophilic attack
of the amide of Gly67 on the carbonyl of Ser65 to form an imida-
zolinone ring.2 This is followed by an oxygenation step, which is
time-dependent and takes approximately 2–4 h for the 1,2-dehydrogenation of Tyr66. The native GFP
possesses two bands of excitation maxima, a major peak at 395 nm and a minor peak at 475 nm. The
presence of the two absorption peaks can be attributed to the different protonation states of the
GFP chromophore. The protonated (neutral) and deprotonated (anionic) forms of the chromophore
absorb at 395 and 475 nm, respectively.2
Contract/grant sponsor: Royal Golden Jubilee Ph.D. Scholarship, Thailand Research Fund
Contract/grant sponsor: Thailand Toray Science Foundation
Contract/grant sponsor: Mahidol University; contract/grant number: 02012053-0003
Correspondence to: V. Prachayasittikul; e-mail: mtvpr@mahidol.ac.th
Much interest has been geared toward the engineering of novel
color variants13–15 of the GFP in light of its wide applicability in
the life sciences. These color variants are made possible by muta-
tions on the tripeptide chromophore. On the other hand, modifica-
tions made to the amino acid residues in the immediate vicinity of
the chromophore control its protonation state, which directly influ-
ences the magnitude of the major and minor absorbance peaks.
Several theoretical studies have been reported on various
aspects of the GFP chromophore, particularly the protonation
state,16–19 cyclization,20 solvent effects,21 fluorescence mecha-
nism,22 as well as prediction of the absorbance spectra23,24 and
excitation maxima of some GFP mutants.25 However, there has been no report of a computational
approach to predicting the emission maxima, nor has there been a comprehensive investigation on
calculating the spectral properties of a series of GFP color variants and synthetic GFP
chromophores.
We report herein a quantitative structure-property relationship
(QSPR) study of the quantum chemical descriptors calculated
from GFP chromophores and the quantitative prediction of their
excitation and emission maxima via artificial neural networks
(ANN) (Fig. 1 for an overview), multiple linear regression
(MLR), and partial least squares (PLS). The data sets were
derived from two sources: chromophores from GFP color var-
iants and synthetic GFP chromophores. The descriptors were
obtained from single point calculation (B3LYP/6-31G*) using
geometrically optimized structure (HF/3-21G) of the lowest
energy conformer derived from Monte Carlo or systematic con-
formational search. Predictions for the excitation and the emis-
sion maxima were obtained from back-propagation neural net-
work calculations using optimal parameters acquired from an ex-
haustive empirical search. All the predictive models were
validated by leave-one-out cross-validation (LOO-CV). We
investigated the importance of quantum chemical descriptors, the significance of the
chromophore’s protonation state, and the influence of the protective β-barrel, which
encapsulates the chromophore, on the predictive performance for spectral properties. To
our knowledge, this is the first report on the use of QSPR in pre-
dicting spectral properties of the GFP color variants and the syn-
thetic GFP chromophores.
Computational Details
Data Collection
The excitation and the emission maxima of 19 GFP mutants
were collected from the literature (Table 1 for individual refer-
ence). Spectral data for 29 synthetic GFP chromophores were
taken from the reported work of Follenius-Wund et al.35 The ini-
tial geometries of the protein-confined chromophores (Fig. 2)
and synthetic GFP chromophores (Table 2) were constructed
using the molecule building module of Spartan’04.36 The three-
dimensional structures of the synthetic GFP chromophores were
built according to the work of Follenius-Wund et al. Likewise,
the molecular structures of the chromophore were drawn accord-
ing to their respective reference (Table 1) but with the backbone
carbons replaced by hydrogen atoms. The protonation states of the chromophore were taken into
consideration in the calculation of the quantum chemical descriptors (see Table 3 for the
descriptors calculated without taking the protonation state into account and Table 4 for the
descriptors calculated with the protonation state taken into account). The calculated descriptors
of the synthetic GFP chromophores are shown in Table 5.
Molecular Descriptor Calculation
The three-dimensional molecular structure of each GFP chromophore served as input for the
generation of molecular descriptors by RECON37 and E-DRAGON.38
RECON is based on the construction of a library of precom-
puted atomic fragments by ab initio calculations. The RECON
algorithm39 is based on the theory of atoms in molecules (AIM)
developed by Bader et al.40 in which the property of a molecule
could be described by its atomic constituents. The generated
descriptors, known as transferable atom equivalent (TAE),
describe the molecules in terms of the electron densities, ener-
gies, and properties. Molecular descriptors for molecules of in-
terest are obtained by searching from the database of precom-
puted property of atomic fragments. Atomic fragments bearing
similarities to those found in the database are utilized for the
reconstruction of the molecular property. In spite of the high-
throughput nature of the TAE descriptors, the quality was shown
to match those derived from ab initio calculations at minimal
computational cost.41 In fact, descriptors derived from RECON
have been successfully applied in various QSPR studies.39,42,43
E-DRAGON produced over 1600 molecular descriptors44 comprising 20 types, with the number of
descriptors of each type shown in parentheses: constitutional descriptors (48), topological
descriptors (119), walk and path counts (47), connectivity indices (33), information indices (47),
2D autocorrelations (96), edge adjacency indices (107), BCUT descriptors (64), topological charge
indices (21), eigenvalue-based indices (44), Randić molecular profiles (41), geometrical
descriptors (74), RDF descriptors (150), 3D-MoRSE descriptors (160), WHIM descriptors (99),
GETAWAY descriptors (197), functional group counts (154), atom-centered fragments (120), charge
descriptors (14), and molecular properties (31).
Figure 1. Strategy for prediction of the excitation and the emission maxima of GFP.
1276 Nantasenamat et al. • Vol. 28, No. 7 • Journal of Computational Chemistry
Journal of Computational Chemistry DOI 10.1002/jcc
Quantum Chemical Descriptor Calculation
The molecular structure of each GFP chromophore was subjected to a Monte Carlo or systematic
conformational search for the lowest energy geometry using the Merck Molecular Force Field (MMFF),
followed by geometry optimization at the Hartree-Fock (HF) level of theory using the 3-21G
split-valence basis set (HF/3-21G). Quantum chemical descriptors were calculated at the density
functional theory (DFT) level using Becke’s three-parameter Lee-Yang-Parr (B3LYP) functional and
the 6-31G* basis set (B3LYP/6-31G*).
The quantum chemical descriptors were calculated using Spartan’04 and include the following:
molecular weight (MW), surface area of a space-filling model (CPKArea), total energy (ETotal),
energy of the highest occupied molecular orbital (EHOMO), energy of the lowest unoccupied
molecular orbital (ELUMO), and dipole moment (μ). In addition, the quantum chemical indices of
hardness (η) and electrophilicity (ω) were calculated according to the method summarized by
Thanikaivelan et al.45 as follows:
\[ \eta = \frac{E_{\mathrm{LUMO}} - E_{\mathrm{HOMO}}}{2} \tag{1} \]
\[ \omega = \frac{\left[ (E_{\mathrm{HOMO}} + E_{\mathrm{LUMO}})/2 \right]^{2}}{2\eta} \tag{2} \]
Moreover, the mean absolute atomic charge (Qm) was calculated according to the method reviewed
by Karelson et al.46 from the Mulliken population analysis as follows:
\[ Q_m = \frac{1}{N} \sum_{a=1}^{N} |Q_a| \tag{3} \]
where $|Q_a|$ represents the absolute value of the charge on atom $a$ and $N$ represents the total
number of atoms present in the chromophore molecule.
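As an illustration of eqs. (1)–(3), the following minimal Python sketch computes the hardness, electrophilicity, and mean absolute atomic charge from orbital energies and atomic charges. The function names are hypothetical and are not part of Spartan’04; the example input is the EHOMO/ELUMO pair listed for the AWG chromophore in Table 4.

```python
def hardness(e_homo, e_lumo):
    """Hardness, eq. (1): half of the HOMO-LUMO energy gap."""
    return (e_lumo - e_homo) / 2.0


def electrophilicity(e_homo, e_lumo):
    """Electrophilicity, eq. (2): squared mean orbital energy over twice the hardness."""
    mean_energy = (e_homo + e_lumo) / 2.0
    return mean_energy ** 2 / (2.0 * hardness(e_homo, e_lumo))


def mean_abs_charge(charges):
    """Mean absolute atomic charge, eq. (3), from a list of atomic charges."""
    return sum(abs(q) for q in charges) / len(charges)


# AWG chromophore, Table 4: EHOMO = -4.710, ELUMO = -1.210
eta = hardness(-4.710, -1.210)
omega = electrophilicity(-4.710, -1.210)
print(round(eta, 3), round(omega, 3))  # 1.75 2.503
```

The printed values agree with the η and ω entries tabulated for AWG in Table 4.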
The calculated descriptors were standardized in order to bring all descriptors to approximately
the same scale, with a mean of zero and a standard deviation of one. Standardization was performed
according to the following equation:
\[ x_{ij}^{\mathrm{stn}} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2 / N}} \tag{4} \]
where $x_{ij}^{\mathrm{stn}}$ represents the standardized value, $x_{ij}$ represents the value of
descriptor $j$ for sample $i$, $\bar{x}_j$ represents the mean of descriptor $j$,
and $N$ represents the sample size of the data set.
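The standardization step can be sketched in a few lines of Python. This is a minimal sketch that assumes the population standard deviation (division by N) as the scaling factor, which is what a unit standard deviation implies; the function name `standardize` is hypothetical.

```python
import math


def standardize(column):
    """Rescale one descriptor column to zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation (divides by N) -- an assumption of this sketch.
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


# EHOMO values of the first five chromophores listed in Table 3
z = standardize([-4.710, -5.530, -5.650, -5.530, -5.520])
```

Each descriptor column is treated independently, so descriptors with very different magnitudes (for example ETotal and Qm) end up on a comparable scale.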
Table 1. Dataset of the Chromophore in GFP Color Variants.
No.  Chromophore       Mutant                                                             λexc(a)  λem(a)  Reference
1    AWG               S65A, Y66W, S72A, N146I, M153T, V163A                              434      477     26
2    AYG               S65A                                                               471      504     27
3    CYG               S65C                                                               479      507     27
4    GYG               S65G                                                               487      509     28
5    LYG               S65L                                                               484      510     27
6    S(p-amino-F)G     Y66(p-amino-F), F99S, M153T, V163A                                 435      498     29
7    S(p-bromo-F)G     Y66(p-bromo-F), F99S, M153T, V163A                                 375      428     29
8    S(p-iodo-F)G      Y66(p-iodo-F), F99S, M153T, V163A                                  381      438     29
9    S(p-methoxy-F)G   Y66(p-methoxy-F), F99S, M153T, V163A                               394      460     29
10   SFG               Y66F                                                               360      442     27
11   SHG               Y66H                                                               382      448     27
12   SWG               Y66W                                                               436      485     27
13   SYA               G67A                                                               410      454     30
14   SYG               None or Q80R                                                       395      508     31
15   T(3-fluoro-Y)G    F64L, S65T, Y66(3-fluoro-Y)                                        484      514     13
16   T(4-amino-W)G     W57(4-amino-W), F64L, S65T, Y66(4-amino-W), N146I, M153T, V163A    466      574     32
17   THG               F64L, S65T, Y66H, Y145F                                            380      440     33
18   TWG               F64L, S65T, Y66W                                                   450      494     34
19   TYG               S65T                                                               488      511     27
(a) Both are the maxima of the excitation and the emission wavelength (nm).
Figure 2. Chemical structures of the chromophores of GFP color variants data set (refer to Table 1
for the name of the chromophore).
Generation of Training and Testing Sets
The data set was divided into training and testing sets by the LOO-CV approach, whereby one sample
of the data set was left out as the testing set while the remaining samples were used as the
training set. The training set was used to construct a predictive model, and predictions were made
on the testing set. This process was repeated until every sample of the data set had been used as
the testing set. Network training was performed with LOO-CV using Weka, version 3.4.5.47
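The leave-one-out procedure described above can be sketched as follows. This is an illustrative Python outline, not the Weka implementation; `leave_one_out`, `loo_predictions`, and the trivial `fit_mean` model are hypothetical names introduced only for this example, and the four response values are the excitation maxima of the first four entries of Table 1.

```python
def leave_one_out(n_samples):
    """Yield (training indices, testing index) so that each sample is tested once."""
    for test_idx in range(n_samples):
        train_idx = [i for i in range(n_samples) if i != test_idx]
        yield train_idx, test_idx


def loo_predictions(X, y, fit):
    """Collect one prediction per sample; fit(X_train, y_train) returns predict(x)."""
    predictions = []
    for train_idx, test_idx in leave_one_out(len(X)):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions.append(model(X[test_idx]))
    return predictions


def fit_mean(X_train, y_train):
    """Placeholder model (assumption): always predicts the training-set mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


X_demo = [[0.0], [1.0], [2.0], [3.0]]      # dummy descriptors
y_demo = [434.0, 471.0, 479.0, 487.0]      # excitation maxima (nm), Table 1 rows 1-4
preds = loo_predictions(X_demo, y_demo, fit_mean)
```

Any learner can be substituted for `fit_mean` as long as it returns a prediction function, which is how the same loop serves the neural network, MLR, and PLS models.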
Methods of Neural Network Training
Data from the data sets were submitted for network training
using the back-propagation neural network implementation of
Weka, version 3.4.5. The molecular descriptors were presented
to the input layer where they were relayed to the nodes of the
hidden layer for processing and finally onto the output nodes.
Network training using the empirically determined network parameters began with random
initialization of the weights between the network nodes. As training proceeded, the weights were
adjusted according to the back-propagation of error approach. Briefly, information on the
prediction error is propagated backwards from the output layer to the hidden and input layers,
and the weights are then adjusted according to the prediction error.48 Because each run starts
from a different random initialization of the weights, the predicted outputs vary slightly between
runs. The reported predictions were therefore derived from the average of 10 runs of network
training.
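The training scheme outlined above can be illustrated with a small, self-contained Python sketch of a three-layer network trained by online back-propagation with momentum, with predictions averaged over runs that differ only in their random initial weights. This is not the Weka implementation: the sigmoid hidden nodes, the linear output node, the parameter values, and the toy data are all assumptions made for illustration.

```python
import math
import random


def train_network(X, y, n_hidden=3, epochs=500, lr=0.1, momentum=0.2, seed=0):
    """Train a one-hidden-layer network by online back-propagation; return predict(x)."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Each node stores its input weights followed by a bias weight (last entry).
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_o = [0.0] * (n_hidden + 1)

    def forward(x):
        xb = list(x) + [1.0]
        h = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(ws, xb)))) for ws in w_h]
        out = sum(w * v for w, v in zip(w_o, h + [1.0]))
        return xb, h, out

    for _ in range(epochs):
        for x, target in zip(X, y):
            xb, h, out = forward(x)
            err = target - out
            # Propagate the error backwards using the output weights before updating them.
            delta_h = [err * w_o[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            hb = h + [1.0]
            for j in range(n_hidden + 1):
                dw_o[j] = lr * err * hb[j] + momentum * dw_o[j]
                w_o[j] += dw_o[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_h[j][i] = lr * delta_h[j] * xb[i] + momentum * dw_h[j][i]
                    w_h[j][i] += dw_h[j][i]
    return lambda x: forward(x)[2]


def averaged_prediction(X, y, x_new, n_runs=10):
    """Average the outputs of n_runs networks with different random initial weights."""
    preds = [train_network(X, y, seed=s)(x_new) for s in range(n_runs)]
    return sum(preds) / n_runs


# Toy data (assumption): a single scaled descriptor and a linear response.
X_toy = [[0.0], [0.25], [0.5], [0.75], [1.0]]
y_toy = [1.0, 1.5, 2.0, 2.5, 3.0]
p_avg = averaged_prediction(X_toy, y_toy, [0.5])
```

Averaging over seeds smooths out the run-to-run variation caused by the random starting weights; in practice the descriptors would be standardized first and the response scaled to a convenient range.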
Search for Optimal Network Parameters
The optimal parameters for the neural network calculations were sought by an empirical
trial-and-error approach in which the parameter under investigation was increased incrementally,
using the root mean square error (RMS) as the measure of predictive error. RMS is calculated with
the following equation:
\[ \mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n}} \tag{5} \]
where $p_i$ represents the predicted output, $a_i$ represents the actual output, and $n$
represents the number of chromophore molecules present in the data set. The parameters searched
were the number of nodes in the hidden layer, the number of learning epochs, the learning rate,
and the momentum constant.
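A minimal sketch of the RMS measure of eq. (5) and of an incremental parameter search is given below. The helper names are hypothetical, and `toy_rms_for_hidden_nodes` is a made-up stand-in for "train the network with this many hidden nodes and return its cross-validated RMS"; it does not reproduce any result of this study.

```python
import math


def rms_error(predicted, actual):
    """Root mean square error between predicted and actual outputs, eq. (5)."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def incremental_search(candidates, evaluate):
    """Evaluate every candidate value in turn and return (best value, lowest RMS)."""
    scores = [(evaluate(value), value) for value in candidates]
    best_rms, best_value = min(scores)
    return best_value, best_rms


def toy_rms_for_hidden_nodes(n_nodes):
    """Hypothetical error curve (assumption) with its minimum at 7 hidden nodes."""
    return 10.0 + abs(n_nodes - 7)


# Search the number of hidden nodes over the range 1-24.
best_nodes, best_rms = incremental_search(range(1, 25), toy_rms_for_hidden_nodes)
```

The same search routine can be reused for the number of learning epochs, the learning rate, and the momentum by changing the candidate values and the evaluation function.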
Multiple Linear Regression
The MLR models were generated by The Unscrambler 9.5 software package49 to obtain equations of
the form:
\[ Y = B_0 + \sum_{n} B_n X_n \tag{6} \]
in which $Y$ is the GFP spectral property under investigation (excitation or emission maxima),
$B_0$ is the intercept, and $B_n$ are the regression coefficients of descriptors $X_n$.
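For illustration, the coefficients of eq. (6) can be obtained by ordinary least squares. The sketch below solves the normal equations in plain Python; it is not The Unscrambler's implementation, the name `fit_mlr` is hypothetical, and the toy data were constructed to satisfy Y = 1 + 2·X1 − X2 exactly.

```python
def fit_mlr(X, y):
    """Ordinary least squares for Y = B0 + sum(Bn * Xn); returns [B0, B1, ..., Bk]."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones for the intercept
    k = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


# Toy data (assumption): two descriptors, response generated as 1 + 2*X1 - X2.
X_mlr = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0]]
y_mlr = [1.0, 4.0, 2.0, 6.0, 7.0]
coefficients = fit_mlr(X_mlr, y_mlr)  # approximately [1.0, 2.0, -1.0]
```

The normal equations become ill-conditioned when descriptors are strongly correlated, which is one motivation for latent-variable methods such as PLS.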
Table 2. Chemical Structures of the Synthetic GFP Chromophore Data Set.
Chromophore  R1            R2            R3           λexc(a)  λem(a)
I-1          OH            Me            Me           370      440
I-2          OH            Me            (CH2)3Me     372      436
I-3          OH            Ph            H            399      467
I-4          OH            3,4-diMeOPh   Me           398      476
I-5          H             Ph            H            384      444
I-6          H             Ph            OH           385      443
I-7          H             Ph            Ph           380      453
I-8          H             4-MeOPh       H            387      452
I-9          H             3,4-diMeOPh   H            390      457
I-10         H             3,4-diMeOPh   Me           386      470
I-11         MeO           Ph            H            398      464
I-12         MeO           Ph            Me           394      471
I-13         MeO           Ph            CH2COOEt     389      465
I-14         MeO           Ph            CH2Ph        393      468
I-15         MeO           4-NO2Ph       H            428      573
I-16         OCH2COOEt     Ph            H            397      461
I-17         H, 2-MeO      Ph            H            400      465
I-18         N(Me)2        Me            H            415      483
I-19         N(Me)2        Ph            H            454      520
I-20         N(Me)2        4-MeOPh       H            450      507
I-21         N(Me)2        3,4-diMeOPh   H            455      515
I-22         CF3           Ph            H            383      451
I-23         CN            Ph            H            393      464
I-24         COOMe         Ph            H            394      461
I-25         CN            4-MeOPh       H            401      482
I-26         CN            3,4-diMeOPh   H            405      482
I-27         COOMe         3,4-diMeOPh   H            403      478
I-28         CN            4-NO2Ph       H            406      508
I-29         COOMe         4-NO2Ph       H            405      510
Data from ref. 35.
(a) Both are the maxima of excitation and emission wavelength (nm).
Partial Least Squares Regression
PLS50,51 analysis was performed with The Unscrambler 9.5 software package using the PLS1
algorithm. Each chromophore was described by nine quantum chemical descriptors, and predictions
were made on the response variable, which in our case is a GFP spectral property. The descriptors
were preprocessed by mean-centering and autoscaling to zero mean and unit variance according to
eq. (4). The number of variables in the descriptor matrix is then reduced to a small number of
latent variables known as PLS components (PC), which still retain the main information from the
original data set. The PCs are few, orthogonal, and serve as predictors of the response
variable.51 The optimal number of PLS components was determined according to the method of
Haaland and Thomas52 from a plot of PC against mean squared error (MSE) using LOO-CV. MSE is
calculated according to the following equation:
\[ \mathrm{MSE} = \frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n} \tag{7} \]
where $p_i$ represents the predicted output, $a_i$ represents the actual output, and $n$
represents the number of chromophore molecules present in the data set.
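To make the idea of PLS components concrete, the sketch below implements a textbook PLS1 fit (single response, successive deflation of the centered descriptor matrix) together with the MSE of eq. (7). It is a generic illustration and not The Unscrambler's implementation; the function names and the toy data are assumptions, and the criterion of Haaland and Thomas for choosing the number of components is not reproduced here.

```python
def pls1_fit(X, y, n_components):
    """Fit a PLS1 model with the given number of components; return predict(x)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X residual
    f = [v - y_mean for v in y]                                # centered y residual
    W, P, C = [], [], []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]   # weights ~ X'y
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]   # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # X loadings
        c = sum(f[i] * t[i] for i in range(n)) / tt                     # y loading
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - c * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        C.append(c)

    def predict(x):
        e = [x[j] - x_mean[j] for j in range(m)]
        y_hat = y_mean
        for w, p, c in zip(W, P, C):
            t_new = sum(e[j] * w[j] for j in range(m))
            y_hat += c * t_new
            e = [e[j] - t_new * p[j] for j in range(m)]
        return y_hat

    return predict


def mse(predicted, actual):
    """Mean squared error, eq. (7)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Toy data (assumption): two descriptors, response generated as 1 + 2*X1 - X2.
X_pls = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0]]
y_pls = [1.0, 4.0, 2.0, 6.0, 7.0]
mse_by_components = {k: mse([pls1_fit(X_pls, y_pls, k)(x) for x in X_pls], y_pls)
                     for k in (1, 2)}
```

In an actual analysis the MSE for each number of components would be computed from LOO-CV predictions rather than from the training fit shown here, and the descriptors would be autoscaled beforehand.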
Evaluation of Prediction Accuracy
To evaluate the ability of the learning methods to predict the ex-
citation and the emission maxima, root mean square error
(RMS) was used as a measure of prediction error. After predic-
tions of the excitation and the emission maxima were made, the
correlation coefficient r was used to assess the correlation
between the predicted and experimental values.
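The correlation coefficient between predicted and experimental values can be computed as in the following minimal sketch, which assumes the usual Pearson definition of r. The function name is hypothetical; the experimental values are the excitation maxima of the first four entries of Table 1, while the predicted values are invented purely for illustration.

```python
import math


def pearson_r(predicted, actual):
    """Pearson correlation coefficient between predicted and experimental values."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


experimental = [434.0, 471.0, 479.0, 487.0]   # excitation maxima (nm), Table 1 rows 1-4
hypothetical = [440.0, 465.0, 483.0, 480.0]   # made-up predictions, illustration only
r = pearson_r(hypothetical, experimental)
```

Because r measures only linear association, it is reported alongside RMS, which reflects the absolute size of the prediction errors.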
Results and Discussion
The data set comprises the chromophores of various GFP color variants and synthetic GFP
chromophores. The spectral data of GFP mutants were drawn from the literature by focusing on
color variants of GFP. The presence of mutations in the protein environment was assumed to have
no significant effect on the spectral properties of the GFP. Therefore, a total of 19 GFP mutants
with unique chromophore tripeptides were obtained. The chromophores of these GFP mutants were
truncated from the protein backbone by cleavage at the peptide bond connecting to the
chromophore, followed by addition of a hydrogen atom to the chromophore to fill the valences.
The spectral information of synthetic GFP chromophores was taken from the work of
Follenius-Wund et al.35
Table 3. Calculated Quantum Chemical Descriptors of the Chromophores of GFP Color Variants Without
Taking into Account the Protonation State.
Chromophore      MW       CPKArea  ETotal     EHOMO   ELUMO   μ       η         ω      Qm
AWG              298.346  323.050   −989.923  −4.710  −1.210   4.300  −656.487  2.503  0.252
AYG              273.292  296.010   −932.429  −5.530  −1.730   4.690  −614.219  3.468  0.293
CYG              305.358  316.440  −1330.610  −5.650  −1.860   2.680  −823.526  3.720  0.289
GYG              259.265  276.890   −893.112  −5.530  −1.730   4.740  −585.001  3.468  0.302
LYG              315.373  351.710  −1050.370  −5.520  −1.810   2.350  −701.038  3.621  0.268
S(p-amino-F)G    288.307  306.190   −987.778  −5.190  −1.640   3.290  −646.984  3.285  0.305
S(p-bromo-F)G    352.188  312.800   −944.991  −6.080  −2.240   3.110  −628.896  4.507  0.272
S(p-iodo-F)G     399.188  319.950   −943.206  −6.080  −2.260   3.150  −631.578  4.552  0.271
S(p-methoxy-F)G  303.318  322.960  −1046.950  −5.610  −1.860   2.690  −684.953  3.720  0.285
SFG              273.292  292.720   −932.422  −5.990  −2.050   1.690  −612.571  4.102  0.274
SHG              263.257  273.230   −926.395  −5.760  −1.800   4.430  −599.812  3.608  0.319
SWG              314.345  328.870  −1065.140  −4.870  −1.380   3.330  −697.003  2.798  0.261
SYA              303.318  321.440  −1046.950  −5.850  −2.100   6.010  −684.197  4.214  0.293
SYG              289.291  311.280  −1007.630  −5.710  −2.050   2.610  −659.455  4.113  0.296
T(3-fluoro-Y)G   321.308  325.740  −1146.190  −5.820  −2.040   2.130  −735.965  4.086  0.312
T(4-amino-W)G    343.387  356.950  −1159.810  −4.640  −1.400   1.450  −758.382  2.815  0.292
THG              277.284  292.490   −965.713  −5.750  −1.790   4.520  −629.101  3.589  0.316
TWG              328.372  348.100  −1104.450  −4.860  −1.370   3.350  −726.277  2.780  0.263
TYG              303.318  321.110  −1046.960  −5.680  −1.890   2.730  −684.034  3.780  0.301
Table 4. Calculated Quantum Chemical Descriptors of the Chromophores of GFP Color Variants Taking into
Account the Protonation State.
Chromophore      MW       CPKArea  ETotal     EHOMO   ELUMO   μ       η      ω      Qm
AWG              298.346  323.050   −989.923  −4.710  −1.210   4.300  1.750  2.503  0.252
AYG              272.284  293.340   −931.891  −1.090   1.920  12.080  1.505  0.057  0.279
CYG              304.350  313.720  −1330.080  −1.230   1.770  10.380  1.500  0.024  0.277
GYG              258.257  274.180   −892.574  −1.080   1.920  11.280  1.500  0.059  0.286
LYG              314.365  350.280  −1049.830  −1.140   1.860  14.190  1.500  0.043  0.258
S(p-amino-F)G    288.307  306.190   −987.778  −5.190  −1.640   3.290  1.775  3.285  0.305
S(p-bromo-F)G    352.188  312.800   −944.991  −6.080  −2.240   3.110  1.920  4.507  0.272
S(p-iodo-F)G     399.188  319.950   −943.206  −6.080  −2.260   3.150  1.910  4.552  0.271
S(p-methoxy-F)G  303.318  322.960  −1046.950  −5.610  −1.860   2.690  1.875  3.720  0.285
SFG              273.292  292.720   −932.422  −5.990  −2.050   1.690  1.970  4.102  0.274
SHG              263.257  273.230   −926.395  −5.760  −1.800   4.430  1.980  3.608  0.319
SWG              314.345  328.870  −1065.140  −4.870  −1.380   3.330  1.745  2.798  0.261
SYA              303.318  321.440  −1046.950  −5.850  −2.100   6.010  1.875  4.214  0.293
SYG              289.291  311.280  −1007.630  −5.710  −2.050   2.610  1.830  4.113  0.296
T(3-fluoro-Y)G   320.300  326.130  −1145.650  −1.280   1.670  11.540  1.475  0.013  0.298
T(4-amino-W)G    343.387  356.950  −1159.810  −4.640  −1.400   1.450  1.620  2.815  0.292
THG              277.284  292.490   −965.713  −5.750  −1.790   4.520  1.980  3.589  0.316
TWG              328.372  348.100  −1104.450  −4.860  −1.370   3.350  1.745  2.780  0.263
TYG              302.310  318.290  −1046.420  −1.260   1.740  12.550  1.500  0.019  0.289
There are a variety of molecular descriptors to use that are ca-
pable of describing molecules of interest in a quantitative struc-
ture-property manner. However, the selection of appropriate type
of descriptors depends on the property under investigation. In this
study, we are interested in predicting spectral properties of the
GFP chromophore. Therefore, the most suitable descriptors should
be able to account for the molecular mechanism governing the ex-
citation and the emission phenomena. Voityuk et al. explained that
the chromophore absorption peak arises from the HOMO-LUMO
electronic transition,25 which warrants investigation into its effec-
tiveness in the prediction of GFP spectral properties. Furthermore,
the electrons of the HOMO and LUMO of the chromophore are delocalized over the entire molecule
through π orbitals, as shown in Figure 3. Moreover, the HOMO-LUMO electronic transition is
accompanied by a charge transfer in the chromophore where the electron density is withdrawn from
the phenol ring and transmitted to the heterocyclic imidazole ring. Thus, modifications made to
the chromophore change the distribution of its electron density, thereby directly influencing
the spectral properties.
To test whether the energies of the highest occupied molecular orbital (HOMO) and the lowest
unoccupied molecular orbital (LUMO) are relevant and crucial to the prediction of spectral
properties of the GFP, we compared the efficiency of molecular descriptors generated by three
software packages, namely RECON, E-DRAGON, and Spartan’04. The RECON software produces
descriptors accounting for the electron densities, energies, and properties as described
previously. The E-DRAGON software generates over 1600 molecular descriptors of 20 different types
as described in Methods. For this investigation, Spartan’04 was used to generate six molecular
descriptors comprising three quantum chemical descriptors (ETotal, EHOMO, and ELUMO), one
constitutional descriptor (MW), one charge descriptor (μ), and one geometrical descriptor
(CPK area). Furthermore, three additional molecular descriptors consisting of two quantum
chemical descriptors (η and ω) and one charge descriptor (Qm) were derived from the equations
described in Methods. Table 6 summarizes the performance of the four data sets using the
different molecular descriptors as input variables. The molecular descriptors generated by RECON
and E-DRAGON were subjected to descriptor reduction according to the unsupervised forward
selection (UFS)53 algorithm as described previously.54 Next, the performance of the three
descriptor sets was evaluated by ANN using the default network parameters. Of all the descriptors
generated, those from Spartan’04 gave superior predictive performance over those of RECON and
E-DRAGON. Therefore, we concluded that the quantum chemical descriptors derived from Spartan’04
were most suitable for training the artificial neural network.
Table 5. Calculated Quantum Chemical Descriptors of the Synthetic GFP Chromophore Data Set.
Chromophore  MW       CPKArea  ETotal     EHOMO   ELUMO   μ      η      ω      Qm
i-1          216.240  244.210   −724.460  −5.490  −1.670  4.110  1.910  3.355  0.291
i-2          258.321  304.900   −842.400  −5.470  −1.650  4.010  1.910  3.318  0.264
i-3          264.284  284.740   −876.880  −5.490  −2.020  4.690  1.735  4.063  0.262
i-4          338.363  362.270  −1145.230  −5.290  −1.780  6.780  1.755  3.560  0.266
i-5          248.285  275.580   −801.670  −5.760  −2.140  3.520  1.810  4.310  0.229
i-6          264.284  283.180   −876.820  −5.840  −2.320  2.180  1.760  4.729  0.232
i-7          324.383  351.690  −1032.710  −5.730  −2.070  3.120  1.830  4.156  0.199
i-8          278.311  305.870   −916.190  −5.550  −1.980  5.240  1.785  3.971  0.247
i-9          308.337  335.640  −1030.700  −5.530  −1.960  6.000  1.785  3.929  0.254
i-10         322.364  353.100  −1070.010  −5.500  −1.900  5.570  1.800  3.803  0.242
i-11         278.311  305.870   −916.190  −5.420  −1.980  4.760  1.720  3.980  0.246
i-12         292.338  323.630   −955.500  −5.380  −1.880  4.480  1.750  3.765  0.232
i-13         366.373  381.630  −1258.590  −5.490  −1.910  3.850  1.790  3.824  0.261
i-14         368.436  403.130  −1186.550  −5.420  −1.860  4.420  1.780  3.722  0.211
i-15         323.308  331.340  −1120.690  −5.770  −3.000  3.430  1.385  6.942  0.271
i-16         352.346  370.790  −1219.270  −5.530  −2.060  7.170  1.735  4.150  0.271
i-17         280.327  308.130   −917.350  −5.640  −2.100  3.190  1.770  4.231  0.235
i-18         229.283  267.160   −743.900  −4.900  −1.410  4.590  1.745  2.852  0.277
i-19         291.354  326.880   −935.640  −4.900  −1.750  5.090  1.575  3.510  0.243
i-20         321.380  357.200  −1050.160  −4.790  −1.600  6.600  1.595  3.200  0.255
i-21         351.406  386.960  −1164.670  −4.780  −1.590  7.290  1.595  3.180  0.261
i-22         316.282  310.640  −1138.700  −6.070  −2.450  5.090  1.810  5.013  0.253
i-23         273.295  296.170   −893.910  −6.180  −2.640  6.800  1.770  5.494  0.245
i-24         306.321  328.350  −1029.550  −5.950  −2.420  5.720  1.765  4.962  0.254
i-25         303.321  326.440  −1008.430  −5.930  −2.490  7.490  1.720  5.152  0.261
i-26         333.347  356.210  −1122.950  −5.900  −2.480  7.960  1.710  5.133  0.267
i-27         364.401  397.860  −1222.630  −5.370  −1.990  6.130  1.690  4.007  0.261
i-28         318.292  321.650  −1098.410  −6.570  −3.380  4.320  1.595  7.759  0.272
i-29         351.318  353.840  −1234.040  −6.340  −3.230  0.240  1.555  7.362  0.277
In artificial neural networks, optimal parameters are not uni-
versal for all types of data; rather, optimal parameters are
obtained by an exhaustive search. To find the optimal parame-
ters for neural network calculations, an empirical trial-and-error
search was performed using incremental increase in the value of
the parameter under investigation and using RMS as the measure
of prediction error. In this study, the first parameter to be opti-
mized is the number of nodes in the hidden layer (Figs. 4a, 4b,
5a, and 5b), followed by the number of learning epochs (Figs.
4c, 4d, 5c, and 5d), and finally the learning rate and momentum
constants (Figs. 4e, 4f, 5e, and 5f) for the excitation and the
emission maxima data set, respectively.
The network architecture used in this study comprises three layers: the input layer, the hidden
layer, and the output layer (Fig. 6). Data of the nine molecular descriptors (MW, CPKArea,
ETotal, EHOMO, ELUMO, μ, η, ω, and Qm) served as inputs with the excitation and the emission
maxima as the outputs. Thus, there are nine nodes in the input layer and two nodes in the output
layer. As for the hidden layer, the optimal number of nodes was determined by an empirical
trial-and-error search over the range of 1–24 nodes. In silico prediction of the excitation and
the emission maxima was performed using the empirically determined network parameters, and the
average of 10 runs was used for the output variables.
In this study, LOO-CV was used for model validation in
which repeated resampling of the data was carried out as fol-
lows. One sample of the data set was left out as the testing set
and training was performed on the remaining samples. The pre-
dictive model that was obtained from training was then tested
on the sample that was left out. This is carried out reiteratively
until all samples were left out for prediction. When the data set
is small, partitioning it into training and testing set is not feasi-
ble as it would result in insufficient training data to be used for
the construction of the predictive model. Thus, the use of LOO-CV
is appropriate for small data sets as it allows the economical
use of the data.47 However, there are potential cautions that one
should consider when using the LOO approach. Firstly, the
computational demand of LOO-CV is high since training is performed
n times, where n is the sample size, which may not be suitable for
large data sets. It has also been mentioned that LOO does not
perturb the data set sufficiently and may result in high variance
estimates and overfitting of the model.55 Furthermore,
stratification of the data set is not possible since only one
sample at a time is used as the test set. Nevertheless, LOO has
been shown to perform at a level similar to that of the training
error estimate in the worst case scenario,56 and it is known to
provide a good estimate of the generalization error,57,58 which
in our case is the RMS.

Figure 3. Calculated HOMO (a) and LUMO (b) of the neutral form
of the chromophore. [Color figure can be viewed in the online issue,
which is available at www.interscience.wiley.com.]

Table 6. Summary of the Predictive Performance as a Function of the Descriptor Used.
Model   RECON                                 E-DRAGON                              Spartan’04
        n    rTR    rCV    RMSTR   RMSCV      n    rTR    rCV    RMSTR   RMSCV      n    rTR    rCV    RMSTR   RMSCV
1       15   1.000  0.431  0.013   49.104     17   0.999  0.276  2.023   58.464     9    0.997  0.970  3.620   10.707
2       15   1.000  0.111  0.247   59.935     17   0.995  0.392  3.446   38.166     9    0.998  0.922  2.410   14.516
3       18   0.996  0.803  1.913   13.194     28   1.000  0.570  0.601   19.411     9    0.993  0.935  2.784   8.049
4       18   0.997  0.617  2.718   27.223     28   1.000  0.434  0.115   30.937     9    0.991  0.939  4.111   10.392
n, number of chromophores in the data set; rTR, training correlation
coefficient; rCV, cross-validated correlation coefficient; RMSTR, root
mean square error for training; RMSCV, root mean square error for
leave-one-out cross-validation. Model 1, GFP color variants (excitation
maxima) data set; model 2, GFP color variants (emission maxima) data
set; model 3, synthetic GFP chromophore (excitation maxima) data set;
model 4, synthetic GFP chromophore (emission maxima) data set.

1282 Nantasenamat et al. • Vol. 28, No. 7 • Journal of Computational Chemistry
Journal of Computational Chemistry DOI 10.1002/jcc

Figure 4. Optimization of neural network parameters of the GFP color variants data set. (a, b) Root
mean square error (RMS) as a function of the number of nodes in the hidden layer for the excitation
(a) and the emission (b) maxima data set. (c, d) RMS as a function of the number of learning epochs
for the excitation (c) and the emission (d) maxima data set. (e, f) Contour plot of RMS versus the
learning rate and the momentum constants for the excitation (e) and the emission (f) maxima data set.
The red lines represent constant value of the RMS, while n represent RMS values obtained from the
learning procedure, which is fitted onto the same surface model of the contour plot. [Color figure can
be viewed in the online issue, which is available at www.interscience.wiley.com.]
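The leave-one-out procedure discussed above can be sketched as follows. This is a minimal illustration and not the implementation used in the study: an ordinary least-squares model stands in for the learner, the descriptor matrix is synthetic, and the function names (fit_linear, predict_linear, loo_cv) are hypothetical. It shows why LOO-CV needs n training runs for n samples and how the cross-validated r and RMS follow from the held-out predictions.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; a stand-in for any learner."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def loo_cv(X, y, fit=fit_linear, predict=predict_linear):
    """Leave-one-out CV: train n times, each time holding out one sample."""
    n = len(y)
    y_pred = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i            # mask that excludes sample i
        model = fit(X[train], y[train])
        y_pred[i] = predict(model, X[i:i + 1])[0]
    r_cv = np.corrcoef(y, y_pred)[0, 1]      # cross-validated correlation
    rms_cv = np.sqrt(np.mean((y - y_pred) ** 2))
    return r_cv, rms_cv, y_pred


# Synthetic example (illustrative values only, not data from the study)
rng = np.random.default_rng(0)
X = rng.normal(size=(19, 3))                 # 19 samples, 3 made-up descriptors
y = 450 + X @ np.array([20.0, -10.0, 5.0]) + rng.normal(scale=2.0, size=19)
r_cv, rms_cv, y_pred = loo_cv(X, y)
print(f"r_CV = {r_cv:.3f}, RMS_CV = {rms_cv:.3f}")
```

Any learner can be substituted by passing different fit and predict callables; the cost grows linearly with the number of samples because the model is refitted once per held-out sample.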
Figure 5. Optimization of neural network parameters of the synthetic GFP chromophore data set. (a,
b) Root mean square error (RMS) as a function of the number of nodes in the hidden layer for the
excitation (a) and the emission (b) maxima data set. (c, d) RMS as a function of the number of learning
epochs for the excitation (c) and the emission (d) maxima data set. (e, f) Contour plot of RMS versus
the learning rate and the momentum constants for the excitation (e) and the emission (f) maxima data
set. The red lines represent constant value of the RMS, while n represent RMS values obtained from
the learning procedure, which is fitted onto the same surface model of the contour plot. [Color figure
can be viewed in the online issue, which is available at www.interscience.wiley.com.]

To assign the correct protonation states to the chromophores,
an investigation of the absorbance spectra in the literature was
carried out. On the basis of the original articles, the
chromophores were categorized as either the protonated or deprotonated
form. Since the major peak at 395 nm of the absorbance spectra
is associated with the protonated form of the chromophore, the
wild-type GFP with the chromophore tripeptide SYG was
assigned the protonated form. Likewise, GFP mutants with
minor peaks were associated with the anionic chromophore and
so were categorized as deprotonated. Assignment of the correct
protonation state was performed on chromophores harboring the
p-hydroxybenzylidene. The excitation and the emission maxima
of chromophores without consideration of the protonation state
gave correlation coefficient and root mean square error of
r = 0.3272, RMS = 57.7310 and r = 0.7209, RMS = 32.1526,
respectively (Figs. 7a and 7b). In contrast, the excitation and the
emission maxima of chromophores accounting for the protonation
state gave correlation coefficient and root mean square error
of r = 0.9795, RMS = 8.8237 and r = 0.9067, RMS = 15.7614,
respectively (Figs. 7c and 7d). On the basis of these
results, the correct protonation state of the chromophore is cru-
cial for accurate prediction of the excitation and the emission
maxima of the GFP chromophores studied. Chromophores drawn
without taking the protonation states into consideration gave
poor prediction accuracy because the descriptors derived from
these structures were irrelevant to the spectral properties under
investigation. When the chromophore’s protonation state is not
taken into account, the predictive model cannot distinguish
between the major and minor peaks, because both are represented
by descriptors derived from the same protonation state.
The evidence suggested that synthetic GFP chromophores
had only one protonation form, as observed from the absorbance
spectra. The correlation coefficients and root mean square errors
of the excitation and the emission maxima were r = 0.9335,
RMS = 9.9095 and r = 0.9626, RMS = 9.7508, respectively
(Figs. 7e and 7f).
To determine whether the presence of the β-barrel has any
influence on the predictive performance of the spectral properties,
we compared the prediction accuracy of the data set derived from
the chromophores of GFP color variants with that of the data set
comprising the synthetic GFP chromophores. The GFP color variants
data set is comprised of chromophores encapsulated within the
protective boundaries of the protein’s β-barrel, which shields
them from exposure to neighboring solvents. On the other hand,
the synthetic GFP chromophore data set consists of bare
chromophores exposed to the quenching effect of solvent
molecules. We confirmed the importance of quantum chemical
descriptors in calculating the spectral properties of the
chromophores regardless of their confinement within the β-barrel
or their interaction with neighboring side chains. This is not to
say that the hydrogen-bonding network between the chromophore
and its immediate vicinity plays no crucial role in the spectral
properties; rather, these interactions are important in governing
the protonation states of the chromophore, the correct account of
which facilitates accurate calculation of the excitation and the
emission maxima, as previously mentioned.
For comparison, the traditional modeling approaches in QSPR
studies, particularly MLR and PLS regression analysis, were used
to predict the GFP spectral properties. All descriptors used in
the ANN analysis were employed in the calculations. The regression
coefficients derived from MLR and PLS are shown in Tables 7
and 8, respectively. Comparisons of the three learning algorithms
are shown in Tables 9–11. Although all learning algorithms were
capable of predicting the spectral properties with accuracy in
the range of 0.8399 ≤ r ≤ 0.9795, ANN showed the most consistent
prediction accuracy, outperforming both MLR and PLS on three of
the four models. From the prediction results, it was observed
that all learning approaches performed well on model 1, with
accuracy in the range of 0.9498 ≤ r ≤ 0.9795. For model 2, PLS
performed slightly better than ANN, and both surpassed MLR. It is
observed that ANN outperformed both MLR and PLS for models 3 and
4, thus suggesting that both models are of non-linear nature. As
shown in Table 10, it is also interesting to note that the
performance of PLS, as observed by the LOO-CV r, is positively
correlated with the total percent of explained variance s2CV.
This reveals that the total variability accounted for by
the PLS model is crucial to the predictive performance.
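The comparison above can be checked directly from the tabulated values. The short sketch below copies the leave-one-out cross-validated r values from Tables 9–11 and the s2CV column from Table 10, then reports the best method per model and the Pearson correlation between the PLS r and s2CV. The variable names are illustrative only, and with just four models the correlation should be read as descriptive rather than as a statistical test.

```python
import numpy as np

# Leave-one-out cross-validated r for models 1-4, copied from Tables 9-11
r_cv = {
    "MLR": [0.9498, 0.8412, 0.8399, 0.9430],
    "PLS": [0.9496, 0.9237, 0.8892, 0.9363],
    "ANN": [0.9795, 0.9067, 0.9335, 0.9626],
}

# Method with the highest cross-validated r for each model
best = [max(r_cv, key=lambda name: r_cv[name][m]) for m in range(4)]
for m, name in enumerate(best, start=1):
    print(f"model {m}: highest r_CV from {name}")

# Table 10: total % explained variance under LOO-CV for the PLS models
s2_cv = [90.331, 86.455, 79.741, 88.493]
pearson = np.corrcoef(r_cv["PLS"], s2_cv)[0, 1]
print(f"Pearson r between PLS r_CV and s2CV: {pearson:.3f}")
```

With these values, ANN gives the highest cross-validated r for models 1, 3, and 4, and PLS for model 2, which matches the description in the text.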
It is observed that, for models 1 and 2, the excitation maxima
are predicted better than the emission maxima by the various
learning approaches. This is to be expected as the excitation
property of chromophores is well correlated with their molecular
structure.59 However, the emission phenomenon of chromophores is
of a complex nature since a quantity of energy is dissipated as
heat to a variable degree depending on the molecular structure.
The reduction of energy during emission causes a red-shift of the
wavelength because of the inverse relationship between energy and
wavelength as explained by Planck’s equation:

$$E = h\nu = \frac{hc}{\lambda} \qquad (8)$$

where $E$ is energy, $h$ is Planck’s constant, $\nu$ is the frequency,
$c$ is the speed of light, and $\lambda$ is the wavelength.
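As a worked illustration of eq. (8), the sketch below converts a wavelength to photon energy and back. The 395 nm value is the major absorbance peak mentioned earlier in the text; the 0.5 eV energy loss is a purely hypothetical number chosen to show the direction of the shift, and the function names are assumptions made for this example.

```python
# Physical constants in SI units (exact values in the 2019 SI definition)
H = 6.62607015e-34        # Planck's constant, J s
C = 2.99792458e8          # speed of light, m/s
EV = 1.602176634e-19      # joules per electronvolt


def photon_energy_ev(wavelength_nm: float) -> float:
    """Photon energy in eV from eq. (8): E = h*c / lambda."""
    return H * C / (wavelength_nm * 1e-9) / EV


def wavelength_from_energy_nm(energy_ev: float) -> float:
    """Inverse of eq. (8): lambda = h*c / E, returned in nm."""
    return H * C / (energy_ev * EV) * 1e9


excitation_nm = 395.0                 # major absorbance peak cited in the text
e_abs = photon_energy_ev(excitation_nm)
loss_ev = 0.5                         # hypothetical energy dissipated as heat
emission_nm = wavelength_from_energy_nm(e_abs - loss_ev)
print(f"E({excitation_nm:.0f} nm) = {e_abs:.3f} eV")
print(f"emission after losing {loss_ev} eV: {emission_nm:.1f} nm")
```

Because energy and wavelength are inversely related, any energy lost before emission moves the emitted photon to a longer wavelength than the one absorbed.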
Figure 6. Scheme of artificial neural network used in this study. The
network is comprised of three layers: input layer, hidden layer, and
output layer. Signals are propagated in a feed-forward manner from
the input layer through the hidden layer and onto the output layer fol-
lowed by adjustment of the weights according to the prediction error
(see ref. 48 for more detail). Nodes are represented by squares while
weights are represented by arrows. [Color figure can be viewed in the
online issue, which is available at www.interscience.wiley.com.]
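The three-layer feed-forward scheme in the Figure 6 caption, with weights adjusted according to the prediction error, can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions and not the software used in the study: sigmoid hidden units, a single linear output unit, batch gradient descent with a momentum term, and inputs and targets that are already scaled. The function name train_network, its default parameter values, and the synthetic data are all hypothetical.

```python
import numpy as np


def train_network(X, y, n_hidden=1, epochs=2000, lr=0.05, momentum=0.5, seed=0):
    """Train an input-hidden-output network by back-propagation.

    Returns the weights [W1, b1, W2, b2] and the training RMS per epoch.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    params = [W1, b1, W2, b2]
    velocity = [np.zeros_like(p) for p in params]
    y = y.reshape(-1, 1)
    n = len(X)
    history = []
    for _ in range(epochs):
        # Forward pass: input layer -> hidden layer -> output layer
        hidden = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
        output = hidden @ W2 + b2
        error = output - y
        history.append(float(np.sqrt(np.mean(error ** 2))))
        # Backward pass: gradients of the mean squared error
        grad_out = 2.0 * error / n
        gW2 = hidden.T @ grad_out
        gb2 = grad_out.sum(axis=0)
        grad_hidden = (grad_out @ W2.T) * hidden * (1.0 - hidden)
        gW1 = X.T @ grad_hidden
        gb1 = grad_hidden.sum(axis=0)
        # In-place weight update with momentum
        for p, v, g in zip(params, velocity, [gW1, gb1, gW2, gb2]):
            v *= momentum
            v -= lr * g
            p += v
    return params, history


# Synthetic example: 29 samples, 9 made-up descriptors, non-linear target
rng = np.random.default_rng(1)
X = rng.normal(size=(29, 9))
y = np.tanh(X[:, 0] - 0.5 * X[:, 1])
params, history = train_network(X, y, n_hidden=8)
print(f"training RMS: {history[0]:.3f} -> {history[-1]:.3f}")
```

The number of hidden nodes, the number of epochs, the learning rate, and the momentum constant are the same kinds of parameters whose optimization is shown in Figures 4 and 5; here they are simply fixed at arbitrary values.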
Figure 7. Predicting the excitation and the emission maxima of GFP color variants. (a–d) Plot of the pre-
dicted versus experimental values of the excitation (a, c) and the emission (b, d) maxima for calculations
made based on chromophores not taking the protonation state into account (a, b) and for computations
made based on chromophores taking the protonation state into consideration (c, d). (e, f) Plot of the pre-
dicted versus experimental values of the excitation (e) and the emission (f) maxima for synthetic GFP
chromophores. n and solid lines represent samples and regression line of the leave-one-out cross-validated
test set, respectively. & and dotted lines represent samples and regression line of the training set, respec-
tively. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]
The performance of ANN on models 3 and 4 is superior to
that of MLR and PLS. These results indicate the possible
influence of the microenvironment of the chromophores on the
performance of the various learning approaches. As we have
discussed previously, the chromophores for models 1 and 2 were
derived from various GFP color variants where they are shielded
from solvent molecules within the protective confinement of the
β-barrel. On the other hand, the chromophores for models 3 and
4 are solvent exposed and are prone to the collisional quenching
effect of oxygen and water molecules.60 Furthermore, alteration
in the electron flux2,61 of the microenvironment due to the absence
of the hydrogen bonding network1,62 may additionally influence
the spectral properties of these isolated chromophores. For
example, the lack of the intricate hydrogen bonding network
may contribute to greater conformational freedom of the
chromophore resulting in an accelerated internal conversion,35,61
which distorts the coplanarity of the benzyl and imidazole rings
as suggested by quantum chemical simulations.63,64 It should be
noted that the quantum chemical calculations were performed
in vacuo and possible solvent effects on the chromophore were
unaccounted for by these computations. It is generally known
that the native GFP chromophore resides in a fairly protective
microenvironment and the interior dielectric constant is assumed
to be low (ε = 2–4).65–68 Since the difference between the
dielectric constant of the chromophore microenvironment and that
of vacuum (ε = 1) is negligible, we assume that this did not
affect the predictive performance. This is supported by the
superior performance of all learning methods on model 1 where r
is in excess of 0.94. However, quite the opposite is valid for
the isolated chromophores since they are solvent exposed where
the dielectric constant (ε = 80) is significantly different from
that of vacuum. This is not to say that the descriptors are
inadequate in describing the spectral phenomenon; rather, they
may not be suitable for linear regression analysis. Since ANN is
known to be suitable in cases where the mechanistic understanding
is not well known,69
it is capable of modeling the spectral properties without
knowledge of the solvent effect. Moreover, the predictive
performance of excitation maxima for models 3 and 4 is lower than
that of emission maxima as opposed to what was observed for models
1 and 2. This contradicts the assumption that the phenomenon of
excitation is straightforward, and therefore should have higher
predictive performance. Perhaps the excitation phenomenon of
isolated chromophores may be less straightforward than previously
thought as their microenvironments greatly influence their
spectral properties.

Table 7. Regression Coefficients of MLR Models.
Descriptor  Model 1      Model 2      Model 3      Model 4
MW           9.240E-01   -1.093E+01    4.344E+01    3.858E+01
CPKArea     -3.820E-01    5.067E+00    3.165E+01    2.266E+01
ETotal      -1.169E+06    1.295E+06    3.850E+05   -3.469E+05
EHOMO        9.719E+05   -1.077E+06   -4.302E+05    3.877E+05
ELUMO        1.868E+01   -3.060E+01    2.736E+00    4.754E+00
�           -2.811E+00    1.712E+01   -1.408E+01   -1.558E+01
�           -2.113E+05    2.340E+05    1.986E+05   -1.790E+05
!           -6.160E+01    2.340E+01   -2.138E+01    7.946E+01
Qm           1.355E+00    1.247E+01    2.926E+00   -4.780E-01
Model 1, GFP color variants (excitation maxima) data set, Y-intercept
is 431.104; model 2, GFP color variants (emission maxima) data set,
Y-intercept is 484.264; model 3, synthetic GFP chromophore (excitation
maxima) data set, Y-intercept is 400.458; model 4, synthetic GFP
chromophore (emission maxima) data set, Y-intercept is 474.712.

Table 8. Regression Coefficients of PLS Models.
Descriptor  Model 1      Model 2      Model 3      Model 4
MW          -2.241E+00   -7.401E+00   -3.700E-01    2.880E-01
CPKArea      3.638E+00    7.964E+00    1.854E+00    6.530E-01
ETotal       6.257E+00    7.515E+00    9.421E+00    8.755E+00
EHOMO        6.790E-01   -1.600E-01    1.570E-01   -2.770E+00
ELUMO       -8.815E+00   -3.058E+01    3.537E+00   -7.600E-01
�            1.107E+01    1.991E+01    7.010E-01    9.890E-01
�           -3.151E+01   -4.233E+01   -1.792E+01   -2.297E+01
!           -1.315E+01   -5.987E+00    7.050E-01    5.192E+00
Qm           5.271E+00    1.510E+01   -1.293E+00    3.547E+00
Model 1, GFP color variants (excitation maxima) data set, Y-intercept
is 431.105; model 2, GFP color variants (emission maxima) data set,
Y-intercept is 484.263; model 3, synthetic GFP chromophore (excitation
maxima) data set, Y-intercept is 400.483; model 4, synthetic GFP
chromophore (emission maxima) data set, Y-intercept is 474.690.

Table 9. Performance of MLR Models.
Model   n    rTR      rCV      RMSTR    RMSCV
1       19   0.9910   0.9498   5.8537   13.9642
2       19   0.9770   0.8412   7.6775   21.0530
3       29   0.9480   0.8399   6.7854   13.0558
4       29   0.9775   0.9430   6.1122   9.7858
n, number of chromophores in the data set; rTR, training correlation
coefficient; rCV, cross-validated correlation coefficient; RMSTR, root
mean square error for training; RMSCV, root mean square error for
leave-one-out cross-validation. Model 1, GFP color variants (excitation
maxima) data set; model 2, GFP color variants (emission maxima) data
set; model 3, synthetic GFP chromophore (excitation maxima) data set;
model 4, synthetic GFP chromophore (emission maxima) data set.

Table 10. Performance of PLS Models.
Model   n    rTR      rCV      NPC   s2TR     s2CV     RMSTR    RMSCV
1       19   0.9780   0.9496   5     95.655   90.331   9.1145   14.3518
2       19   0.9696   0.9237   5     94.016   86.455   8.8030   13.9798
3       29   0.9425   0.8892   4     88.825   79.741   7.1232   9.9334
4       29   0.9603   0.9363   3     92.216   88.493   8.0876   10.1841
n, number of chromophores in the data set; rTR, training correlation
coefficient; rCV, cross-validated correlation coefficient; NPC, number
of PLS components; s2TR, total % explained variance for training;
s2CV, total % explained variance for leave-one-out cross-validation;
RMSTR, root mean square error for training; RMSCV, root mean square
error for leave-one-out cross-validation. Model 1, GFP color variants
(excitation maxima) data set; model 2, GFP color variants (emission
maxima) data set; model 3, synthetic GFP chromophore (excitation
maxima) data set; model 4, synthetic GFP chromophore (emission maxima)
data set.
Conclusion
We have demonstrated a novel computational method that per-
mits the rational design of GFP color variants, allowing the
effects of chromophore mutations on the spectral properties to be
studied. The predicted excitation and emission maxima of the
GFP chromophores, using back-propagation neural network, were
found to be in good agreement with the experimental values. Our
results indicate that the protonation state of the chromophores,
used for the generation of quantum chemical descriptors, is use-
ful and necessary to obtain satisfactory spectral predictions that
are representative of the experimentally determined values. It
was also shown that regardless of the confinement of the chro-
mophore, the proposed methodology was capable of performing
well on the spectral predictions. Of the three learning methods
used in this study, ANN was found to outperform the traditional
linear regression methods, namely MLR and PLS. Overall,
our proposed strategy facilitates an in silico approach to
the design of novel GFP color variants as well as synthetic GFP
chromophores, which can then be validated by experimental stud-
ies. Furthermore, the approach used in this study has broad
implications as it could be applied for prediction of the spectral
properties of other fluorescent compounds or proteins.
References
1. Ormo, M.; Cubitt, A. B.; Kallio, K.; Gross, L. A.; Tsien, R. Y.;
Remington, S. J. Science 1996, 273, 1392.
2. Tsien, R. Y. Annu Rev Biochem 1998, 67, 509.
3. Prasher, D. C.; Eckenrode, V. K.; Ward, W. W.; Prendergast, F. G.;
Cormier, M. J. Gene 1992, 111, 229.
4. Chalfie, M.; Tu, Y.; Euskirchen, G.; Ward, W. W.; Prasher, D. C.
Science 1994, 263, 802.
5. Sun, Y.; Wong, M. D.; Rosen, B. P. J Biol Chem 2001, 276, 14955.
6. Lippincott-Schwartz, J.; Snapp, E.; Kenworthy, A. Nat Rev Mol Cell
Biol 2001, 2, 444.
7. Hink, M. A.; Bisselin, T.; Visser, A. J. Plant Mol Biol 2002, 50, 871.
8. Prachayasittikul, V.; Isarankura Na Ayudhya, C.; Boonpangrak, S.;
Galla, H.-J. J Membr Biol 2004, 200, 47.
9. Isarankura Na Ayudhya, C.; Prachayasittikul, V.; Galla, H.-J. Eur
Biophys J 2004, 33, 522.
10. Prachayasittikul, V.; Isarankura Na Ayudhya, C.; Tantimongcolwat,
T.; Galla, H.-J. Biochem Biophys Res Commun 2005, 326, 298.
11. Prachayasittikul, V.; Isarankura Na Ayudhya, C.; Bulow, L. Biotech-
nol Lett 2001, 23, 1285.
12. Kostov, Y.; Albano, C. R.; Rao, G. Biotechnol Bioeng 2000, 70, 473.
13. Bae, J. H.; Paramita Pal, P.; Moroder, L.; Huber, R.; Budisa, N.
Chembiochem 2004, 5, 720.
14. Bae, J. H.; Rubini, M.; Jung, G.; Wiegand, G.; Seifert, M. H. J.;
Azim, M. K.; Kim, J.-S.; Zumbusch, A.; Holak, T. A.; Moroder, L.;
Huber, R.; Budisa, N. J Mol Biol 2003, 328, 1071.
15. Heim, R.; Prasher, D. C.; Tsien, R. Y. Proc Natl Acad Sci USA
1994, 91, 12501.
16. Das, A. K.; Hasegawa, J. Y.; Miyahara, T.; Ehara, M.; Nakatsuji, H.
J Comput Chem 2003, 24, 1421.
17. Patnaik, S. S.; Trohalaki, S.; Pachter, R. Biopolymers 2004, 75, 441.
18. Tozzini, V.; Nifosi, R. J Phys Chem B 2001, 105, 5797.
19. Yoo, H.-Y.; Boatz, J. A.; Helms, V.; McCammon, J. A.; Langhoff,
P. W. J Phys Chem B 2001, 105, 2850.
20. Donnelly, M.; Fedeles, F.; Wirstam, M.; Siegbahn, P. E.; Zimmer,
M. J Am Chem Soc 2001, 123, 4679.
21. Altoe’, P.; Bernardi, F.; Garavelli, M.; Orlandi, G.; Negri, F. J Am
Chem Soc 2005, 127, 3952.
22. Martin, M. E.; Negri, F.; Olivucci, M. J Am Chem Soc 2004, 126, 5452.
23. Voityuk, A. A.; Kummer, A. D.; Michel-Beyerle, M.-E.; Rosch, N.
Chem Phys 2001, 269, 83.
24. Laino, T.; Nifosi, R.; Tozzini, V. Chem Phys 2004, 298, 17.
25. Voityuk, A. A.; Michel-Beyerle, M.-E.; Rosch, N. Chem Phys 1998,
231, 13.
26. Patterson, G.; Day, R. N.; Piston, D. J Cell Sci 2001, 114, 837.
27. Cubitt, A. B.; Heim, R.; Adams, S. R.; Boyd, A. E.; Gross, L. A.;
Tsien, R. Y. Trends Biochem Sci 1995, 20, 448.
28. Jung, G.; Wiehler, J.; Zumbusch, A. Biophys J 2005, 88, 1932.
29. Wang, L.; Xie, J.; Deniz, A. A.; Schultz, P. G. J Org Chem 2003,
68, 174.
30. Sniegowski, J. A.; Phail, M. E.; Wachter, R. M. Biochem Biophys
Res Commun 2005, 332, 657.
31. Elsliger, M. A.; Wachter, R. M.; Hanson, G. T.; Kallio, K.; Remington,
S. J. Biochemistry 1999, 38, 5296.
32. Gery, S.; Koeffler, H. P. J Mol Biol 2003, 328, 977.
33. Yang, T. T.; Sina, P.; Green, G.; Kittis, P. A.; Chen, Y. T.;
Lybarger, L.; Chervenak, R.; Patterson, G. H.; Piston, D. W.; Kain,
S. R. J Biol Chem 1998, 273, 8212.
34. Sawano, A.; Miyawaki, A. Nucleic Acids Res 2000, 28, e78.
35. Follenius-Wund, A.; Bourotte, M.; Schmitt, M.; Iyice, F.; Lami, H.;
Bourguignon, J.-J.; Haiech, J.; Pigault, C. Biophys J 2003, 85, 1839.
36. Spartan’04. Wavefunction: Irvine, CA.
37. RECON, Version 5.5. Rensselaer Polytechnic Institute: Troy, New
York. Available at http://www.chem.rpi.edu/chemweb/recondoc.
38. E-DRAGON, Version 1.0. Virtual Computational Chemistry Labora-
tory. Available at http://www.vcclab.org.
39. Breneman, C. M.; Rhem, M. J Comput Chem 1997, 18, 182.
Table 11. Performance of ANN Models.
Model   n    rTR      rCV      NANN     RMSTR    RMSCV
1       19   0.9953   0.9795   9-1-1    4.3639   8.8237
2       19   0.9919   0.9067   9-1-1    4.5785   15.7614
3       29   0.9924   0.9335   9-8-1    2.7500   9.9095
4       29   0.9840   0.9626   9-32-1   6.1641   9.7508
n, number of chromophores in the data set; rTR, training correlation
coefficient; rCV, cross-validated correlation coefficient; NANN, number
of nodes in input, hidden, and output layer of ANN; RMSTR, root mean
square error for training; RMSCV, root mean square error for
leave-one-out cross-validation. Model 1, GFP color variants (excitation
maxima) data set; model 2, GFP color variants (emission maxima) data
set; model 3, synthetic GFP chromophore (excitation maxima) data set;
model 4, synthetic GFP chromophore (emission maxima) data set.
40. Bader, R. F. W.; Anderson, S. G.; Duke, A. J. J Am Chem Soc
1979, 101, 1389.
41. Whitehead, C. E.; Breneman, C. M.; Sukumar, N.; Ryan, M. D. J
Comput Chem 2003, 24, 512.
42. Tugcu, N.; Song, M.; Breneman, C. M.; Sukumar, N.; Bennett, K.
P.; Cramer, S. M. Anal Chem 2003, 75, 3563.
43. Nantasenamat, C.; Naenna, T.; Isarankura Na Ayudhya, C.; Pra-
chayasittikul, V. J Comput Aided Mol Des 2005, 19, 509.
44. Todeschini, R.; Consonni, V.; Mannhold, R.; Kubinyi, H.; Timmer-
man, H. Handbook of Molecular Descriptors; Wiley-VCH: Wein-
heim, 2000.
45. Thanikaivelan, P.; Subramanian, V.; Raghava Rao, J.; Unni Nair, B.
Chem Phys Lett 2000, 323, 59.
46. Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Chem Rev 1996, 96, 1027.
47. Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools
and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, 2005.
48. Zupan, J.; Gasteiger, J. Neural Networks in Chemistry and Drug
Design, 2nd ed.; Wiley-VCH: Weinheim, 1999.
49. The Unscrambler, version 9.5. Camo Process AS: Norway.
50. Geladi, P.; Kowalski, B. R. Anal Chim Acta 1986, 185, 1.
51. Wold, S.; Sjostrom, M.; Eriksson, L. Chemometr Intell Lab 2001,
58, 109.
52. Haaland, D. M.; Thomas, E. V. Anal Chem 1988, 60, 1193.
53. UFS, version 1.8. University of Portsmouth, UK. Available at
http://www.port.ac.uk/research/cmd/software.
54. Whitley, D. C.; Ford, M. G.; Livingstone, D. J. J Chem Inf Comput
Sci 2000, 40, 1160.
55. Tibshirani, R. J.; Efron, B. Stat Appl Genet Mol Biol 2002, 1, 1.
56. Kearns, M.; Ron, D. Neural Comput 1999, 11, 1427.
57. Mason, L.; Bartlett, P.; Baxter, J. Technical Report; Department of
Systems Engineering, Australian National University, 1998. Avail-
able at http://citeseer.ist.psu.edu/mason98direct.html.
58. Mason, L.; Bartlett, P. L.; Baxter, J. Mach Learn 1999, 38, 243.
59. Valeur, B. Molecular Fluorescence: Principles and Applications;
Wiley-VCH: Weinheim, 2001.
60. Prendergast, F. G. Methods Cell Biol 1999, 58, 1.
61. Kummer, A.; Kompa, C.; Lossau, H.; Pollinger-Dammer, F.;
Michel-Beyerle, M. E.; Silva, C. M.; Bylina, E.; Coleman, W.;
Yang, M.; Youvan, D. Chem Phys 1998, 237, 183.
62. Zimmer, M. Chem Rev 2002, 102, 759.
63. Voityuk, A. A.; Michel-Beyerle, M. E.; Rosch, N. Chem Phys Lett
1998, 296, 269.
64. Weber, W.; Helms, V.; McCammon, J. A.; Langhoff, P. W. Proc
Natl Acad Sci USA 1999, 96, 6177.
65. Takashima, S.; Schwan, H. J Phys Chem 1965, 69, 4176.
66. Gilson, M. K.; Honig, B. H. Biopolymers 1986, 25, 2097.
67. Nakamura, H.; Sakamoto, T.; Wada, A. Protein Eng 1988, 2,
177.
68. Simonson, T.; Brooks, C. L. J Am Chem Soc 1996, 118, 8452.
69. Almeida, J. S. Curr Opin Biotechnol 2002, 13, 72.