QSBR study of bitter taste of peptides: Application of GA-PLS in combination with MLR, SVM, and ANN...

13
Hindawi Publishing Corporation BioMed Research International Volume 2013, Article ID 501310, 13 pages http://dx.doi.org/10.1155/2013/501310 Research Article QSBR Study of Bitter Taste of Peptides: Application of GA-PLS in Combination with MLR, SVM, and ANN Approaches Somaieh Soltani, 1 Hossein Haghaei, 2 Ali Shayanfar, 3 Javad Vallipour, 4 Karim Asadpour Zeynali, 5 and Abolghasem Jouyban 6 1 Biotechnology Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz 51664, Iran 2 Hematology and Oncology Research Center, Tabriz University of Medical Sciences, Tabriz 51664, Iran 3 Liver and Gastrointestinal Diseases Research Center, Students’ Research Committee, Tabriz University of Medical Sciences, Tabriz 51664, Iran 4 Tuberculosis and Lung Disease Research Center, Tabriz University of Medical Sciences, Tabriz 51664, Iran 5 Faculty of Chemistry, University of Tabriz, Tabriz 51664, Iran 6 Drug Applied Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz 51664, Iran Correspondence should be addressed to Abolghasem Jouyban; [email protected] Received 29 April 2013; Revised 16 September 2013; Accepted 25 September 2013 Academic Editor: Tatsuya Akutsu Copyright © 2013 Somaieh Soltani et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Detailed information about the relationships between structures and properties/activities of peptides as drugs and nutrients is useful in the development of drugs and functional foods containing peptides as active compounds. e bitterness of the peptides is an undesirable property which should be reduced during drug/nutrient production, and quantitative structure bitter taste relationship (QSBR) studies can help researchers to design less bitter peptides with higher target efficiency. Calculated structural parameters were used to develop three different QSBR models (i.e., multiple linear regression, support vector machine, and artificial neural network) to predict the bitterness of 229 peptides (containing 2–12 amino acids, obtained from the literature). e developed models were validated using internal and external validation methods, and the prediction errors were checked using mean percentage deviation and absolute average error values. All developed models predicted the activities successfully (with prediction errors less than experimental error values), whereas the prediction errors for nonlinear methods were less than those for linear methods. e selected structural descriptors successfully differentiated between bitter and nonbitter peptides. 1. Introduction Proteins are made from peptide fragments that are well known for their nutrient, biological, and physiological roles in the human body. Peptides modulate the health-connected physiological process of the cardiovascular, nervous, im- mune, and nutritional systems [1]. e investigation of prop- erties and activities of peptides as therapeutic, bioactive agents and nutrients as well as starting points for the develop- ment of drugs and drug-related compounds is one of the most interesting and demanding fields of food and drug sciences. It allows researchers to compile data sets on their structures and properties/activities. e results of these studies are useful in the development of functional foods containing peptides as active compounds and drugs [2]. Peptide bitterness is an undesirable property that is frequently generated during the enzymatic process to produce functional, bioactive protein hydrolysates or during the aging process in fermented food products [3]. Since many toxins are bitter, most mammalians including humans are instinctively averse to bitter-tasting substances in order to avoid toxin ingestion [4]. Most ther- apeutic peptides cannot be administered orally because of the poor biopharmaceutical performance of high-molecular- weight peptide drugs which is due to poor oral absorption,

Transcript of QSBR study of bitter taste of peptides: Application of GA-PLS in combination with MLR, SVM, and ANN...

Hindawi Publishing CorporationBioMed Research InternationalVolume 2013 Article ID 501310 13 pageshttpdxdoiorg1011552013501310

Research ArticleQSBR Study of Bitter Taste of PeptidesApplication of GA-PLS in Combination withMLR SVM and ANN Approaches

Somaieh Soltani1 Hossein Haghaei2 Ali Shayanfar3 Javad Vallipour4

Karim Asadpour Zeynali5 and Abolghasem Jouyban6

1 Biotechnology Research Center and Faculty of Pharmacy Tabriz University of Medical Sciences Tabriz 51664 Iran2Hematology and Oncology Research Center Tabriz University of Medical Sciences Tabriz 51664 Iran3 Liver and Gastrointestinal Diseases Research Center Studentsrsquo Research Committee Tabriz University of Medical SciencesTabriz 51664 Iran

4Tuberculosis and Lung Disease Research Center Tabriz University of Medical Sciences Tabriz 51664 Iran5 Faculty of Chemistry University of Tabriz Tabriz 51664 Iran6Drug Applied Research Center and Faculty of Pharmacy Tabriz University of Medical Sciences Tabriz 51664 Iran

Correspondence should be addressed to Abolghasem Jouyban ajouybanhotmailcom

Received 29 April 2013 Revised 16 September 2013 Accepted 25 September 2013

Academic Editor Tatsuya Akutsu

Copyright copy 2013 Somaieh Soltani et alThis is an open access article distributed under theCreative CommonsAttribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Detailed information about the relationships between structures and propertiesactivities of peptides as drugs andnutrients is usefulin the development of drugs and functional foods containing peptides as active compounds The bitterness of the peptides is anundesirable property which should be reduced during drugnutrient production and quantitative structure bitter taste relationship(QSBR) studies can help researchers to design less bitter peptides with higher target efficiency Calculated structural parameterswere used to develop three different QSBR models (ie multiple linear regression support vector machine and artificial neuralnetwork) to predict the bitterness of 229 peptides (containing 2ndash12 amino acids obtained from the literature)Thedevelopedmodelswere validated using internal and external validation methods and the prediction errors were checked using mean percentagedeviation and absolute average error values All developed models predicted the activities successfully (with prediction errors lessthan experimental error values) whereas the prediction errors for nonlinear methods were less than those for linear methods Theselected structural descriptors successfully differentiated between bitter and nonbitter peptides

1 Introduction

Proteins are made from peptide fragments that are wellknown for their nutrient biological and physiological rolesin the human body Peptides modulate the health-connectedphysiological process of the cardiovascular nervous im-mune and nutritional systems [1] The investigation of prop-erties and activities of peptides as therapeutic bioactiveagents and nutrients as well as starting points for the develop-ment of drugs and drug-related compounds is one of themostinteresting and demanding fields of food and drug sciences Itallows researchers to compile data sets on their structures and

propertiesactivities The results of these studies are usefulin the development of functional foods containing peptidesas active compounds and drugs [2] Peptide bitterness is anundesirable property that is frequently generated during theenzymatic process to produce functional bioactive proteinhydrolysates or during the aging process in fermented foodproducts [3] Since many toxins are bitter most mammaliansincluding humans are instinctively averse to bitter-tastingsubstances in order to avoid toxin ingestion [4] Most ther-apeutic peptides cannot be administered orally because ofthe poor biopharmaceutical performance of high-molecular-weight peptide drugs which is due to poor oral absorption

2 BioMed Research International

formulation stability and degradation in the gastrointestinaltract Studies on the origins of formulations and alternativeadministrations to overcome the mentioned problems havesuggested different administration methods such as par-enteral oral transdermal nasal pulmonary rectal ocularbuccal and sublingual drug delivery systems [5ndash8] Tasteplays a crucial role in buccal and sublingual administrationsystems Bitter taste properties in relation with the structureof the peptides in fermented food and protein hydrolyzateshave been studied Findings have shown that hydrophobicityis correlated with bitterness and a hydrophobic interactionis needed for the bitter receptors (T

2Rs) to sense bitter-

ness whereas the amino acid sequence has no effect onbitterness [9 10] Moreover introducing amino acids intothe hydrophobic chain intensifies bitterness and blockingboth C and N terminals of peptides by acetylating increasesbitterness about ten times [4] It is now generally acceptedthat the side-chain hydrophobicity and the number of carbonatoms of the hydrophobic side chain of the peptidersquos aminoacids are correlated to bitterness rather than to overallhydrophobicity [4 9 11 12] In fact the hydrophobic group ofthe side chain offers a binding site for the bitter taste receptorAnother binding site is a bulky basic group including an 120572-amino group To provide such features hydrophobic aminoacids at the C-terminal and basic amino acids at the N-terminal are necessary The bitter peptides are composed ofless than eight amino acids and the bitterness increases asthe number of amino acids increases Peptides composed ofeight or more amino acids do not differ in bitter potency andthey form a spherical shape rather than a helix conformation[4 9]

Beyond structure-activity studies researchers have triedto develop quantitative models to predict the bitterness ofpeptides [3] and to evaluate the effect of the primary structureon the potency of therapeutic peptides [1 13] These studieswere aimed at understanding the peptide structures respon-sible for different activities (taste ACE inhibition antithrom-botic opioid and endopeptidase inhibitory antimicrobialactivities and immune modulator) and to accelerate thestudies about food-derived functional and bioactive peptides

Different quantitative structure bitter taste relationship(QSBR) models using various descriptors have been devel-oped for di- and tripeptides [1 3 13ndash16] the study of tetra-and higher peptides is limited [3] Some methods developedto predict the bitterness of dipeptides used amino acid-based descriptors which resulted in models with acceptableprediction capabilities

Yin et al [17] reported 28 developed models in the year2010 for modeling dipeptide bitterness in comparison withtheir own model which used E

1ndashE5amino acid variables

(hydrophobicity (E1) steric properties or side chain bulk

molecular size (E2) preferences for amino acids to occur in

120572-helices (E3) composition (E

4) and the net charge (E

5))

to develop QSBR models using support vector regression(SVM) Their model [17] successfully predicted (1198772 = 097)the bitterness of 48 dipeptides (1198772 values calculated using(A6) of appendix) Table 1 (28 models taken from a reference[17] + 2 other models) summarizes the developed modelsfor dipeptides along with their prediction errors (root mean

Table 1 Details of previously developed QSBR models

No Descriptors Model 1199022 (CV) 1198772 RMSE Ref1 MS-WHIM PLS 0633 0704 ndlowast [18]2 ISA-ECI PLS Nd 0847 nd [19]3 119911-scales PLS 0713 0824 026 [20]4 GRID PLS 0780 nd nd [21]5 MHDV PCR 0864 0919 018 [22]6 MEEV (M3) MLR 0588 0773 034 [23]7 MEEV (M4) MLR 0677 0734 034 [23]8 SSIA-AM1 PLS 0837 085 025 [24]9 SSIA-PM3 PLS 0829 0888 022 [24]10 SSIA-HF PLS 0798 0844 025 [24]11 SSIA-DFT PLS 0741 0856 024 [24]12 MARCH-INSIDE MLR 0860 0881 023 [25]13 Constitutional MLR 0820 0846 026 [25]14 Topological MLR 0890 091 020 [25]15 Molecular MLR 0580 0618 039 [25]16 BCUT MLR 0720 0783 030 [25]17 Galvez MLR 0560 0617 040 [25]18 2D MLR 0710 0753 032 [25]19 Randic MLR 0512 0559 042 [25]20 Geometrical MLR 0895 0909 020 [25]21 RDF MLR 0814 0851 025 [25]22 3D-MoRSE MLR 0880 0914 020 [25]23 GETAWAY MLR 0857 0889 022 [25]24 WHIM MLR 0799 0861 025 [25]25 SZOTT PLS 0736 0908 020 [26]26 VSW PLS 0696 0868 024 [16]27 V MLR 0921 0948 017 [15]28 E MLR 0888 0940 021 [27]29 E SVR 0912 0962 012 [17]30 119911-scores PLS 0800 0850 nd [3]lowastNo data

square errors (RMSE) which were calculated using (A1) ofthe appendix)

Kim and Li-Chan (2006) [3] studied the QSBR of 224di- to tetradecapeptides (2ndash14 amino acids peptides) usingmultiple linear regression (MLR) and partial least square(PLS) regression methods They used total hydrophobicitymolecular mass (log119872) and residual numbers to developtheir regressionmodels and the amino acid 119911-scores namely1199111 hydrophobicity 119911

2 bulkinessmolecular size and 119911

3 elec-

tronic property were calculated using the method developedby Hellberg et al [20] (individually or in combination withthe mentioned parameters) to develop a PLS model (Table 1number 30) They concluded that the developed modelsfor whole data sets and data subsets (comprising differentpeptide length) were significant and improved for subsets [3]Moreover the combination of 119911-scores and studied variablesimproved the correlation of predicted and observed values fordi- and tripeptides

QSBR model development for three or more amino acidpeptides has not been studied as well as that of dipeptides

BioMed Research International 3

This work aims to build suitable models for predicting thebitterness of 224 peptides and 5 amino acids on the basis oftheir calculated structural parameters Structural parameters(0Dndash3D) were calculated using Dragon [28] software Theselected parameters were used to develop linear and nonlin-ear models and the prediction capability and robustness ofthe proposedmodels were studied using internal and externalvalidation methods according to the QSAR method valida-tions guidelines [29] and references [30ndash33] The impacts ofthe selected parameters and structural features on bitternesswere evaluated according to the parameter definitions

2 Materials and Methods

21 Data Set A total of 229 experimental bitterness values(224 peptides and 5 amino acids) determined by humansensory evaluations were obtained from [3] The details ofpeptide sequences and the bitterness activity expressed as alog(1119879) (119879 being the bitter threshold concentration (119872)) aresummarized in Table 2 The ranges of bitterness were 10ndash57and the numbers of amino acids of the studied peptides were2ndash14

22 Calculated Descriptors The 2D structures of all mole-cules were drawn and converted to 3D structures usingthe HyperChem 7 software The completed model and themolecularmechanics energyminimizedmolecules were usedas inputs for the Dragon 54 software [28] The softwarecalculated 20 subsets of molecular descriptors including2D autocorrelation 3D-MoRSE descriptors centered frag-ments Burden eigenvalues connectivity indices constitu-tional descriptors edge adjacency indices eigenvalue-basedindices functional group counts geometrical descriptorsGETAWAYdescriptors information indicesmolecular prop-erties Randic molecular profiles RDF descriptors topolog-ical charge indices topological descriptors walk and pathcounts and WHIM descriptors The structural parameterscalculated after discarding the constant and near-constantvalues (1295 descriptors) were saved and further analyzedusing the SPSS 115 STATISTICA 7 and MATLAB 78 soft-ware

23 Outlier Detection In order to identify possible out-liers two different methods of principal component analysis(PCA) mapping and standard scores were used Accordingto the nature of the outliers which could be related tothe different mechanism of binding because of the differentstructural features recording errors or inaccurate design ofsamples no single outlier detection approach could identifyall kinds of outliers [34] In this study two approaches onein descriptor space (PCA plot of scores) and the secondin response space (standard score) were used to check theoutliers before the main numerical analysis [35]

24 Training-Test Selection The training set plays an impor-tant role in developing the properties of the model as themore similar the molecules for training the model are themore accurate the expected results are Thus the selection

Table 2 Details of the observed-predicted (using MLR SVM andANNmethods) bitterness and corresponding errors (IPD)

Peptide sequenceObserved MLR IPD SVM IPD ANN IPDTraining data set

R 160 120 251 181 134 190 185F 170 186 93 183 75 186 95P 190 187 16 236 243 216 135L 170 182 69 182 73 186 94V 170 162 50 161 53 168 11GR 100 200 1001 231 1307 206 1060RR 211 238 129 269 274 284 347PP 234 205 124 235 02 226 33KP 252 239 50 328 301 320 268PR 252 197 217 326 293 322 279RF 260 268 30 281 81 276 60RP 310 234 245 305 15 303 21LI 240 207 138 212 115 201 164KF 204 246 208 183 105 184 100VF 252 246 25 232 80 236 63VY 252 244 31 274 88 277 99YG 252 237 61 248 17 239 53YY 263 282 72 247 60 247 60FI 283 262 76 232 180 206 272IF 283 254 102 358 266 352 244YF 310 270 129 331 68 329 62VA 116 163 409 155 339 149 286VG 119 134 129 156 308 136 143PA 132 162 225 159 205 155 172IE 137 213 555 230 682 212 547IQ 149 191 283 197 325 193 297IS 149 180 207 167 124 180 207IT 149 211 413 208 396 203 362SL 149 199 337 214 438 192 287WE 156 263 686 270 730 265 696IK 165 240 455 242 465 233 413IA 168 177 55 164 23 166 11AL 170 152 109 178 47 177 38VV 171 159 69 170 05 163 48LA 172 184 69 172 01 176 23PY 180 209 159 193 73 202 121GW 189 218 155 200 57 215 139PL 222 198 109 221 02 195 120PI 233 185 205 210 99 192 176IP 240 206 141 225 63 203 153YL 240 252 52 261 89 255 61LY 246 279 134 289 173 276 122IW 305 284 69 286 63 276 95FY 313 280 104 290 75 284 92LW 340 263 226 280 178 286 160IV 190 180 53 231 215 206 82FV 223 255 144 229 28 206 74VE 223 199 106 257 154 246 103

4 BioMed Research International

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDYP 170 251 478 172 10 157 79PK 223 206 78 227 17 210 59LD 223 219 17 226 13 209 62AD 223 215 37 222 05 200 105LE 252 215 146 233 75 213 155GE 283 213 248 263 72 258 88VL 211 204 32 247 169 237 124GP 179 228 273 209 167 194 86FG 200 224 121 225 125 226 128PF 214 228 64 234 93 229 69GF 236 225 45 230 25 224 52GY 215 214 04 212 15 217 08LF 282 252 105 270 42 255 94FL 285 258 94 264 74 257 97GL 164 180 96 170 40 175 66LG 171 170 07 158 77 170 06IG 201 171 151 169 160 170 156GI 217 191 121 182 163 184 154LL 247 221 105 233 56 208 160II 254 205 191 224 119 205 191IL 254 212 166 229 99 208 182RGP 190 251 318 166 124 183 39FGG 234 227 32 210 103 209 109GFG 252 235 67 225 109 210 165PPP 270 239 116 262 31 242 103GLL 283 237 164 233 176 207 267RPG 310 271 127 246 205 237 235LGG 100 168 676 175 750 167 671GGL 200 199 04 200 02 192 41LGL 230 238 33 243 58 234 17LLG 230 235 21 239 40 231 03PIP 285 237 167 371 301 351 232LLL 292 270 75 306 48 294 09GYG 170 172 13 263 545 242 422GVV 234 205 124 298 273 307 313VVV 234 238 18 176 249 164 298PPF 263 248 57 338 284 338 285YGG 263 260 13 267 16 246 63RPF 283 328 159 255 98 244 138FGF 292 285 24 324 111 329 125DLL 310 318 26 263 152 267 137GFF 323 304 59 314 28 308 47FPF 340 331 28 338 06 325 44GYY 340 302 112 315 74 306 99PFP 340 288 153 274 195 258 242YYY 370 345 68 373 07 357 35GGV 148 193 302 175 184 179 210PGG 234 169 276 202 136 200 146GGF 283 242 144 239 154 236 168GGP 204 195 44 260 273 251 231

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDPPG 204 222 86 235 154 216 57FFG 265 293 105 306 155 299 128ELL 340 291 145 273 197 282 170FFF 370 344 71 428 157 394 64EGG 283 206 272 271 42 261 79YGY 310 284 85 287 74 279 100YPF 352 303 139 318 96 304 137GGLG 160 192 198 155 33 134 162FGFG 352 294 164 355 10 356 12GPFF 380 321 156 319 160 315 170RPGF 380 339 109 341 102 336 116FGGF 392 308 215 332 153 32 183RPFF 440 382 132 442 04 424 36GLGG 170 184 82 170 02 174 23LGGG 190 198 41 226 189 192 08PFPP 234 314 343 239 19 241 28GPPF 252 288 142 256 17 247 19RRPP 27 314 164 300 112 311 153VYPF 352 346 16 297 157 305 135PFIV 352 312 115 328 69 333 55FFPR 400 387 32 379 52 387 33FFPP 252 327 296 251 03 243 35FFPE 276 356 288 349 263 355 286GGFF 285 348 220 239 161 234 181FFPG 290 338 166 337 163 335 156LLLL 323 335 37 318 16 323 00FFGG 252 320 270 241 44 231 82RRPFF 470 415 117 469 03 463 14GGGLG 190 228 199 234 231 208 94GLGGG 190 215 129 177 70 169 108LGGGG 190 216 136 251 323 249 311GGVVV 211 284 346 221 47 202 45FFPGG 283 358 266 318 124 318 122PGPIP 311 325 46 326 48 319 26RGPPF 263 345 312 295 122 289 98PPFIV 292 330 129 264 96 269 79RPGFF 351 382 88 398 134 386 99RPGGFF 404 395 23 428 58 406 06PFPGPI 336 363 81 365 87 356 60RPFFGG 392 411 47 422 76 407 39RGPPGF 352 331 59 355 09 343 26RGPPFI 460 360 217 347 247 352 234RGPFIV 430 384 107 383 109 376 126RGGFIV 310 335 80 302 26 289 67RGPPFF 423 404 46 413 24 398 59RRPPGF 440 373 153 479 90 475 79RRPPFF 515 413 199 508 14 540 49GPPFIV 292 340 166 328 124 320 95GGFFGG 370 402 88 372 04 355 40KPPFIV 382 356 68 342 104 351 82

BioMed Research International 5

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGGRPFF 404 407 06 398 15 383 51PVLGPV 330 325 16 326 12 321 28FPPFIV 352 366 40 418 188 409 163RGPPGGV 248 307 239 293 182 296 192RGPPGFF 440 397 97 404 83 393 107RGPPFFF 500 445 111 462 76 447 106VIIPFPG 360 408 133 349 31 35 28VIFPPGR 410 433 55 387 56 383 66RGPPGIG 278 349 255 341 227 343 235RGPPGGF 308 325 54 317 30 310 08YPFPGPI 380 409 77 421 107 405 66RPPPFFF 470 473 06 441 62 422 102VIPFPGR 415 413 05 438 56 423 19PFPGPIP 360 354 16 400 112 398 107RGPPGFG 368 356 31 347 57 349 52RGPFPIV 395 385 25 392 07 388 19RPFFRPFF 500 511 22 496 09 503 06RGPKPIIV 408 387 50 421 32 383 63VYPFPPGI 382 423 107 421 101 415 87RGPEPIIV 451 387 142 412 87 380 158RGPPGGFF 411 337 180 328 201 328 203GGRPFFGG 440 430 23 392 108 381 133RGPPGGGFF 395 405 26 424 74 409 35GGRGPPFIV 410 395 36 435 61 430 49RGPPFIVGG 431 406 57 404 63 401 71FFRPFFRPFF 515 549 67 421 182 406 211PVRGPFPIIV 540 416 229 482 108 418 226VYPFPPGINH 430 448 43 451 49 438 20VYPFPPGIGG 352 420 192 308 124 300 149VYPFGGGINH 364 396 87 401 101 388 66RPFFRPFFRPFF 500 544 89 504 08 522 44RGPPFIVRGPPFIV 440 492 118 387 120 375 148PVLGPVRGPFPIIV 483 436 97 511 59 448 72

Test data setRG 211 197 64 229 84 201 46AV 116 171 476 159 374 158 363ID 137 207 514 207 512 197 439VD 190 194 22 236 245 215 134LV 223 189 151 175 216 182 184VI 223 190 147 199 109 196 121FP 277 254 83 265 43 246 111FF 301 275 86 289 40 282 65AF 181 241 334 255 407 239 318PGR 160 265 654 260 623 265 659RRR 240 355 480 352 468 355 478FIV 283 277 21 268 52 271 42GGY 283 244 138 255 97 245 135GRP 310 264 148 254 181 264 150YYG 320 310 32 317 09 311 29KPF 340 318 64 314 76 311 85GLG 200 158 213 157 215 169 157

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGPG 170 209 232 211 240 201 181KPK 252 297 177 282 118 302 199VYP 252 297 177 287 139 290 150GGGL 234 226 32 239 21 210 100RPFG 341 347 18 337 11 334 19RGFF 380 369 30 395 40 383 07GGLGG 190 229 206 193 17 192 09GGGGL 265 221 166 240 95 217 179PGPGPG 260 310 192 314 206 309 190PFPIIV 390 343 120 355 90 340 127VIFPPG 268 380 418 371 384 366 366RGPPFIV 430 386 103 375 127 378 121VYPFPPG 352 406 154 409 161 395 122RGPFPIIV 540 414 233 440 185 414 234RGPGPIIV 481 384 203 426 115 388 192RRPPPFFF 570 478 162 476 165 469 176VIIPFPGR 385 454 180 447 161 451 171VYPFPPIGNH 430 446 37 448 41 438 18GGRGPPFIVGG 440 421 43 421 44 411 66

Validation data setIN 149 211 419 212 424 201 348WW 360 299 169 318 117 305 153GV 174 222 278 219 258 208 198FPK 252 310 229 294 167 314 247FPP 234 265 133 256 96 262 121PGP 204 256 254 257 260 241 180VIF 289 278 38 279 36 282 23RPPFIV 410 374 87 361 120 365 110RGPPFGG 323 337 45 323 01 334 33RGPPFIIV 430 412 41 420 24 399 72

of the training and test sets is one of the most importantsteps in model development and should be done so thateach set reflects the original data set as much as possible 119870-means clustering is one of the most frequently used methodsof data set splitting This procedure identifies relativelyhomogeneous groups of molecules (each observation has thenearest distance to the mean of the cluster) based on selectedproperties (biologic activities and structural parameters)Using this method we divided the data set into 10 clustersand three subsets (ie training test and validation sets) wereselected from themThe validation set was excluded from thestudy before the descriptor selection step and the test set wasexcluded before the model development step

25 Descriptor Selection In order to reduce the dimensionof the variable matrix a correlation analysis was carriedout and both the highly correlated and constant variableswere excluded A home-developed MATLAB toolbox wasemployed that considered the following criteria for excludinga variable [1] the intercorrelation with other descriptorshigher than 099 [2] the correlation with activity lower than

6 BioMed Research International

Bitterness

Total descriptors

(1) Removing constantdescriptors

(2) Removing highlycorrelated descriptors

Descriptors Descriptors

Descriptors

Training test andvalidation sets selection

using K-means clustering

Fina

l sele

cted

de

scrip

tors

Stepwise regressiondescriptor selection G

A-PL

S se

lect

edde

scrip

tors

GA-PLS descriptorselection

Model development(1) MLR(2) SVM(3) ANN

Prediction of thebitterness for training test

and validation sets

Validation of thedeveloped modelsaccording to the

guidelines

Figure 1 Procedure of descriptor selection QSBR model development and validation processes

intercorrelated descriptors and [3] the frequency of repeatedvalues for a descriptor (lower than 10 of cases)

After this step a genetic algorithm-partial least square(GA-PLS) algorithm [36] was used to select the most signif-icant variables GA-PLS (a combination of genetic algorithmand PLS regression) was also developed and utilized as avariable selection method in QSAR and QSPR studies byLeardi [36] and applied for QSAR studies [37 38] TheMATLAB 78 software was used to run the GA-PLS methoddeveloped by Leardi The variables were divided into sub-groups (containing up to 200 variables) and each subgroupwith the corresponding log 1119879 values (119884) was introduced tothe algorithm as input The output (containing descriptorsscoring based on the cross-validated percentage of explainedvariance) was produced after 100 runs As the results of theGA-PLS were a bit different for each run the top 50 scoresof each subgroup were combined with the next 100 variablesand this step was repeated up to the final step A comparativestudy was done in this study to check the capability of GA-PLS for descriptor selection in comparison with commonmethods (ie PLS and stepwise) Results showed that GA-PLS was able to extract more significant descriptorsThe onlydrawback of this method was the scoring differences betweendifferent runs which could be solved using the mentionedprocedure (ie selection of 50 of each run instead of limiteddescriptors [37])

Twenty percent of high score variables of the final stepwere selected as the most significant variables and were fur-ther studied by stepwise regression and bivariate correlationanalysis in which the selected variables using stepwise regres-sionwere investigated concerning their intercorrelationsTheless intercorrelated variables were selected for the final modeldevelopment process The complete procedure of descriptorselection is summarized in Figure 1

26 Model Building Linear and nonlinear models weredeveloped using the selected descriptors The details of themodel development and validation process are discussed inthe next sections

27 Linear Model Using Multiple Linear Regression (MLR)The selected parameters were used for developing QSBRequations using the multiple linear regression correlation(MLR) method and the goodness of fit and statistical sig-nificance of the models were evaluated using 1198772 (coefficientof determination) 119865 (variance ratio) and the MPD (meanpercentage deviation) values calculated using

MPD = 100119873

sum

100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(1)

The relative frequency of individual percentage deviations(IPD) was studied in order to define the model predictioncapacity for each data point The IPD was calculated using

IPD = 100100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(2)

In order to compare the MPD values with the relativestandard deviations (RSD) between experimental bitter val-ues measured by different research groups (inter laboratoryrelative standard deviations (ILRSD)) the ILRSD values forthe available data were computed using

ILRSD = 100119873

100381610038161003816100381610038161003816100381610038161003816

119884exp minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(3)

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

2 BioMed Research International

formulation stability and degradation in the gastrointestinaltract Studies on the origins of formulations and alternativeadministrations to overcome the mentioned problems havesuggested different administration methods such as par-enteral oral transdermal nasal pulmonary rectal ocularbuccal and sublingual drug delivery systems [5ndash8] Tasteplays a crucial role in buccal and sublingual administrationsystems Bitter taste properties in relation with the structureof the peptides in fermented food and protein hydrolyzateshave been studied Findings have shown that hydrophobicityis correlated with bitterness and a hydrophobic interactionis needed for the bitter receptors (T

2Rs) to sense bitter-

ness whereas the amino acid sequence has no effect onbitterness [9 10] Moreover introducing amino acids intothe hydrophobic chain intensifies bitterness and blockingboth C and N terminals of peptides by acetylating increasesbitterness about ten times [4] It is now generally acceptedthat the side-chain hydrophobicity and the number of carbonatoms of the hydrophobic side chain of the peptidersquos aminoacids are correlated to bitterness rather than to overallhydrophobicity [4 9 11 12] In fact the hydrophobic group ofthe side chain offers a binding site for the bitter taste receptorAnother binding site is a bulky basic group including an 120572-amino group To provide such features hydrophobic aminoacids at the C-terminal and basic amino acids at the N-terminal are necessary The bitter peptides are composed ofless than eight amino acids and the bitterness increases asthe number of amino acids increases Peptides composed ofeight or more amino acids do not differ in bitter potency andthey form a spherical shape rather than a helix conformation[4 9]

Beyond structure-activity studies researchers have triedto develop quantitative models to predict the bitterness ofpeptides [3] and to evaluate the effect of the primary structureon the potency of therapeutic peptides [1 13] These studieswere aimed at understanding the peptide structures respon-sible for different activities (taste ACE inhibition antithrom-botic opioid and endopeptidase inhibitory antimicrobialactivities and immune modulator) and to accelerate thestudies about food-derived functional and bioactive peptides

Different quantitative structure bitter taste relationship(QSBR) models using various descriptors have been devel-oped for di- and tripeptides [1 3 13ndash16] the study of tetra-and higher peptides is limited [3] Some methods developedto predict the bitterness of dipeptides used amino acid-based descriptors which resulted in models with acceptableprediction capabilities

Yin et al [17] reported 28 developed models in the year2010 for modeling dipeptide bitterness in comparison withtheir own model which used E

1ndashE5amino acid variables

(hydrophobicity (E1) steric properties or side chain bulk

molecular size (E2) preferences for amino acids to occur in

120572-helices (E3) composition (E

4) and the net charge (E

5))

to develop QSBR models using support vector regression(SVM) Their model [17] successfully predicted (1198772 = 097)the bitterness of 48 dipeptides (1198772 values calculated using(A6) of appendix) Table 1 (28 models taken from a reference[17] + 2 other models) summarizes the developed modelsfor dipeptides along with their prediction errors (root mean

Table 1 Details of previously developed QSBR models

No Descriptors Model 1199022 (CV) 1198772 RMSE Ref1 MS-WHIM PLS 0633 0704 ndlowast [18]2 ISA-ECI PLS Nd 0847 nd [19]3 119911-scales PLS 0713 0824 026 [20]4 GRID PLS 0780 nd nd [21]5 MHDV PCR 0864 0919 018 [22]6 MEEV (M3) MLR 0588 0773 034 [23]7 MEEV (M4) MLR 0677 0734 034 [23]8 SSIA-AM1 PLS 0837 085 025 [24]9 SSIA-PM3 PLS 0829 0888 022 [24]10 SSIA-HF PLS 0798 0844 025 [24]11 SSIA-DFT PLS 0741 0856 024 [24]12 MARCH-INSIDE MLR 0860 0881 023 [25]13 Constitutional MLR 0820 0846 026 [25]14 Topological MLR 0890 091 020 [25]15 Molecular MLR 0580 0618 039 [25]16 BCUT MLR 0720 0783 030 [25]17 Galvez MLR 0560 0617 040 [25]18 2D MLR 0710 0753 032 [25]19 Randic MLR 0512 0559 042 [25]20 Geometrical MLR 0895 0909 020 [25]21 RDF MLR 0814 0851 025 [25]22 3D-MoRSE MLR 0880 0914 020 [25]23 GETAWAY MLR 0857 0889 022 [25]24 WHIM MLR 0799 0861 025 [25]25 SZOTT PLS 0736 0908 020 [26]26 VSW PLS 0696 0868 024 [16]27 V MLR 0921 0948 017 [15]28 E MLR 0888 0940 021 [27]29 E SVR 0912 0962 012 [17]30 119911-scores PLS 0800 0850 nd [3]lowastNo data

square errors (RMSE) which were calculated using (A1) ofthe appendix)

Kim and Li-Chan (2006) [3] studied the QSBR of 224di- to tetradecapeptides (2ndash14 amino acids peptides) usingmultiple linear regression (MLR) and partial least square(PLS) regression methods They used total hydrophobicitymolecular mass (log119872) and residual numbers to developtheir regressionmodels and the amino acid 119911-scores namely1199111 hydrophobicity 119911

2 bulkinessmolecular size and 119911

3 elec-

tronic property were calculated using the method developedby Hellberg et al [20] (individually or in combination withthe mentioned parameters) to develop a PLS model (Table 1number 30) They concluded that the developed modelsfor whole data sets and data subsets (comprising differentpeptide length) were significant and improved for subsets [3]Moreover the combination of 119911-scores and studied variablesimproved the correlation of predicted and observed values fordi- and tripeptides

QSBR model development for three or more amino acidpeptides has not been studied as well as that of dipeptides

BioMed Research International 3

This work aims to build suitable models for predicting thebitterness of 224 peptides and 5 amino acids on the basis oftheir calculated structural parameters Structural parameters(0Dndash3D) were calculated using Dragon [28] software Theselected parameters were used to develop linear and nonlin-ear models and the prediction capability and robustness ofthe proposedmodels were studied using internal and externalvalidation methods according to the QSAR method valida-tions guidelines [29] and references [30ndash33] The impacts ofthe selected parameters and structural features on bitternesswere evaluated according to the parameter definitions

2 Materials and Methods

21 Data Set A total of 229 experimental bitterness values(224 peptides and 5 amino acids) determined by humansensory evaluations were obtained from [3] The details ofpeptide sequences and the bitterness activity expressed as alog(1119879) (119879 being the bitter threshold concentration (119872)) aresummarized in Table 2 The ranges of bitterness were 10ndash57and the numbers of amino acids of the studied peptides were2ndash14

22 Calculated Descriptors The 2D structures of all mole-cules were drawn and converted to 3D structures usingthe HyperChem 7 software The completed model and themolecularmechanics energyminimizedmolecules were usedas inputs for the Dragon 54 software [28] The softwarecalculated 20 subsets of molecular descriptors including2D autocorrelation 3D-MoRSE descriptors centered frag-ments Burden eigenvalues connectivity indices constitu-tional descriptors edge adjacency indices eigenvalue-basedindices functional group counts geometrical descriptorsGETAWAYdescriptors information indicesmolecular prop-erties Randic molecular profiles RDF descriptors topolog-ical charge indices topological descriptors walk and pathcounts and WHIM descriptors The structural parameterscalculated after discarding the constant and near-constantvalues (1295 descriptors) were saved and further analyzedusing the SPSS 115 STATISTICA 7 and MATLAB 78 soft-ware

23 Outlier Detection In order to identify possible out-liers two different methods of principal component analysis(PCA) mapping and standard scores were used Accordingto the nature of the outliers which could be related tothe different mechanism of binding because of the differentstructural features recording errors or inaccurate design ofsamples no single outlier detection approach could identifyall kinds of outliers [34] In this study two approaches onein descriptor space (PCA plot of scores) and the secondin response space (standard score) were used to check theoutliers before the main numerical analysis [35]

24 Training-Test Selection The training set plays an impor-tant role in developing the properties of the model as themore similar the molecules for training the model are themore accurate the expected results are Thus the selection

Table 2 Details of the observed-predicted (using MLR SVM andANNmethods) bitterness and corresponding errors (IPD)

Peptide sequenceObserved MLR IPD SVM IPD ANN IPDTraining data set

R 160 120 251 181 134 190 185F 170 186 93 183 75 186 95P 190 187 16 236 243 216 135L 170 182 69 182 73 186 94V 170 162 50 161 53 168 11GR 100 200 1001 231 1307 206 1060RR 211 238 129 269 274 284 347PP 234 205 124 235 02 226 33KP 252 239 50 328 301 320 268PR 252 197 217 326 293 322 279RF 260 268 30 281 81 276 60RP 310 234 245 305 15 303 21LI 240 207 138 212 115 201 164KF 204 246 208 183 105 184 100VF 252 246 25 232 80 236 63VY 252 244 31 274 88 277 99YG 252 237 61 248 17 239 53YY 263 282 72 247 60 247 60FI 283 262 76 232 180 206 272IF 283 254 102 358 266 352 244YF 310 270 129 331 68 329 62VA 116 163 409 155 339 149 286VG 119 134 129 156 308 136 143PA 132 162 225 159 205 155 172IE 137 213 555 230 682 212 547IQ 149 191 283 197 325 193 297IS 149 180 207 167 124 180 207IT 149 211 413 208 396 203 362SL 149 199 337 214 438 192 287WE 156 263 686 270 730 265 696IK 165 240 455 242 465 233 413IA 168 177 55 164 23 166 11AL 170 152 109 178 47 177 38VV 171 159 69 170 05 163 48LA 172 184 69 172 01 176 23PY 180 209 159 193 73 202 121GW 189 218 155 200 57 215 139PL 222 198 109 221 02 195 120PI 233 185 205 210 99 192 176IP 240 206 141 225 63 203 153YL 240 252 52 261 89 255 61LY 246 279 134 289 173 276 122IW 305 284 69 286 63 276 95FY 313 280 104 290 75 284 92LW 340 263 226 280 178 286 160IV 190 180 53 231 215 206 82FV 223 255 144 229 28 206 74VE 223 199 106 257 154 246 103

4 BioMed Research International

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDYP 170 251 478 172 10 157 79PK 223 206 78 227 17 210 59LD 223 219 17 226 13 209 62AD 223 215 37 222 05 200 105LE 252 215 146 233 75 213 155GE 283 213 248 263 72 258 88VL 211 204 32 247 169 237 124GP 179 228 273 209 167 194 86FG 200 224 121 225 125 226 128PF 214 228 64 234 93 229 69GF 236 225 45 230 25 224 52GY 215 214 04 212 15 217 08LF 282 252 105 270 42 255 94FL 285 258 94 264 74 257 97GL 164 180 96 170 40 175 66LG 171 170 07 158 77 170 06IG 201 171 151 169 160 170 156GI 217 191 121 182 163 184 154LL 247 221 105 233 56 208 160II 254 205 191 224 119 205 191IL 254 212 166 229 99 208 182RGP 190 251 318 166 124 183 39FGG 234 227 32 210 103 209 109GFG 252 235 67 225 109 210 165PPP 270 239 116 262 31 242 103GLL 283 237 164 233 176 207 267RPG 310 271 127 246 205 237 235LGG 100 168 676 175 750 167 671GGL 200 199 04 200 02 192 41LGL 230 238 33 243 58 234 17LLG 230 235 21 239 40 231 03PIP 285 237 167 371 301 351 232LLL 292 270 75 306 48 294 09GYG 170 172 13 263 545 242 422GVV 234 205 124 298 273 307 313VVV 234 238 18 176 249 164 298PPF 263 248 57 338 284 338 285YGG 263 260 13 267 16 246 63RPF 283 328 159 255 98 244 138FGF 292 285 24 324 111 329 125DLL 310 318 26 263 152 267 137GFF 323 304 59 314 28 308 47FPF 340 331 28 338 06 325 44GYY 340 302 112 315 74 306 99PFP 340 288 153 274 195 258 242YYY 370 345 68 373 07 357 35GGV 148 193 302 175 184 179 210PGG 234 169 276 202 136 200 146GGF 283 242 144 239 154 236 168GGP 204 195 44 260 273 251 231

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDPPG 204 222 86 235 154 216 57FFG 265 293 105 306 155 299 128ELL 340 291 145 273 197 282 170FFF 370 344 71 428 157 394 64EGG 283 206 272 271 42 261 79YGY 310 284 85 287 74 279 100YPF 352 303 139 318 96 304 137GGLG 160 192 198 155 33 134 162FGFG 352 294 164 355 10 356 12GPFF 380 321 156 319 160 315 170RPGF 380 339 109 341 102 336 116FGGF 392 308 215 332 153 32 183RPFF 440 382 132 442 04 424 36GLGG 170 184 82 170 02 174 23LGGG 190 198 41 226 189 192 08PFPP 234 314 343 239 19 241 28GPPF 252 288 142 256 17 247 19RRPP 27 314 164 300 112 311 153VYPF 352 346 16 297 157 305 135PFIV 352 312 115 328 69 333 55FFPR 400 387 32 379 52 387 33FFPP 252 327 296 251 03 243 35FFPE 276 356 288 349 263 355 286GGFF 285 348 220 239 161 234 181FFPG 290 338 166 337 163 335 156LLLL 323 335 37 318 16 323 00FFGG 252 320 270 241 44 231 82RRPFF 470 415 117 469 03 463 14GGGLG 190 228 199 234 231 208 94GLGGG 190 215 129 177 70 169 108LGGGG 190 216 136 251 323 249 311GGVVV 211 284 346 221 47 202 45FFPGG 283 358 266 318 124 318 122PGPIP 311 325 46 326 48 319 26RGPPF 263 345 312 295 122 289 98PPFIV 292 330 129 264 96 269 79RPGFF 351 382 88 398 134 386 99RPGGFF 404 395 23 428 58 406 06PFPGPI 336 363 81 365 87 356 60RPFFGG 392 411 47 422 76 407 39RGPPGF 352 331 59 355 09 343 26RGPPFI 460 360 217 347 247 352 234RGPFIV 430 384 107 383 109 376 126RGGFIV 310 335 80 302 26 289 67RGPPFF 423 404 46 413 24 398 59RRPPGF 440 373 153 479 90 475 79RRPPFF 515 413 199 508 14 540 49GPPFIV 292 340 166 328 124 320 95GGFFGG 370 402 88 372 04 355 40KPPFIV 382 356 68 342 104 351 82

BioMed Research International 5

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGGRPFF 404 407 06 398 15 383 51PVLGPV 330 325 16 326 12 321 28FPPFIV 352 366 40 418 188 409 163RGPPGGV 248 307 239 293 182 296 192RGPPGFF 440 397 97 404 83 393 107RGPPFFF 500 445 111 462 76 447 106VIIPFPG 360 408 133 349 31 35 28VIFPPGR 410 433 55 387 56 383 66RGPPGIG 278 349 255 341 227 343 235RGPPGGF 308 325 54 317 30 310 08YPFPGPI 380 409 77 421 107 405 66RPPPFFF 470 473 06 441 62 422 102VIPFPGR 415 413 05 438 56 423 19PFPGPIP 360 354 16 400 112 398 107RGPPGFG 368 356 31 347 57 349 52RGPFPIV 395 385 25 392 07 388 19RPFFRPFF 500 511 22 496 09 503 06RGPKPIIV 408 387 50 421 32 383 63VYPFPPGI 382 423 107 421 101 415 87RGPEPIIV 451 387 142 412 87 380 158RGPPGGFF 411 337 180 328 201 328 203GGRPFFGG 440 430 23 392 108 381 133RGPPGGGFF 395 405 26 424 74 409 35GGRGPPFIV 410 395 36 435 61 430 49RGPPFIVGG 431 406 57 404 63 401 71FFRPFFRPFF 515 549 67 421 182 406 211PVRGPFPIIV 540 416 229 482 108 418 226VYPFPPGINH 430 448 43 451 49 438 20VYPFPPGIGG 352 420 192 308 124 300 149VYPFGGGINH 364 396 87 401 101 388 66RPFFRPFFRPFF 500 544 89 504 08 522 44RGPPFIVRGPPFIV 440 492 118 387 120 375 148PVLGPVRGPFPIIV 483 436 97 511 59 448 72

Test data setRG 211 197 64 229 84 201 46AV 116 171 476 159 374 158 363ID 137 207 514 207 512 197 439VD 190 194 22 236 245 215 134LV 223 189 151 175 216 182 184VI 223 190 147 199 109 196 121FP 277 254 83 265 43 246 111FF 301 275 86 289 40 282 65AF 181 241 334 255 407 239 318PGR 160 265 654 260 623 265 659RRR 240 355 480 352 468 355 478FIV 283 277 21 268 52 271 42GGY 283 244 138 255 97 245 135GRP 310 264 148 254 181 264 150YYG 320 310 32 317 09 311 29KPF 340 318 64 314 76 311 85GLG 200 158 213 157 215 169 157

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGPG 170 209 232 211 240 201 181KPK 252 297 177 282 118 302 199VYP 252 297 177 287 139 290 150GGGL 234 226 32 239 21 210 100RPFG 341 347 18 337 11 334 19RGFF 380 369 30 395 40 383 07GGLGG 190 229 206 193 17 192 09GGGGL 265 221 166 240 95 217 179PGPGPG 260 310 192 314 206 309 190PFPIIV 390 343 120 355 90 340 127VIFPPG 268 380 418 371 384 366 366RGPPFIV 430 386 103 375 127 378 121VYPFPPG 352 406 154 409 161 395 122RGPFPIIV 540 414 233 440 185 414 234RGPGPIIV 481 384 203 426 115 388 192RRPPPFFF 570 478 162 476 165 469 176VIIPFPGR 385 454 180 447 161 451 171VYPFPPIGNH 430 446 37 448 41 438 18GGRGPPFIVGG 440 421 43 421 44 411 66

Validation data setIN 149 211 419 212 424 201 348WW 360 299 169 318 117 305 153GV 174 222 278 219 258 208 198FPK 252 310 229 294 167 314 247FPP 234 265 133 256 96 262 121PGP 204 256 254 257 260 241 180VIF 289 278 38 279 36 282 23RPPFIV 410 374 87 361 120 365 110RGPPFGG 323 337 45 323 01 334 33RGPPFIIV 430 412 41 420 24 399 72

of the training and test sets is one of the most importantsteps in model development and should be done so thateach set reflects the original data set as much as possible 119870-means clustering is one of the most frequently used methodsof data set splitting This procedure identifies relativelyhomogeneous groups of molecules (each observation has thenearest distance to the mean of the cluster) based on selectedproperties (biologic activities and structural parameters)Using this method we divided the data set into 10 clustersand three subsets (ie training test and validation sets) wereselected from themThe validation set was excluded from thestudy before the descriptor selection step and the test set wasexcluded before the model development step

25 Descriptor Selection In order to reduce the dimensionof the variable matrix a correlation analysis was carriedout and both the highly correlated and constant variableswere excluded A home-developed MATLAB toolbox wasemployed that considered the following criteria for excludinga variable [1] the intercorrelation with other descriptorshigher than 099 [2] the correlation with activity lower than

6 BioMed Research International

Bitterness

Total descriptors

(1) Removing constantdescriptors

(2) Removing highlycorrelated descriptors

Descriptors Descriptors

Descriptors

Training test andvalidation sets selection

using K-means clustering

Fina

l sele

cted

de

scrip

tors

Stepwise regressiondescriptor selection G

A-PL

S se

lect

edde

scrip

tors

GA-PLS descriptorselection

Model development(1) MLR(2) SVM(3) ANN

Prediction of thebitterness for training test

and validation sets

Validation of thedeveloped modelsaccording to the

guidelines

Figure 1 Procedure of descriptor selection QSBR model development and validation processes

intercorrelated descriptors and [3] the frequency of repeatedvalues for a descriptor (lower than 10 of cases)

After this step a genetic algorithm-partial least square(GA-PLS) algorithm [36] was used to select the most signif-icant variables GA-PLS (a combination of genetic algorithmand PLS regression) was also developed and utilized as avariable selection method in QSAR and QSPR studies byLeardi [36] and applied for QSAR studies [37 38] TheMATLAB 78 software was used to run the GA-PLS methoddeveloped by Leardi The variables were divided into sub-groups (containing up to 200 variables) and each subgroupwith the corresponding log 1119879 values (119884) was introduced tothe algorithm as input The output (containing descriptorsscoring based on the cross-validated percentage of explainedvariance) was produced after 100 runs As the results of theGA-PLS were a bit different for each run the top 50 scoresof each subgroup were combined with the next 100 variablesand this step was repeated up to the final step A comparativestudy was done in this study to check the capability of GA-PLS for descriptor selection in comparison with commonmethods (ie PLS and stepwise) Results showed that GA-PLS was able to extract more significant descriptorsThe onlydrawback of this method was the scoring differences betweendifferent runs which could be solved using the mentionedprocedure (ie selection of 50 of each run instead of limiteddescriptors [37])

Twenty percent of high score variables of the final stepwere selected as the most significant variables and were fur-ther studied by stepwise regression and bivariate correlationanalysis in which the selected variables using stepwise regres-sionwere investigated concerning their intercorrelationsTheless intercorrelated variables were selected for the final modeldevelopment process The complete procedure of descriptorselection is summarized in Figure 1

26 Model Building Linear and nonlinear models weredeveloped using the selected descriptors The details of themodel development and validation process are discussed inthe next sections

27 Linear Model Using Multiple Linear Regression (MLR)The selected parameters were used for developing QSBRequations using the multiple linear regression correlation(MLR) method and the goodness of fit and statistical sig-nificance of the models were evaluated using 1198772 (coefficientof determination) 119865 (variance ratio) and the MPD (meanpercentage deviation) values calculated using

MPD = 100119873

sum

100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(1)

The relative frequency of individual percentage deviations(IPD) was studied in order to define the model predictioncapacity for each data point The IPD was calculated using

IPD = 100100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(2)

In order to compare the MPD values with the relativestandard deviations (RSD) between experimental bitter val-ues measured by different research groups (inter laboratoryrelative standard deviations (ILRSD)) the ILRSD values forthe available data were computed using

ILRSD = 100119873

100381610038161003816100381610038161003816100381610038161003816

119884exp minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(3)

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

BioMed Research International 3

This work aims to build suitable models for predicting thebitterness of 224 peptides and 5 amino acids on the basis oftheir calculated structural parameters Structural parameters(0Dndash3D) were calculated using Dragon [28] software Theselected parameters were used to develop linear and nonlin-ear models and the prediction capability and robustness ofthe proposedmodels were studied using internal and externalvalidation methods according to the QSAR method valida-tions guidelines [29] and references [30ndash33] The impacts ofthe selected parameters and structural features on bitternesswere evaluated according to the parameter definitions

2 Materials and Methods

21 Data Set A total of 229 experimental bitterness values(224 peptides and 5 amino acids) determined by humansensory evaluations were obtained from [3] The details ofpeptide sequences and the bitterness activity expressed as alog(1119879) (119879 being the bitter threshold concentration (119872)) aresummarized in Table 2 The ranges of bitterness were 10ndash57and the numbers of amino acids of the studied peptides were2ndash14

22 Calculated Descriptors The 2D structures of all mole-cules were drawn and converted to 3D structures usingthe HyperChem 7 software The completed model and themolecularmechanics energyminimizedmolecules were usedas inputs for the Dragon 54 software [28] The softwarecalculated 20 subsets of molecular descriptors including2D autocorrelation 3D-MoRSE descriptors centered frag-ments Burden eigenvalues connectivity indices constitu-tional descriptors edge adjacency indices eigenvalue-basedindices functional group counts geometrical descriptorsGETAWAYdescriptors information indicesmolecular prop-erties Randic molecular profiles RDF descriptors topolog-ical charge indices topological descriptors walk and pathcounts and WHIM descriptors The structural parameterscalculated after discarding the constant and near-constantvalues (1295 descriptors) were saved and further analyzedusing the SPSS 115 STATISTICA 7 and MATLAB 78 soft-ware

23 Outlier Detection In order to identify possible out-liers two different methods of principal component analysis(PCA) mapping and standard scores were used Accordingto the nature of the outliers which could be related tothe different mechanism of binding because of the differentstructural features recording errors or inaccurate design ofsamples no single outlier detection approach could identifyall kinds of outliers [34] In this study two approaches onein descriptor space (PCA plot of scores) and the secondin response space (standard score) were used to check theoutliers before the main numerical analysis [35]

24 Training-Test Selection The training set plays an impor-tant role in developing the properties of the model as themore similar the molecules for training the model are themore accurate the expected results are Thus the selection

Table 2 Details of the observed-predicted (using MLR SVM andANNmethods) bitterness and corresponding errors (IPD)

Peptide sequenceObserved MLR IPD SVM IPD ANN IPDTraining data set

R 160 120 251 181 134 190 185F 170 186 93 183 75 186 95P 190 187 16 236 243 216 135L 170 182 69 182 73 186 94V 170 162 50 161 53 168 11GR 100 200 1001 231 1307 206 1060RR 211 238 129 269 274 284 347PP 234 205 124 235 02 226 33KP 252 239 50 328 301 320 268PR 252 197 217 326 293 322 279RF 260 268 30 281 81 276 60RP 310 234 245 305 15 303 21LI 240 207 138 212 115 201 164KF 204 246 208 183 105 184 100VF 252 246 25 232 80 236 63VY 252 244 31 274 88 277 99YG 252 237 61 248 17 239 53YY 263 282 72 247 60 247 60FI 283 262 76 232 180 206 272IF 283 254 102 358 266 352 244YF 310 270 129 331 68 329 62VA 116 163 409 155 339 149 286VG 119 134 129 156 308 136 143PA 132 162 225 159 205 155 172IE 137 213 555 230 682 212 547IQ 149 191 283 197 325 193 297IS 149 180 207 167 124 180 207IT 149 211 413 208 396 203 362SL 149 199 337 214 438 192 287WE 156 263 686 270 730 265 696IK 165 240 455 242 465 233 413IA 168 177 55 164 23 166 11AL 170 152 109 178 47 177 38VV 171 159 69 170 05 163 48LA 172 184 69 172 01 176 23PY 180 209 159 193 73 202 121GW 189 218 155 200 57 215 139PL 222 198 109 221 02 195 120PI 233 185 205 210 99 192 176IP 240 206 141 225 63 203 153YL 240 252 52 261 89 255 61LY 246 279 134 289 173 276 122IW 305 284 69 286 63 276 95FY 313 280 104 290 75 284 92LW 340 263 226 280 178 286 160IV 190 180 53 231 215 206 82FV 223 255 144 229 28 206 74VE 223 199 106 257 154 246 103

4 BioMed Research International

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDYP 170 251 478 172 10 157 79PK 223 206 78 227 17 210 59LD 223 219 17 226 13 209 62AD 223 215 37 222 05 200 105LE 252 215 146 233 75 213 155GE 283 213 248 263 72 258 88VL 211 204 32 247 169 237 124GP 179 228 273 209 167 194 86FG 200 224 121 225 125 226 128PF 214 228 64 234 93 229 69GF 236 225 45 230 25 224 52GY 215 214 04 212 15 217 08LF 282 252 105 270 42 255 94FL 285 258 94 264 74 257 97GL 164 180 96 170 40 175 66LG 171 170 07 158 77 170 06IG 201 171 151 169 160 170 156GI 217 191 121 182 163 184 154LL 247 221 105 233 56 208 160II 254 205 191 224 119 205 191IL 254 212 166 229 99 208 182RGP 190 251 318 166 124 183 39FGG 234 227 32 210 103 209 109GFG 252 235 67 225 109 210 165PPP 270 239 116 262 31 242 103GLL 283 237 164 233 176 207 267RPG 310 271 127 246 205 237 235LGG 100 168 676 175 750 167 671GGL 200 199 04 200 02 192 41LGL 230 238 33 243 58 234 17LLG 230 235 21 239 40 231 03PIP 285 237 167 371 301 351 232LLL 292 270 75 306 48 294 09GYG 170 172 13 263 545 242 422GVV 234 205 124 298 273 307 313VVV 234 238 18 176 249 164 298PPF 263 248 57 338 284 338 285YGG 263 260 13 267 16 246 63RPF 283 328 159 255 98 244 138FGF 292 285 24 324 111 329 125DLL 310 318 26 263 152 267 137GFF 323 304 59 314 28 308 47FPF 340 331 28 338 06 325 44GYY 340 302 112 315 74 306 99PFP 340 288 153 274 195 258 242YYY 370 345 68 373 07 357 35GGV 148 193 302 175 184 179 210PGG 234 169 276 202 136 200 146GGF 283 242 144 239 154 236 168GGP 204 195 44 260 273 251 231

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDPPG 204 222 86 235 154 216 57FFG 265 293 105 306 155 299 128ELL 340 291 145 273 197 282 170FFF 370 344 71 428 157 394 64EGG 283 206 272 271 42 261 79YGY 310 284 85 287 74 279 100YPF 352 303 139 318 96 304 137GGLG 160 192 198 155 33 134 162FGFG 352 294 164 355 10 356 12GPFF 380 321 156 319 160 315 170RPGF 380 339 109 341 102 336 116FGGF 392 308 215 332 153 32 183RPFF 440 382 132 442 04 424 36GLGG 170 184 82 170 02 174 23LGGG 190 198 41 226 189 192 08PFPP 234 314 343 239 19 241 28GPPF 252 288 142 256 17 247 19RRPP 27 314 164 300 112 311 153VYPF 352 346 16 297 157 305 135PFIV 352 312 115 328 69 333 55FFPR 400 387 32 379 52 387 33FFPP 252 327 296 251 03 243 35FFPE 276 356 288 349 263 355 286GGFF 285 348 220 239 161 234 181FFPG 290 338 166 337 163 335 156LLLL 323 335 37 318 16 323 00FFGG 252 320 270 241 44 231 82RRPFF 470 415 117 469 03 463 14GGGLG 190 228 199 234 231 208 94GLGGG 190 215 129 177 70 169 108LGGGG 190 216 136 251 323 249 311GGVVV 211 284 346 221 47 202 45FFPGG 283 358 266 318 124 318 122PGPIP 311 325 46 326 48 319 26RGPPF 263 345 312 295 122 289 98PPFIV 292 330 129 264 96 269 79RPGFF 351 382 88 398 134 386 99RPGGFF 404 395 23 428 58 406 06PFPGPI 336 363 81 365 87 356 60RPFFGG 392 411 47 422 76 407 39RGPPGF 352 331 59 355 09 343 26RGPPFI 460 360 217 347 247 352 234RGPFIV 430 384 107 383 109 376 126RGGFIV 310 335 80 302 26 289 67RGPPFF 423 404 46 413 24 398 59RRPPGF 440 373 153 479 90 475 79RRPPFF 515 413 199 508 14 540 49GPPFIV 292 340 166 328 124 320 95GGFFGG 370 402 88 372 04 355 40KPPFIV 382 356 68 342 104 351 82

BioMed Research International 5

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGGRPFF 404 407 06 398 15 383 51PVLGPV 330 325 16 326 12 321 28FPPFIV 352 366 40 418 188 409 163RGPPGGV 248 307 239 293 182 296 192RGPPGFF 440 397 97 404 83 393 107RGPPFFF 500 445 111 462 76 447 106VIIPFPG 360 408 133 349 31 35 28VIFPPGR 410 433 55 387 56 383 66RGPPGIG 278 349 255 341 227 343 235RGPPGGF 308 325 54 317 30 310 08YPFPGPI 380 409 77 421 107 405 66RPPPFFF 470 473 06 441 62 422 102VIPFPGR 415 413 05 438 56 423 19PFPGPIP 360 354 16 400 112 398 107RGPPGFG 368 356 31 347 57 349 52RGPFPIV 395 385 25 392 07 388 19RPFFRPFF 500 511 22 496 09 503 06RGPKPIIV 408 387 50 421 32 383 63VYPFPPGI 382 423 107 421 101 415 87RGPEPIIV 451 387 142 412 87 380 158RGPPGGFF 411 337 180 328 201 328 203GGRPFFGG 440 430 23 392 108 381 133RGPPGGGFF 395 405 26 424 74 409 35GGRGPPFIV 410 395 36 435 61 430 49RGPPFIVGG 431 406 57 404 63 401 71FFRPFFRPFF 515 549 67 421 182 406 211PVRGPFPIIV 540 416 229 482 108 418 226VYPFPPGINH 430 448 43 451 49 438 20VYPFPPGIGG 352 420 192 308 124 300 149VYPFGGGINH 364 396 87 401 101 388 66RPFFRPFFRPFF 500 544 89 504 08 522 44RGPPFIVRGPPFIV 440 492 118 387 120 375 148PVLGPVRGPFPIIV 483 436 97 511 59 448 72

Test data setRG 211 197 64 229 84 201 46AV 116 171 476 159 374 158 363ID 137 207 514 207 512 197 439VD 190 194 22 236 245 215 134LV 223 189 151 175 216 182 184VI 223 190 147 199 109 196 121FP 277 254 83 265 43 246 111FF 301 275 86 289 40 282 65AF 181 241 334 255 407 239 318PGR 160 265 654 260 623 265 659RRR 240 355 480 352 468 355 478FIV 283 277 21 268 52 271 42GGY 283 244 138 255 97 245 135GRP 310 264 148 254 181 264 150YYG 320 310 32 317 09 311 29KPF 340 318 64 314 76 311 85GLG 200 158 213 157 215 169 157

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGPG 170 209 232 211 240 201 181KPK 252 297 177 282 118 302 199VYP 252 297 177 287 139 290 150GGGL 234 226 32 239 21 210 100RPFG 341 347 18 337 11 334 19RGFF 380 369 30 395 40 383 07GGLGG 190 229 206 193 17 192 09GGGGL 265 221 166 240 95 217 179PGPGPG 260 310 192 314 206 309 190PFPIIV 390 343 120 355 90 340 127VIFPPG 268 380 418 371 384 366 366RGPPFIV 430 386 103 375 127 378 121VYPFPPG 352 406 154 409 161 395 122RGPFPIIV 540 414 233 440 185 414 234RGPGPIIV 481 384 203 426 115 388 192RRPPPFFF 570 478 162 476 165 469 176VIIPFPGR 385 454 180 447 161 451 171VYPFPPIGNH 430 446 37 448 41 438 18GGRGPPFIVGG 440 421 43 421 44 411 66

Validation data setIN 149 211 419 212 424 201 348WW 360 299 169 318 117 305 153GV 174 222 278 219 258 208 198FPK 252 310 229 294 167 314 247FPP 234 265 133 256 96 262 121PGP 204 256 254 257 260 241 180VIF 289 278 38 279 36 282 23RPPFIV 410 374 87 361 120 365 110RGPPFGG 323 337 45 323 01 334 33RGPPFIIV 430 412 41 420 24 399 72

of the training and test sets is one of the most importantsteps in model development and should be done so thateach set reflects the original data set as much as possible 119870-means clustering is one of the most frequently used methodsof data set splitting This procedure identifies relativelyhomogeneous groups of molecules (each observation has thenearest distance to the mean of the cluster) based on selectedproperties (biologic activities and structural parameters)Using this method we divided the data set into 10 clustersand three subsets (ie training test and validation sets) wereselected from themThe validation set was excluded from thestudy before the descriptor selection step and the test set wasexcluded before the model development step

25 Descriptor Selection In order to reduce the dimensionof the variable matrix a correlation analysis was carriedout and both the highly correlated and constant variableswere excluded A home-developed MATLAB toolbox wasemployed that considered the following criteria for excludinga variable [1] the intercorrelation with other descriptorshigher than 099 [2] the correlation with activity lower than

6 BioMed Research International

Bitterness

Total descriptors

(1) Removing constantdescriptors

(2) Removing highlycorrelated descriptors

Descriptors Descriptors

Descriptors

Training test andvalidation sets selection

using K-means clustering

Fina

l sele

cted

de

scrip

tors

Stepwise regressiondescriptor selection G

A-PL

S se

lect

edde

scrip

tors

GA-PLS descriptorselection

Model development(1) MLR(2) SVM(3) ANN

Prediction of thebitterness for training test

and validation sets

Validation of thedeveloped modelsaccording to the

guidelines

Figure 1 Procedure of descriptor selection QSBR model development and validation processes

intercorrelated descriptors and [3] the frequency of repeatedvalues for a descriptor (lower than 10 of cases)

After this step a genetic algorithm-partial least square(GA-PLS) algorithm [36] was used to select the most signif-icant variables GA-PLS (a combination of genetic algorithmand PLS regression) was also developed and utilized as avariable selection method in QSAR and QSPR studies byLeardi [36] and applied for QSAR studies [37 38] TheMATLAB 78 software was used to run the GA-PLS methoddeveloped by Leardi The variables were divided into sub-groups (containing up to 200 variables) and each subgroupwith the corresponding log 1119879 values (119884) was introduced tothe algorithm as input The output (containing descriptorsscoring based on the cross-validated percentage of explainedvariance) was produced after 100 runs As the results of theGA-PLS were a bit different for each run the top 50 scoresof each subgroup were combined with the next 100 variablesand this step was repeated up to the final step A comparativestudy was done in this study to check the capability of GA-PLS for descriptor selection in comparison with commonmethods (ie PLS and stepwise) Results showed that GA-PLS was able to extract more significant descriptorsThe onlydrawback of this method was the scoring differences betweendifferent runs which could be solved using the mentionedprocedure (ie selection of 50 of each run instead of limiteddescriptors [37])

Twenty percent of high score variables of the final stepwere selected as the most significant variables and were fur-ther studied by stepwise regression and bivariate correlationanalysis in which the selected variables using stepwise regres-sionwere investigated concerning their intercorrelationsTheless intercorrelated variables were selected for the final modeldevelopment process The complete procedure of descriptorselection is summarized in Figure 1

26 Model Building Linear and nonlinear models weredeveloped using the selected descriptors The details of themodel development and validation process are discussed inthe next sections

27 Linear Model Using Multiple Linear Regression (MLR)The selected parameters were used for developing QSBRequations using the multiple linear regression correlation(MLR) method and the goodness of fit and statistical sig-nificance of the models were evaluated using 1198772 (coefficientof determination) 119865 (variance ratio) and the MPD (meanpercentage deviation) values calculated using

MPD = 100119873

sum

100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(1)

The relative frequency of individual percentage deviations(IPD) was studied in order to define the model predictioncapacity for each data point The IPD was calculated using

IPD = 100100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(2)

In order to compare the MPD values with the relativestandard deviations (RSD) between experimental bitter val-ues measured by different research groups (inter laboratoryrelative standard deviations (ILRSD)) the ILRSD values forthe available data were computed using

ILRSD = 100119873

100381610038161003816100381610038161003816100381610038161003816

119884exp minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(3)

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

4 BioMed Research International

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDYP 170 251 478 172 10 157 79PK 223 206 78 227 17 210 59LD 223 219 17 226 13 209 62AD 223 215 37 222 05 200 105LE 252 215 146 233 75 213 155GE 283 213 248 263 72 258 88VL 211 204 32 247 169 237 124GP 179 228 273 209 167 194 86FG 200 224 121 225 125 226 128PF 214 228 64 234 93 229 69GF 236 225 45 230 25 224 52GY 215 214 04 212 15 217 08LF 282 252 105 270 42 255 94FL 285 258 94 264 74 257 97GL 164 180 96 170 40 175 66LG 171 170 07 158 77 170 06IG 201 171 151 169 160 170 156GI 217 191 121 182 163 184 154LL 247 221 105 233 56 208 160II 254 205 191 224 119 205 191IL 254 212 166 229 99 208 182RGP 190 251 318 166 124 183 39FGG 234 227 32 210 103 209 109GFG 252 235 67 225 109 210 165PPP 270 239 116 262 31 242 103GLL 283 237 164 233 176 207 267RPG 310 271 127 246 205 237 235LGG 100 168 676 175 750 167 671GGL 200 199 04 200 02 192 41LGL 230 238 33 243 58 234 17LLG 230 235 21 239 40 231 03PIP 285 237 167 371 301 351 232LLL 292 270 75 306 48 294 09GYG 170 172 13 263 545 242 422GVV 234 205 124 298 273 307 313VVV 234 238 18 176 249 164 298PPF 263 248 57 338 284 338 285YGG 263 260 13 267 16 246 63RPF 283 328 159 255 98 244 138FGF 292 285 24 324 111 329 125DLL 310 318 26 263 152 267 137GFF 323 304 59 314 28 308 47FPF 340 331 28 338 06 325 44GYY 340 302 112 315 74 306 99PFP 340 288 153 274 195 258 242YYY 370 345 68 373 07 357 35GGV 148 193 302 175 184 179 210PGG 234 169 276 202 136 200 146GGF 283 242 144 239 154 236 168GGP 204 195 44 260 273 251 231

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDPPG 204 222 86 235 154 216 57FFG 265 293 105 306 155 299 128ELL 340 291 145 273 197 282 170FFF 370 344 71 428 157 394 64EGG 283 206 272 271 42 261 79YGY 310 284 85 287 74 279 100YPF 352 303 139 318 96 304 137GGLG 160 192 198 155 33 134 162FGFG 352 294 164 355 10 356 12GPFF 380 321 156 319 160 315 170RPGF 380 339 109 341 102 336 116FGGF 392 308 215 332 153 32 183RPFF 440 382 132 442 04 424 36GLGG 170 184 82 170 02 174 23LGGG 190 198 41 226 189 192 08PFPP 234 314 343 239 19 241 28GPPF 252 288 142 256 17 247 19RRPP 27 314 164 300 112 311 153VYPF 352 346 16 297 157 305 135PFIV 352 312 115 328 69 333 55FFPR 400 387 32 379 52 387 33FFPP 252 327 296 251 03 243 35FFPE 276 356 288 349 263 355 286GGFF 285 348 220 239 161 234 181FFPG 290 338 166 337 163 335 156LLLL 323 335 37 318 16 323 00FFGG 252 320 270 241 44 231 82RRPFF 470 415 117 469 03 463 14GGGLG 190 228 199 234 231 208 94GLGGG 190 215 129 177 70 169 108LGGGG 190 216 136 251 323 249 311GGVVV 211 284 346 221 47 202 45FFPGG 283 358 266 318 124 318 122PGPIP 311 325 46 326 48 319 26RGPPF 263 345 312 295 122 289 98PPFIV 292 330 129 264 96 269 79RPGFF 351 382 88 398 134 386 99RPGGFF 404 395 23 428 58 406 06PFPGPI 336 363 81 365 87 356 60RPFFGG 392 411 47 422 76 407 39RGPPGF 352 331 59 355 09 343 26RGPPFI 460 360 217 347 247 352 234RGPFIV 430 384 107 383 109 376 126RGGFIV 310 335 80 302 26 289 67RGPPFF 423 404 46 413 24 398 59RRPPGF 440 373 153 479 90 475 79RRPPFF 515 413 199 508 14 540 49GPPFIV 292 340 166 328 124 320 95GGFFGG 370 402 88 372 04 355 40KPPFIV 382 356 68 342 104 351 82

BioMed Research International 5

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGGRPFF 404 407 06 398 15 383 51PVLGPV 330 325 16 326 12 321 28FPPFIV 352 366 40 418 188 409 163RGPPGGV 248 307 239 293 182 296 192RGPPGFF 440 397 97 404 83 393 107RGPPFFF 500 445 111 462 76 447 106VIIPFPG 360 408 133 349 31 35 28VIFPPGR 410 433 55 387 56 383 66RGPPGIG 278 349 255 341 227 343 235RGPPGGF 308 325 54 317 30 310 08YPFPGPI 380 409 77 421 107 405 66RPPPFFF 470 473 06 441 62 422 102VIPFPGR 415 413 05 438 56 423 19PFPGPIP 360 354 16 400 112 398 107RGPPGFG 368 356 31 347 57 349 52RGPFPIV 395 385 25 392 07 388 19RPFFRPFF 500 511 22 496 09 503 06RGPKPIIV 408 387 50 421 32 383 63VYPFPPGI 382 423 107 421 101 415 87RGPEPIIV 451 387 142 412 87 380 158RGPPGGFF 411 337 180 328 201 328 203GGRPFFGG 440 430 23 392 108 381 133RGPPGGGFF 395 405 26 424 74 409 35GGRGPPFIV 410 395 36 435 61 430 49RGPPFIVGG 431 406 57 404 63 401 71FFRPFFRPFF 515 549 67 421 182 406 211PVRGPFPIIV 540 416 229 482 108 418 226VYPFPPGINH 430 448 43 451 49 438 20VYPFPPGIGG 352 420 192 308 124 300 149VYPFGGGINH 364 396 87 401 101 388 66RPFFRPFFRPFF 500 544 89 504 08 522 44RGPPFIVRGPPFIV 440 492 118 387 120 375 148PVLGPVRGPFPIIV 483 436 97 511 59 448 72

Test data setRG 211 197 64 229 84 201 46AV 116 171 476 159 374 158 363ID 137 207 514 207 512 197 439VD 190 194 22 236 245 215 134LV 223 189 151 175 216 182 184VI 223 190 147 199 109 196 121FP 277 254 83 265 43 246 111FF 301 275 86 289 40 282 65AF 181 241 334 255 407 239 318PGR 160 265 654 260 623 265 659RRR 240 355 480 352 468 355 478FIV 283 277 21 268 52 271 42GGY 283 244 138 255 97 245 135GRP 310 264 148 254 181 264 150YYG 320 310 32 317 09 311 29KPF 340 318 64 314 76 311 85GLG 200 158 213 157 215 169 157

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGPG 170 209 232 211 240 201 181KPK 252 297 177 282 118 302 199VYP 252 297 177 287 139 290 150GGGL 234 226 32 239 21 210 100RPFG 341 347 18 337 11 334 19RGFF 380 369 30 395 40 383 07GGLGG 190 229 206 193 17 192 09GGGGL 265 221 166 240 95 217 179PGPGPG 260 310 192 314 206 309 190PFPIIV 390 343 120 355 90 340 127VIFPPG 268 380 418 371 384 366 366RGPPFIV 430 386 103 375 127 378 121VYPFPPG 352 406 154 409 161 395 122RGPFPIIV 540 414 233 440 185 414 234RGPGPIIV 481 384 203 426 115 388 192RRPPPFFF 570 478 162 476 165 469 176VIIPFPGR 385 454 180 447 161 451 171VYPFPPIGNH 430 446 37 448 41 438 18GGRGPPFIVGG 440 421 43 421 44 411 66

Validation data setIN 149 211 419 212 424 201 348WW 360 299 169 318 117 305 153GV 174 222 278 219 258 208 198FPK 252 310 229 294 167 314 247FPP 234 265 133 256 96 262 121PGP 204 256 254 257 260 241 180VIF 289 278 38 279 36 282 23RPPFIV 410 374 87 361 120 365 110RGPPFGG 323 337 45 323 01 334 33RGPPFIIV 430 412 41 420 24 399 72

of the training and test sets is one of the most importantsteps in model development and should be done so thateach set reflects the original data set as much as possible 119870-means clustering is one of the most frequently used methodsof data set splitting This procedure identifies relativelyhomogeneous groups of molecules (each observation has thenearest distance to the mean of the cluster) based on selectedproperties (biologic activities and structural parameters)Using this method we divided the data set into 10 clustersand three subsets (ie training test and validation sets) wereselected from themThe validation set was excluded from thestudy before the descriptor selection step and the test set wasexcluded before the model development step

25 Descriptor Selection In order to reduce the dimensionof the variable matrix a correlation analysis was carriedout and both the highly correlated and constant variableswere excluded A home-developed MATLAB toolbox wasemployed that considered the following criteria for excludinga variable [1] the intercorrelation with other descriptorshigher than 099 [2] the correlation with activity lower than

6 BioMed Research International

Bitterness

Total descriptors

(1) Removing constantdescriptors

(2) Removing highlycorrelated descriptors

Descriptors Descriptors

Descriptors

Training test andvalidation sets selection

using K-means clustering

Fina

l sele

cted

de

scrip

tors

Stepwise regressiondescriptor selection G

A-PL

S se

lect

edde

scrip

tors

GA-PLS descriptorselection

Model development(1) MLR(2) SVM(3) ANN

Prediction of thebitterness for training test

and validation sets

Validation of thedeveloped modelsaccording to the

guidelines

Figure 1 Procedure of descriptor selection QSBR model development and validation processes

intercorrelated descriptors and [3] the frequency of repeatedvalues for a descriptor (lower than 10 of cases)

After this step a genetic algorithm-partial least square(GA-PLS) algorithm [36] was used to select the most signif-icant variables GA-PLS (a combination of genetic algorithmand PLS regression) was also developed and utilized as avariable selection method in QSAR and QSPR studies byLeardi [36] and applied for QSAR studies [37 38] TheMATLAB 78 software was used to run the GA-PLS methoddeveloped by Leardi The variables were divided into sub-groups (containing up to 200 variables) and each subgroupwith the corresponding log 1119879 values (119884) was introduced tothe algorithm as input The output (containing descriptorsscoring based on the cross-validated percentage of explainedvariance) was produced after 100 runs As the results of theGA-PLS were a bit different for each run the top 50 scoresof each subgroup were combined with the next 100 variablesand this step was repeated up to the final step A comparativestudy was done in this study to check the capability of GA-PLS for descriptor selection in comparison with commonmethods (ie PLS and stepwise) Results showed that GA-PLS was able to extract more significant descriptorsThe onlydrawback of this method was the scoring differences betweendifferent runs which could be solved using the mentionedprocedure (ie selection of 50 of each run instead of limiteddescriptors [37])

Twenty percent of high score variables of the final stepwere selected as the most significant variables and were fur-ther studied by stepwise regression and bivariate correlationanalysis in which the selected variables using stepwise regres-sionwere investigated concerning their intercorrelationsTheless intercorrelated variables were selected for the final modeldevelopment process The complete procedure of descriptorselection is summarized in Figure 1

26 Model Building Linear and nonlinear models weredeveloped using the selected descriptors The details of themodel development and validation process are discussed inthe next sections

27 Linear Model Using Multiple Linear Regression (MLR)The selected parameters were used for developing QSBRequations using the multiple linear regression correlation(MLR) method and the goodness of fit and statistical sig-nificance of the models were evaluated using 1198772 (coefficientof determination) 119865 (variance ratio) and the MPD (meanpercentage deviation) values calculated using

MPD = 100119873

sum

100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(1)

The relative frequency of individual percentage deviations(IPD) was studied in order to define the model predictioncapacity for each data point The IPD was calculated using

IPD = 100100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(2)

In order to compare the MPD values with the relativestandard deviations (RSD) between experimental bitter val-ues measured by different research groups (inter laboratoryrelative standard deviations (ILRSD)) the ILRSD values forthe available data were computed using

ILRSD = 100119873

100381610038161003816100381610038161003816100381610038161003816

119884exp minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(3)

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

BioMed Research International 5

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGGRPFF 404 407 06 398 15 383 51PVLGPV 330 325 16 326 12 321 28FPPFIV 352 366 40 418 188 409 163RGPPGGV 248 307 239 293 182 296 192RGPPGFF 440 397 97 404 83 393 107RGPPFFF 500 445 111 462 76 447 106VIIPFPG 360 408 133 349 31 35 28VIFPPGR 410 433 55 387 56 383 66RGPPGIG 278 349 255 341 227 343 235RGPPGGF 308 325 54 317 30 310 08YPFPGPI 380 409 77 421 107 405 66RPPPFFF 470 473 06 441 62 422 102VIPFPGR 415 413 05 438 56 423 19PFPGPIP 360 354 16 400 112 398 107RGPPGFG 368 356 31 347 57 349 52RGPFPIV 395 385 25 392 07 388 19RPFFRPFF 500 511 22 496 09 503 06RGPKPIIV 408 387 50 421 32 383 63VYPFPPGI 382 423 107 421 101 415 87RGPEPIIV 451 387 142 412 87 380 158RGPPGGFF 411 337 180 328 201 328 203GGRPFFGG 440 430 23 392 108 381 133RGPPGGGFF 395 405 26 424 74 409 35GGRGPPFIV 410 395 36 435 61 430 49RGPPFIVGG 431 406 57 404 63 401 71FFRPFFRPFF 515 549 67 421 182 406 211PVRGPFPIIV 540 416 229 482 108 418 226VYPFPPGINH 430 448 43 451 49 438 20VYPFPPGIGG 352 420 192 308 124 300 149VYPFGGGINH 364 396 87 401 101 388 66RPFFRPFFRPFF 500 544 89 504 08 522 44RGPPFIVRGPPFIV 440 492 118 387 120 375 148PVLGPVRGPFPIIV 483 436 97 511 59 448 72

Test data setRG 211 197 64 229 84 201 46AV 116 171 476 159 374 158 363ID 137 207 514 207 512 197 439VD 190 194 22 236 245 215 134LV 223 189 151 175 216 182 184VI 223 190 147 199 109 196 121FP 277 254 83 265 43 246 111FF 301 275 86 289 40 282 65AF 181 241 334 255 407 239 318PGR 160 265 654 260 623 265 659RRR 240 355 480 352 468 355 478FIV 283 277 21 268 52 271 42GGY 283 244 138 255 97 245 135GRP 310 264 148 254 181 264 150YYG 320 310 32 317 09 311 29KPF 340 318 64 314 76 311 85GLG 200 158 213 157 215 169 157

Table 2 Continued

Peptide sequence Observed MLR IPD SVM IPD ANN IPDGPG 170 209 232 211 240 201 181KPK 252 297 177 282 118 302 199VYP 252 297 177 287 139 290 150GGGL 234 226 32 239 21 210 100RPFG 341 347 18 337 11 334 19RGFF 380 369 30 395 40 383 07GGLGG 190 229 206 193 17 192 09GGGGL 265 221 166 240 95 217 179PGPGPG 260 310 192 314 206 309 190PFPIIV 390 343 120 355 90 340 127VIFPPG 268 380 418 371 384 366 366RGPPFIV 430 386 103 375 127 378 121VYPFPPG 352 406 154 409 161 395 122RGPFPIIV 540 414 233 440 185 414 234RGPGPIIV 481 384 203 426 115 388 192RRPPPFFF 570 478 162 476 165 469 176VIIPFPGR 385 454 180 447 161 451 171VYPFPPIGNH 430 446 37 448 41 438 18GGRGPPFIVGG 440 421 43 421 44 411 66

Validation data setIN 149 211 419 212 424 201 348WW 360 299 169 318 117 305 153GV 174 222 278 219 258 208 198FPK 252 310 229 294 167 314 247FPP 234 265 133 256 96 262 121PGP 204 256 254 257 260 241 180VIF 289 278 38 279 36 282 23RPPFIV 410 374 87 361 120 365 110RGPPFGG 323 337 45 323 01 334 33RGPPFIIV 430 412 41 420 24 399 72

of the training and test sets is one of the most importantsteps in model development and should be done so thateach set reflects the original data set as much as possible 119870-means clustering is one of the most frequently used methodsof data set splitting This procedure identifies relativelyhomogeneous groups of molecules (each observation has thenearest distance to the mean of the cluster) based on selectedproperties (biologic activities and structural parameters)Using this method we divided the data set into 10 clustersand three subsets (ie training test and validation sets) wereselected from themThe validation set was excluded from thestudy before the descriptor selection step and the test set wasexcluded before the model development step

25 Descriptor Selection In order to reduce the dimensionof the variable matrix a correlation analysis was carriedout and both the highly correlated and constant variableswere excluded A home-developed MATLAB toolbox wasemployed that considered the following criteria for excludinga variable [1] the intercorrelation with other descriptorshigher than 099 [2] the correlation with activity lower than

6 BioMed Research International

Bitterness

Total descriptors

(1) Removing constantdescriptors

(2) Removing highlycorrelated descriptors

Descriptors Descriptors

Descriptors

Training test andvalidation sets selection

using K-means clustering

Fina

l sele

cted

de

scrip

tors

Stepwise regressiondescriptor selection G

A-PL

S se

lect

edde

scrip

tors

GA-PLS descriptorselection

Model development(1) MLR(2) SVM(3) ANN

Prediction of thebitterness for training test

and validation sets

Validation of thedeveloped modelsaccording to the

guidelines

Figure 1 Procedure of descriptor selection QSBR model development and validation processes

intercorrelated descriptors and [3] the frequency of repeatedvalues for a descriptor (lower than 10 of cases)

After this step a genetic algorithm-partial least square(GA-PLS) algorithm [36] was used to select the most signif-icant variables GA-PLS (a combination of genetic algorithmand PLS regression) was also developed and utilized as avariable selection method in QSAR and QSPR studies byLeardi [36] and applied for QSAR studies [37 38] TheMATLAB 78 software was used to run the GA-PLS methoddeveloped by Leardi The variables were divided into sub-groups (containing up to 200 variables) and each subgroupwith the corresponding log 1119879 values (119884) was introduced tothe algorithm as input The output (containing descriptorsscoring based on the cross-validated percentage of explainedvariance) was produced after 100 runs As the results of theGA-PLS were a bit different for each run the top 50 scoresof each subgroup were combined with the next 100 variablesand this step was repeated up to the final step A comparativestudy was done in this study to check the capability of GA-PLS for descriptor selection in comparison with commonmethods (ie PLS and stepwise) Results showed that GA-PLS was able to extract more significant descriptorsThe onlydrawback of this method was the scoring differences betweendifferent runs which could be solved using the mentionedprocedure (ie selection of 50 of each run instead of limiteddescriptors [37])

Twenty percent of high score variables of the final stepwere selected as the most significant variables and were fur-ther studied by stepwise regression and bivariate correlationanalysis in which the selected variables using stepwise regres-sionwere investigated concerning their intercorrelationsTheless intercorrelated variables were selected for the final modeldevelopment process The complete procedure of descriptorselection is summarized in Figure 1

26 Model Building Linear and nonlinear models weredeveloped using the selected descriptors The details of themodel development and validation process are discussed inthe next sections

27 Linear Model Using Multiple Linear Regression (MLR)The selected parameters were used for developing QSBRequations using the multiple linear regression correlation(MLR) method and the goodness of fit and statistical sig-nificance of the models were evaluated using 1198772 (coefficientof determination) 119865 (variance ratio) and the MPD (meanpercentage deviation) values calculated using

MPD = 100119873

sum

100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(1)

The relative frequency of individual percentage deviations(IPD) was studied in order to define the model predictioncapacity for each data point The IPD was calculated using

IPD = 100100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(2)

In order to compare the MPD values with the relativestandard deviations (RSD) between experimental bitter val-ues measured by different research groups (inter laboratoryrelative standard deviations (ILRSD)) the ILRSD values forthe available data were computed using

ILRSD = 100119873

100381610038161003816100381610038161003816100381610038161003816

119884exp minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(3)

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

6 BioMed Research International

Bitterness

Total descriptors

(1) Removing constantdescriptors

(2) Removing highlycorrelated descriptors

Descriptors Descriptors

Descriptors

Training test andvalidation sets selection

using K-means clustering

Fina

l sele

cted

de

scrip

tors

Stepwise regressiondescriptor selection G

A-PL

S se

lect

edde

scrip

tors

GA-PLS descriptorselection

Model development(1) MLR(2) SVM(3) ANN

Prediction of thebitterness for training test

and validation sets

Validation of thedeveloped modelsaccording to the

guidelines

Figure 1 Procedure of descriptor selection QSBR model development and validation processes

intercorrelated descriptors and [3] the frequency of repeatedvalues for a descriptor (lower than 10 of cases)

After this step a genetic algorithm-partial least square(GA-PLS) algorithm [36] was used to select the most signif-icant variables GA-PLS (a combination of genetic algorithmand PLS regression) was also developed and utilized as avariable selection method in QSAR and QSPR studies byLeardi [36] and applied for QSAR studies [37 38] TheMATLAB 78 software was used to run the GA-PLS methoddeveloped by Leardi The variables were divided into sub-groups (containing up to 200 variables) and each subgroupwith the corresponding log 1119879 values (119884) was introduced tothe algorithm as input The output (containing descriptorsscoring based on the cross-validated percentage of explainedvariance) was produced after 100 runs As the results of theGA-PLS were a bit different for each run the top 50 scoresof each subgroup were combined with the next 100 variablesand this step was repeated up to the final step A comparativestudy was done in this study to check the capability of GA-PLS for descriptor selection in comparison with commonmethods (ie PLS and stepwise) Results showed that GA-PLS was able to extract more significant descriptorsThe onlydrawback of this method was the scoring differences betweendifferent runs which could be solved using the mentionedprocedure (ie selection of 50 of each run instead of limiteddescriptors [37])

Twenty percent of high score variables of the final stepwere selected as the most significant variables and were fur-ther studied by stepwise regression and bivariate correlationanalysis in which the selected variables using stepwise regres-sionwere investigated concerning their intercorrelationsTheless intercorrelated variables were selected for the final modeldevelopment process The complete procedure of descriptorselection is summarized in Figure 1

26 Model Building Linear and nonlinear models weredeveloped using the selected descriptors The details of themodel development and validation process are discussed inthe next sections

27 Linear Model Using Multiple Linear Regression (MLR)The selected parameters were used for developing QSBRequations using the multiple linear regression correlation(MLR) method and the goodness of fit and statistical sig-nificance of the models were evaluated using 1198772 (coefficientof determination) 119865 (variance ratio) and the MPD (meanpercentage deviation) values calculated using

MPD = 100119873

sum

100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(1)

The relative frequency of individual percentage deviations(IPD) was studied in order to define the model predictioncapacity for each data point The IPD was calculated using

IPD = 100100381610038161003816100381610038161003816100381610038161003816

119884pred minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(2)

In order to compare the MPD values with the relativestandard deviations (RSD) between experimental bitter val-ues measured by different research groups (inter laboratoryrelative standard deviations (ILRSD)) the ILRSD values forthe available data were computed using

ILRSD = 100119873

100381610038161003816100381610038161003816100381610038161003816

119884exp minus 119884exp(mean)

119884exp(mean)

100381610038161003816100381610038161003816100381610038161003816

(3)

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

BioMed Research International 7

28 Nonlinear Models Using Support Vector Regression (SVR)and Artificial Neural Network (ANN) In the next stage theselected descriptors were used to derive non-linear modelsusing SVR and ANN To construct an ANN the Levenberg-Marquardt algorithm [39] was used by nftool toolbox ofMATLAB 78 software to train the network Selected descrip-tors and the bitter activity of training sets were introducedas input and output values respectively The training set wasrandomly classified for training validation and test sets inorder to avoid overfitting and then networks were trainedSVR another non-linear model constructs a hyperplane ina multidimensional space that provides minimum error byemploying a non-linear Kernel function Some parameterssuch as capacity parameter (119862) 120576 are related to type of noisein the data and 120574 is related to radial base function (RBF)which is the most common type of Kernel functions TheSVMmodel and optimization of parameters were done usingSTATISTICA 7 software

29 Model Validation The developed models were evaluatedusing the leave-many-out (LMO) cross-validation methodThe 1199022 values were calculated using (A8) of the appendix

210 Chance Correlation To check the possibility of a chancecorrelation 10 times shuffled activities were correlated tothe variables and the produced regression coefficients werecompared with the developed model regression coefficient

211 External Validation of Proposed Models Using the Exter-nal Test Set A set of statistical criteria [30 31] was employedto analyze 36 data points of the test set

(1) 1198772 gt 06

where 1198772 is the coefficient of determination between thepredicted and observed values

(2) (1198772 minus 11987720)1198772lt 01 and 085 le 119870 le 115 or (1198772 minus

11987710158402

0)1198772lt 01 and 085 le 1198701015840 le 115

where 11987720is the coefficient of determination obtained using

predicted values relative to a regression line fit for experimen-tal values and required to pass through the origin and 11987710158402

0

is the corresponding coefficient obtained using experimentalvalues relative to a regression line fit for predicted values andrequired to pass through the origin 119870 and 1198701015840 are the slopesof regression lines through the origin for fits for experimentaland predicted data respectively

(3) |11987720minus 11987710158402

0| lt 03

In addition another criterion proposed by P P Roy andK Roy [32] was considered as

(4) 1198772119898= 1198772(1 minus radic119877

2minus 1198772

0)

in which 1198772119898gt 05 indicates the good external predictability

of the QSAR models

3 Results and Discussion

31 Outlier Detection According to standard score analysisthere is no outlier The PCA map of scores showed that mostof the data points fell into the acceptable data space and thereis no outlier in the studied data set

32 Training-Test Selection The selected training test andvalidation data sets are listed in Table 2 along with thecomputed accuracy criteria The activity ranges for trainingtest and validation sets were 100ndash540 116ndash570 and 149ndash540 respectivelyThe number of data points for the trainingtest and validation sets was 181 36 and 10 respectively

33 Descriptor Selection A total of 1292 descriptors were cal-culated using Dragon 54 softwareThis number was reducedto 244 descriptors after correlation analysis was performedusing a home-developed toolbox The remaining descriptorswere analyzed using the GA-PLS method and the best scor-ing descriptors (descriptors included inTable 3)were selectedfor further analysis using stepwise regression and bivariatecross-correlation studies (the intercorrelation between theselected descriptors was less than 09) After excluding thenonsignificant descriptors the remaining descriptors werechecked to find the possible intercorrelation and the finaldescriptors were selected using a stepwise regression method(see Table 4) The final six selected descriptors were againchecked for intercorrelation Six descriptors are suitable forthe construction of a QSAR model [40]

34 Evaluation of the Selected Descriptors SPAN belongsto the size descriptors which evaluate the dimension ofthe molecule and often calculate it from the moleculargeometry This descriptor is a suitable size descriptor formacromolecules [41] and its selection for the studied dataset with the developed method showed the method to besuitable for descriptor selectionThe correlation study results(Figure 2) showed that bitterness increased upon theenhancement of the SPAN (Figure 2(a)) This finding is inagreement with previous findings [1ndash13] which suggested asignificant correlation between peptide size and bitternessThe investigation of the correlation of peptide subsets withthis descriptor showed less correlations in comparison withthe whole dataset Therefore SPAN can be regarded as ageneral descriptor rather than a specific one for peptidesubsets Thus SPAN can be used in the primary steps ofpeptide bioactive compound discovery to recognize bittercompound fractions

Mean square distance (MSD) index (Balaban) [28 41]contributes negatively to bitterness (Figure 2(b)) MSD de-creases with the increasing of molecular branching in anisomeric set [41] Furthermore it decreases with an increasein the number of atomsThis is in agreementwith the findingson the increase in bitterness following an increase in thenumber of amino acids in the studied peptides The lowercorrelation coefficient between peptide subsets by MSD incomparison with the total dataset showed that MSD isa general bitterness indicator rather than a subset bitternessdescriptor

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

8 BioMed Research International

Table 3 GA-PLS selected descriptors along with descriptive fortraining data set (181 peptides)

Name Range Minimum Maximum Mean Std deviationrdf025p 11270 407 11677 3090 2113mor15u 381 minus190 191 018 065eeig08x 421 minus024 397 216 121mor11m 213 minus121 092 minus015 040rdf025v 10821 400 11220 2952 2028mor15e 373 minus157 216 026 070mor21m 386 minus393 minus008 minus090 064belp7 188 000 188 118 042rdf075m 14442 000 14442 1628 2418mor31p 185 014 198 054 036ggi6 387 000 387 077 074mor11v 254 minus100 155 010 040mor31v 159 011 170 045 031ncs 2900 100 3000 726 550ti1 621722 minus8725 612997 25545 69986e3s 064 000 064 021 010gmti 131885700 34700 131920400 8106404 17363751vep1 818 261 1078 528 163ats1p 288 181 468 312 060smtiv 77560100 34100 77594200 4840738 10318764behv6 284 096 379 282 053alogp 807 minus281 526 038 145rtp a 2563 490 3053 1302 531c002 1800 000 1800 413 338j3d 1070 241 1310 535 246idmt 142611569 18231 142629800 7586086 18440059mats2m 032 minus016 016 004 006mor11p 280 minus129 151 005 043mats2v 142611569 18231 142629800 7586086 18440059hats8u 070 000 070 033 013smti 47759800 16800 47776600 2905336 6331178mor05u 3643 minus3823 minus180 minus1014 651hats8e 068 000 068 034 013mats2e 031 minus015 015 003 006l1u 5345 241 5586 1298 933rbn 4600 100 4700 1161 803h6m 181 000 181 034 036rdf080e 38799 000 38799 3941 6182rtv a 2323 399 2721 1150 482

Bitterness increases with the enhancement of E3s(Figure 2(c)) E3s is one of the accessibility directionalWHIM indiceswhich isweighed by atomic electrotopologicalstates [28 41] Indeed this descriptor defines size shapepolarizability and the conformational properties of the stud-ied molecules together Its correlation with the bitterness isweaker than MSD and SPAN descriptors but its removalfrom the model decreases the correlation coefficient signif-icantly It is not a strong identifier for utilization in drugdevelopment processes

Table 4 Intercorrelation of final descriptors

log(1119879) SPAN Mor11v MSD HATS8u G3p E3slog(1119879) 100SPAN 079 100Mor11v 034 009 100MSD minus081 minus078 minus020 100HATS8u minus056 minus052 minus025 038 100G3p minus063 minus064 minus013 064 030 100E3s 041 022 031 minus038 minus016 minus025 100

G3p the 3rd-component symmetry directional WHIMindex (weighed by polarizability) [28 41] showed a negativecorrelation with bitterness (Figure 2(d)) Along with E3sthese descriptors contribute to the electrical properties of themolecule G3p decreases with the increase in the number ofamino acids In fact more bitter compounds possess largerG3p values Similar to the SPAN and MSD descriptors thecorrelations with subsets are less than the general datasetand this descriptor can be regarded as a general identifierHATS8u (Figure 2(e)) belongs to the GETAWAY descriptorsThese descriptors are molecular descriptors derived fromthe molecular influence matrix (MIM) [12 28] HATSindices known as spatial autocorrelation based descriptorsencode information nonstructural fragments [28 41] Theyare suitable for describing differences in a congeneric seriesof molecules The effective position of substituent andfragments in molecular space and information about themolecular size and shape can be encoded by unweightedHATS indices Moreover they are independent of moleculealignment

In summary GETAWAY descriptors encoded local infor-mation while WHIM descriptors related to the holisticinformation of the molecules thus their joint uses areadvised HATS8u decreases with an increase in molecularsize The negative relation of this descriptor with bitterness isin agreement with the findings about the size and bitternessrelation

3D-MoRSE descriptors (3D molecule representation ofstructures based on electron diffraction) are derived frominfrared spectra simulation using a generalized scatteringfunction [28] Mor11v weighed by van der Waals volumebelongs to these descriptors which can be regarded as anindicator of size mass and volume of the molecules Bydecreasing the size of the peptide the Mor11v values tendtoward zero (Figure 2(f)) In fact the absolute values ofMor11v decrease by the decrease in size of the molecule Theabsolute quantity of it worsens the model and it should beused in the model in the present form including negativeand positive values for different peptides The only similartrend was observed for alogp values of the peptides Thenegative Mor11v values belong to the molecules with neg-ative alogp values (less hydrophobic peptides) 3D-MoRSEdescriptors cannot encode the lipophilicity of the moleculesThe overall correlation of bitterness to this descriptor is posi-tive [28]

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

BioMed Research International 9

0

5

10

15

20

0 1 2 3 4 5 6

SPA

N

y = 21303x + 11869

R2 = 06217

log(1T)

(a)

0

01

02

03

04

05

0 1 2 3 4 5 6

MSD

y = minus00384x + 03733

R2 = 06613

log(1T)

(b)

0

02

04

06

08

0 1 2 3 4 5 6

E3S

y = 00419x + 00847

R2 = 01667

log(1T)(c)

G3p

01

01

02

02

03

0 1 2 3 4 5 6

y = minus00132x + 02095

R2 = 0396

log(1T)

(d)

00 2

0807060504030201

4 6

HAT

S8u

y = minus00744x + 05511

R2 = 0313

log(1T)(e)

0

1

2

15

05

0 2 4 6mor

11v

minus05

minus1

minus15

y = 01342x minus 03043

R2 = 01127

log(1T)

(f)

Figure 2 Correlation of selected descriptors with bitter activity (1198772 is coefficient of determination and is calculated using (A6) of theappendix)

35 Model Building Using Linear Model (MLR) The selecteddescriptors were used to develop six MLR equations contain-ing 1ndash6 descriptors The adjusted 1198772 showed that all descrip-tors improved the model fitting the 1198772 values accuratelydepicted the fitting improvement and the best-fitted modelcould be rewritten as follows

log( 1119879

) = 545 (plusmn063) + 010 (plusmn002) SPAN

+ 032 (plusmn009)Mor11v

minus 788 (plusmn125)MSD

minus 155 (plusmn030)HATS8u

minus 539 (plusmn21)G3p + 092 (plusmn035)E3s

1198772= 081 119865 = 12573

SEP = 008 MPD = 137 (plusmn132)

ND = 181(4)

where SEP stands for standard error of prediction and NDshows the number of data points The MPD for test andvalidation sets were 181 (plusmn155) (119873 = 36) and 169 (plusmn126)(119873 = 10) respectively The relative frequency analysis ofthe prediction errors showed that more than 50 of data canbe predicted by the prediction error of less than 15 whichis acceptable for biological measurements where the meanILRSD for bitter activities of 19 dipeptides were measured bydifferent research groups and were 121 (plusmn107) In additionthe IPD frequency trend (Figure 3) is similar for trainingand test sets The MPD for the peptides by log(1119879) valuesless than 20 was 243 (plusmn222) and for the peptides with

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

10 BioMed Research International

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

trai

ning

set

(a)

010203040506070

MLR SVM ANN

IPD

freq

uenc

y of

test

set

(b)

0

10

20

30

40

50

60

70

MLR SVM ANN

IPD

freq

uenc

y of

val

idat

ion

set

lt15

15ndash30gt30

(c)

Figure 3 IPD frequencies (IPD lt 15 IPD = 15ndash30 and IPD gt 30)of the training (top) test (middle) and validation (bottom) sets forMLR SVM and ANNmodels

log(1119879) more than 20 it was 111 (plusmn79) The highest IPDvalue (1000) in the training set was calculated for GR(log(1119879) = 100) and in the test set it was calculated forPGR (IPD = 654 and log(1119879) = 160) In other wordsthe developed model was not able to predict the bitterness ofless bitter peptides and the selected descriptors are specificfor the evaluation of the bitter peptidesrsquo interactions with thereceptor

LOO cross-validation was done using the PLS toolbox ofMATLAB software The 1199022 value and RMSE were 076 and044 respectivelyThe 1199022 values for LMO cross-validation arereported in Table 3 The ranges of 1199022 values were 061ndash088which was in an acceptable range compared with 1198772 values

Table 5 Leave-many-out cross-validation results for MLR model

Subset 1198772

1198772

adj 1199022

1 080 079 0882 080 079 0873 080 079 0864 080 080 0825 080 080 0826 081 080 0757 081 080 0788 082 081 0769 082 081 08410 082 082 061

Table 6 Chance correlation results

Shuffled 119884 1198772 Shuffled 119884 119877

2

1198841 001 1198846 0021198842 007 1198847 minus0021198843 001 1198848 0031198844 minus002 1198849 0011198845 001 11988410 000

The ranges of corresponding 1198772 and adjusted 1198772 for thedesired subsets were 080ndash082 and 079ndash082 respectively(see Table 5) Chance correlation (119884 randomization) analysiswas done using 10 times shuffled bitter activity and the results(1198772 = 001ndash010) rejected the possibility of fortune correlation(see Table 6)

36 Model Building Using Nonlinear Models (ANN and SVM)and Comparison with the Linear Model (MLR) The sixselected descriptors were introduced to ANN as input valuesand the bitter activity as output data and the networks weredeveloped using the Levenberg-Marquardt algorithm [38]The number of hidden layers was three In addition SVMmodels were developed using STATISTICA 7 software Usingthe training set three parameters of SVM were optimized by10-fold cross-validation The optimized values of 119862 120576 and 120574were 91 007 and 006 respectively The MPD values of theproposed models for training test and validation sets were138 168 and 169 for MLR 130 160 and 150 for SVM and131 161 and 148 for ANN all of which are in acceptableranges and there is no significant difference between thesesubsets The IPD frequencies (Figure 3) revealed that bothSVM and ANN methods produced more accurate resultsespecially for the test sets The plots of experimental versuspredicted values (Figure 4) confirmed the discussed results

37 External Validation of the Proposed Models Statisticalcriteria of external test sets are depicted in Table 7 Theresults show that the proposed models using linear andnonlinear models passed the proposed statistical criteria in

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

BioMed Research International 11

MLR

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

SVM

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

ANN

0

1

2

3

4

5

6

0 2 4 6

Pred

icte

d

Experimental

y = 0785x + 0625

R = 0892y = 0784x + 0596

R = 0899

y = 0800x + 0624

R = 0907

Figure 4 Experimental versus predicted plots for linear and nonlinear models (119877 is calculated using (A7) of the appendix)

Table 7 Statistical parameters for test sets of MLR SVM and ANNmodels

Statistical criteria MLR SVM ANN1198772gt 06 0723 0739 0767(1198772minus 1198772

0) 1198772lt 01 0001 0002 0004

085 le 119870 le 115 0999 1010 098910038161003816100381610038161198772

0minus 11987710158402

0

1003816100381610038161003816lt 03 0109 0113 0106

1198772

119898gt 05 0704 0712 0722

the literature and they are robust and valid for externalprediction

38 Comparison with Previous Model The only similarstudy was published by Kim and Li-Chan [3] They usedthe amino acid three 119911-scores along with three parameters(total hydrophobicity residue number and logmass value) todevelop a PLS model A comparison of the newly developedmodel with their model is summarized in Table 8 It shouldbe noted that the aim of this paper was to develop a generalmodel rather than different models for different subsets andreported correlation coefficients belong to the general modelwhich was computed for subsets The comparison of thecorresponding 119877 values showed that the developed modelcould represent the bitter activity variance of three and morepeptides better than the PLS model while for dipeptides

the PLS model produced more accurate results DevelopedSVM and ANN models resulted in more accurate results forthe total data set compared with both the PLS and MLRmodels (Tables 1 and 5 Figure 4)

4 Conclusion

General MLR SVM and ANN models were developed topredict the bitterness of 229 peptides and amino acidsThe capability of the MLR model to reveal the impact ofeach descriptor on bitter activity was its main advantagewhere more accurate predictions by SVM and ANN madethem suitable models for precise predictions Obviouslyindividual models (ie models developed for each peptidesubset) produced less prediction errors but consideringthe convenience of application of the general models suchmodels are preferred during the primary stages of peptideproduction and evaluation The developed models can beused in nutraceutical and pharmaceutical industries

Appendix

RMSE = radicsum(119884pred minus 119884exp(mean))

2

119873

(A1)

where 119873 denotes the number of data points and 119884pred and119884exp are the predicted and observed log 1119879 (119879 is the bitter

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

12 BioMed Research International

Table 8 Developed MLR model statistics for subsets of peptides compared with a previously developed model

Data set Developed model Previous modelND 119877

21198772

adj 119877 RMSE 119877 (PLS) 1198772 (PLS)

Dipeptidesa 76 051 047 072 041 063 040Dipeptidesb 45 062 056 079 044 091 083Dipeptidesc 47 059 053 077 042 085 072Three peptides 51 065 060 081 038 071 050Tetrapeptides 23 072 062 085 048 090 081Pentapeptides 12 089 080 094 037 088 077Hexapeptides 20 065 048 080 047 075 056Heptapeptides 16 079 068 089 038 095 090Octa-tetradecapeptides 24 057 042 076 044 mdash mdashWhole data setd 227 080 079 089 046 081 066Test and validation sets 46 076 073 087 056 mdash mdashTraining set 181 081 081 090 043 mdash mdashaAverage of experimental values was used when there were different values in different referencesbExperimental data were taken from different referencescExperimental data were taken from [3]dThe 119877 values for whole data set using SVM and ANNmethods are 090 and 091 respectively

threshold concentration (119872)) values

MSSReg =sum SSReg

df (A2)

MSSRes =sum SSRes

df (A3)

where SS is sum of squares MSS is mean square Reg isregression Res is residual and cv is cross validated

MSSTotal = MSSReg +MSSRes (A4)

119865 =

MSSRegMSSRes

(A5)

1198772= 1 minus

MSSResMSSTotal

(A6)

119877 = (1 minus

MSSResMSSTotal

)

05

(A7)

1199022= 1198772

cv = 1 minusMSSRescvMSSTotal

(A8)

Conflict of Interests

The authors certify that there is no financial relation betweenthem and the mentioned commercial identities in the paper

Acknowledgments

Theauthors thank the Biotechnology Research Center TabrizUniversity of Medical Sciences for a partial financial supportunder Grant no 585765

References

[1] A H Pripp and Y Ardo ldquoModelling relationship betweenangiotensin-(I)-converting enzyme inhibition and the bittertaste of peptidesrdquo Food Chemistry vol 102 no 3 pp 880ndash8882007

[2] A H Pripp T Isaksson L Stepaniak T Soslashrhaug and Y ArdoldquoQuantitative structure activity relationship modelling of pep-tides and proteins as a tool in food sciencerdquo Trends in FoodScience and Technology vol 16 no 11 pp 484ndash494 2005

[3] H O Kim and E C Y Li-Chan ldquoQuantitative structure-activityrelationship study of bitter peptidesrdquo Journal of Agricultural andFood Chemistry vol 54 no 26 pp 10102ndash10111 2006

[4] K Maehashi and L Huang ldquoBitter peptides and bitter tastereceptorsrdquo Cellular and Molecular Life Sciences vol 66 no 10pp 1661ndash1671 2009

[5] E G Walsh B E Adamczyk K B Chalasani et al ldquoOral deliv-ery of macromolecules rationale underpinning gastrointestinalpermeation enhancement technology (GIPET)rdquo TherapeuticDelivery vol 2 no 12 pp 1595ndash1610 2011

[6] R C Pinto C Silva N Martinho et al ldquoDrug carriers for oraldelivery of peptides and proteins accomplishments and futureperspectivesrdquoTherapeutic Delivery vol 4 pp 251ndash265 2013

[7] J Renukuntla A D Vadlapudi A Patel et al ldquoApproachesfor enhancing oral bioavailability of peptides and proteinsrdquoInternational Journal of Pharmaceutics vol 447 pp 75ndash93 2013

[8] S Swain ldquoStabilization and delivery approaches for protein andpeptide pharmaceuticals an extensive review of patentsrdquoRecentPatents on Biotechnology In press

[9] K Maehashi M Matano H Wang L A Vo Y Yamamoto andL Huang ldquoBitter peptides activate hTAS2Rs the human bitterreceptorsrdquo Biochemical and Biophysical Research Communica-tions vol 365 no 4 pp 851ndash855 2008

[10] K H Ney ldquoPrediction of bitterness of peptides from theiramino acid compositionrdquo Zeitschrift fur Lebensmittel-Untersu-chung und -Forschung vol 147 no 2 pp 64ndash68 1971

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003

BioMed Research International 13

[11] F Roudot-Algaron ldquoThe taste of amino acids peptides andproteins examples of tasty peptides in casein hydrolysatesrdquoLaitvol 76 no 4 pp 313ndash348 1996

[12] TMatoba and T Hata ldquoRelationship between bitterness of pep-tides and their chemical structuresrdquo Agricultural and BiologicalChemistry vol 36 pp 1423ndash1431 1972

[13] J Wu and R E Aluko ldquoQuantitative structure-activity relation-ship study of bitter di- and tri-peptides including relationshipwith angiotensin I-converting enzyme inhibitory activityrdquo Jour-nal of Peptide Science vol 13 no 1 pp 63ndash69 2007

[14] G Liang L Yang L Kang H Mei and Z Li ldquoUsing multidi-mensional patterns of amino acid attributes for QSAR analysisof peptidesrdquo Amino Acids vol 37 no 4 pp 583ndash591 2009

[15] Z H Lin H X Long Z Bo Y Q Wang and Y Z Wu ldquoNewdescriptors of amino acids and their application to peptideQSAR studyrdquo Peptides vol 29 no 10 pp 1798ndash1805 2008

[16] J Tong S Liu P Zhou B Wu and Z Li ldquoA novel descriptorof amino acids and its application in peptide QSARrdquo Journal ofTheoretical Biology vol 253 no 1 pp 90ndash97 2008

[17] J Yin Y Diao Z Wen Z Wang and M Li ldquoStudying peptidesbiological activities based on multidimensional descriptors(E) using support vector regressionrdquo International Journal ofPeptide Research and Therapeutics vol 16 no 2 pp 111ndash1212010

[18] A Zaliani and E Gancia ldquoMS-WHIM scores for amino acidsa new 3D-description for peptide QSAR and QSPR studiesrdquoJournal of Chemical Information and Computer Sciences vol 39no 3 pp 525ndash533 1999

[19] E R Collantes and W J Dunn III ldquoAmino acid side chaindescriptors for quantitative structure-activity relationship stud-ies of peptide analoguesrdquo Journal of Medicinal Chemistry vol38 no 14 pp 2705ndash2713 1995

[20] S Hellberg L Eriksson J Jonsson et al ldquoMinimum analoguepeptide sets (MAPS) for quantitative structure-activity relation-shipsrdquo International Journal of Peptide and Protein Research vol37 no 5 pp 414ndash424 1991

[21] M Cocchi and E Johansson ldquoAmino acids characterization byGRID and multivariate data analysisrdquo Quantitative Structure-Activity Relationships vol 12 no 1 pp 1ndash8 1993

[22] S Liu C Yin S Cai and Z Li ldquoA novel MHDV descriptorfor dipeptide QSAR studiesrdquo Journal of the Chinese ChemicalSociety vol 48 no 2 pp 253ndash260 2001

[23] S Z Li B Fu Y Wang and S Liu ldquoOn structural param-eterization and molecular modeling of peptide analogues bymolecular electronegativity edge vector (VMEE) estimationand prediction for biological activity of dipeptidesrdquo Journal ofthe Chinese Chemical Society vol 48 no 5 pp 937ndash944 2001

[24] P Zhou Y Zhou S Wu B Li F Tian and Z Li ldquoA newdescriptor of amino acids based on the three-dimensionalvector of atomic interaction fieldrdquo Chinese Science Bulletin vol51 no 5 pp 524ndash529 2006

[25] R Ramos de Armas H Gonzalez Dıaz R Molina M PerezGonzalez and E Uriarte ldquoStochastic-based descriptors study-ing peptides biological properties modeling the bitter tastingthreshold of dipeptidesrdquoBioorganic ampMedicinal Chemistry vol12 no 18 pp 4815ndash4822 2004

[26] G Z Liang P Zhou Y Zhou Q X Zhang and Z L LildquoNew descriptors of aminoacids and their applications to pep-tide quantitative structure-activity relationshiprdquo Acta ChimicaSinica vol 64 no 5 pp 393ndash396 2006

[27] J J Yin ldquoStudy of peptides QSAR based on multidimensionalattributes (E) using multiple linear regressionrdquo Advanced Mate-rials Research vol 345 pp 263ndash269 2011

[28] R Todeschini and V Consonni Methods and Principles inMedicinal ChemistryMolecular Descriptors for Chemoinformat-ics Wiley-VCH Weinheim Germany 2009

[29] Organisation for economic co-operation development ldquoGuid-ance document on the validation of (quantitative) structureactivityrdquo ENVJMMONO 2 2007

[30] A Golbraikh and A Tropsha ldquoBeware of q2rdquo Journal ofMolecular Graphics and Modelling vol 20 no 4 pp 269ndash2762002

[31] A Tropsha ldquoBest practices for QSAR model developmentvalidation and exploitationrdquoMolecular Informatics vol 29 no6-7 pp 476ndash488 2010

[32] P P Roy and K Roy ldquoOn some aspects of variable selection forpartial least squares regression modelsrdquoQSAR and Combinato-rial Science vol 27 no 3 pp 302ndash313 2008

[33] T Scior A Bender G Tresadern et al ldquoRecognizing pitfallsin virtual screening a critical reviewrdquo Journal of ChemicalInformation and Modeling vol 52 pp 867ndash881 2012

[34] D S Cao Y Z Liang Q S Xu H D Li and X Chen ldquoAnew strategy of outlier detection for QSARQSPRrdquo Journal ofComputational Chemistry vol 31 no 3 pp 592ndash602 2009

[35] C Nantasenamat C Isarankura-Na-Ayudhya T Naennaand V Prachayasittikul ldquoA practical overview of quantitativestructure-activity relationshiprdquo Experimental and Clinical Sci-ences International Journal vol 8 pp 74ndash88 2009

[36] R Leardi ldquoApplication of genetic algorithm-PLS for featureselection in spectral data setsrdquo Journal of Chemometrics vol 14pp 643ndash655 2000

[37] S Soltani H Abolhasani A Zarghi and A Jouyban ldquoQSARanalysis of diaryl COX-2 inhibitors comparison of featureselection and train-test data selection methodsrdquo EuropeanJournal of Medicinal Chemistry vol 45 no 7 pp 2753ndash27602010

[38] M Shahlaei ldquoDescriptor selection methods in quantitativestructure-activity relationship studies a review studyrdquoChemicalReviews vol 113 no 10 pp 8093ndash8103 2013

[39] M Jalali-Heravi M Asadollahi-Baboli and P ShahbazikhahldquoQSAR study of heparanase inhibitors activity using artificialneural networks and Levenberg-Marquardt algorithmrdquo Euro-pean Journal of Medicinal Chemistry vol 43 no 3 pp 548ndash5562008

[40] J C Dearden M T D Cronin and K L E Kaiser ldquoHow not todevelop a quantitative structure-activity or structure-propertyrelationship (QSARQSPR)rdquo SAR and QSAR in EnvironmentalResearch vol 20 no 3-4 pp 241ndash266 2009

[41] R Todeschini V Consonni R Mannhold et al Methodsand Principles in Medicinal Chemistry Handbook of MolecularDescriptors Wiley-VCH Weinheim Germany 2003