Premium et Atrium sous Unity Pro - Manuel utilisateur - 07/2016
Thesis (Mohsen Yoosefzadeh Najafabadi) - The Atrium
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of Thesis (Mohsen Yoosefzadeh Najafabadi) - The Atrium
Using Advanced Proximal Sensing and Genotyping Tools Combined
with Bigdata Analysis Methods to Improve Soybean Yield
By
Mohsen Yoosefzadeh Najafabadi
A Thesis
Presented to
The University of Guelph
In partial fulfillment of requirements
for the degree of
Doctor of Philosophy
In
Plant Agriculture
Guelph, Ontario, Canada
© Mohsen Yoosefzadeh Najafabadi, July, 2021
ABSTRACT
USING ADVANCED PROXIMAL SENSING AND GENOTYPING TOOLS COMBINED WITH BIGDATA ANALYSIS METHODS TO IMPROVE SOYBEAN YIELD
Mohsen Yoosefzadeh Najafabadi Advisor:
University of Guelph, 2021 Dr. Milad Eskandari
Improving yield potential in major food-grade crops such as soybean (Glycine max L.) is
the most sustainable way to address the growing global food demand and its security
concerns. Selections for high-yielding cultivars have been mainly focused on the yield
performance per se but not necessarily on secondary related-traits associated with yield.
Recent substantial advances in proximal sensing have provided plant breeders with
affordable and efficient tools for evaluating a large number of genotypes for important
agronomic traits, including yield, at early growth stages. Nevertheless, the implementation
of large datasets generated by proximal sensing such as hyperspectral reflectance in
cultivar development programs is still challenging due to the essential need for intensive
knowledge in computational and statistical analyses. Therefore, this thesis was aimed to:
(1) investigate the potential use of soybean hyperspectral reflectance, hyperspectral
reflectance indices (HVI), and yield components such as number of nodes (NP), number
of non-reproductive nodes (NRNP), number of reproductive nodes (RNP), and number of
pods (PP) per plant for predicting the final seed yield using different machine learning
(ML) algorithms, (2) select the top-ranking hyperspectral reflectance and HVI in predicting
soybean yield and fresh biomass (FBIO) using recursive feature elimination (RFE)
strategy, (3) implement genetic optimization algorithm and the improved version of the
strength Pareto evolutionary algorithm 2 (SPEA2) to optimize yield components and HVI
for maximizing soybean seed yield and FBIO, and (4) study the genetics of soybean yield
and its secondary related-traits in order to discover genomic regions underlying the traits
by using genome-wide association study (GWAS). In this study, different ML algorithms
such as ensemble stacking (E-S), ensemble bagging (EB), and deep neural network
(DNN) were tested to evaluate their efficiency in predicting soybean yield and FBIO
production using a panel of 250 genotypes evaluated in four environments. Also, for the
first time, we implemented ML algorithms in GWAS to detect the associated QTL with
soybean yield components. The results of this study may provide a perspective for
geneticists and breeders regarding the use of ML algorithms in phenomics and genomics
that will result in the selection of superior soybean genotypes.
v
ACKNOWLEDGEMENTS
It is a huge genuine pleasure to express my sincere appreciation to my advisor, Dr. Milad
Eskandari, for accepting me as a student to pursue a Ph.D. degree at the University of Guelph. I
appreciated all support and help that I received from Dr. Milad Eskandari during my Ph.D.
Also, I would like to thank my advisory committee: Dr. Istvan Rajcan, Dr. Dan Tulpan, Dr. Hugh
Earl, and Dr. John Sulik, for their contribution to my project development and throughout my
studies. I would like to acknowledge the technical assistance of Dr. Sepideh Torabi, Mr. Bryan
Stirling, Mr. John Kobler, Mr. Robert Brandt, and all the soybean breeding crew at the University
of Guelph, Ridgetown Campus. Thanks to the University of Guelph, Grain Farmers of Ontario,
and SeCan for their financial support to this project.
I would like to thank my examination committee: Dr. Zenglu Li, Dr. Helen Booker, Dr. John Cline,
Dr. Istvan Rajcan, and Dr. Milad Eskandari, for their careful consideration of this thesis and their
inputs on my work
Special thanks to my friends, Dr. Davoud Torkamaneh, Dr. Soren Seifi, and Dr. Mohsen Hesami,
for their support and help during my Ph.D. I cannot imagine myself without their help in all aspects
of my life.
Finally, very special thanks to my family, especially my partner, Maryam Vazin, for giving me all
courage and love to accomplish my Ph.D. journey. My dear family, none of those steps would
have been possible without your help and support. Thank you for your patience and
understanding.
vi
TABLE OF CONTENTS
Abstract ....................................................................................................................... ii
Dedication ...................................................................................................................iv
Acknowledgements .................................................................................................... v
Table of Contents .......................................................................................................vi
List of Tables............................................................................................................... xi
List of Figures .................................................................................................................... xiii
Chapter 1: Introduction and Literature Review .......................................................... 1
1.1 Introduction ................................................................................................ 1
1.2 Literature Review ..................................................................................... 12
1.2.1 Introduction of soybean ........................................................................ 12
1.2.2 Importance of soybean ..................................................................... 12
1.2.3 Yield enhancement from cultivar development ................................. 13
1.2.4 Yield components ............................................................................. 13
1.2.5 Spectral reflectance .......................................................................... 15
1.2.6 Hyperspectral vegetation indices (HVI) ............................................. 16
1.2.7 Application of HVI for predicting yield ............................................... 18
1.2.8 Measuring other traits through spectral reflectance .......................... 19
1.2.9 Genome-wide association studies (GWAS) ...................................... 20
1.2.10 Conventional Statistical methods used in GWAS ............................. 23
1.2.11 Artificial Intelligence Techniques ...................................................... 34
1.2.12 Optimization algorithm ...................................................................... 35
1.2.13 Application of AI-mediated GWAS analysis in plants ........................ 36
1.2.14 Evaluating thresholds in AI-mediated GWAS analysis ...................... 37
vii
1.3 Thesis Hypothesis and Objectives ........................................................... 40
1.3.1 Hypotheses .......................................................................................... 40
1.3.2 Objectives ......................................................................................... 40
Chapter 2: Application of Machine Learning Algorithms in Plant Breeding: Predicting Yield from Hyperspectral Reflectance in Soybean ........................................................... 42
1.4 Abstract .................................................................................................... 43
1.5 Introduction .............................................................................................. 44
1.6 Materials and Methods............................................................................. 49
1.6.1 Experimental locations and plant materials ...................................... 49
1.6.2 Phenotypic evaluations ..................................................................... 50
1.6.3 Data pre-processing and statistical analyses ....................................... 51
1.6.4 Variable selection ................................................................................. 53
1.6.5 Data-driven modeling ........................................................................... 53
1.6.6 Quantification of machine learning performance .................................. 55
1.6.7 Visualizing and analyzing ..................................................................... 56
1.7 Results ..................................................................................................... 56
1.7.1 Yield Statistics and Spectral Profiles .................................................... 56
1.7.2 Variable selection ................................................................................. 57
1.7.3 Growth stage comparison .................................................................... 58
1.7.4 Comparative analysis of the developed models ................................... 58
1.8 Discussion ............................................................................................... 60
1.9 Acknowledgments .................................................................................... 65
Chapter 3: Using Ensemble bagging and Depp Neural Network Algorithms and Evolutionary Optimization Algorithms for Estimating Soybean Yield and Fresh Biomass Using Hyperspectral Vegetation Indices .......................................................................... 88
viii
2.1 Abstract .................................................................................................... 89
2.2 Introduction .............................................................................................. 90
2.3 Materials and Methods............................................................................. 96
2.3.1 Plant Material .................................................................................... 96
2.3.2 Test sites and experimental designs ................................................. 96
2.3.3 Data acquisition .................................................................................... 97
2.3.4 Data pre-processing and statistical analyses ....................................... 98
2.3.5 Hyperspectral vegetation index (HVI) extraction .................................. 99
2.3.6 Variable selection ................................................................................. 99
2.3.7 Yield and FBIO prediction model calibration and validation ............... 100
2.3.8 Optimization process (SPEA2 algorithm) ........................................... 102
2.3.9 Quantification of model performance and error estimations ............... 103
2.4 Results ................................................................................................... 103
2.4.1 Yield, FBIO, and HVI properties ......................................................... 103
2.4.2 Correlation analysis of HVI vs. soybean yield and FBIO .................... 104
2.4.3 Comparative analysis of the EB and DNN algorithms ........................ 105
2.4.4 Variable Selection .............................................................................. 105
2.4.5 The DNN-SPEA2 optimization algorithm ............................................ 106
2.5 Discussion ............................................................................................. 106
Chapter 4: Application of Machine Learning and Genetic Optimization Algorithms for Modeling and Optimizing Soybean Yield Using its Component Traits .......................... 137
3.1 Abstract .................................................................................................. 138
3.2 Introduction ............................................................................................ 139
3.3 Material and methods ............................................................................ 145
ix
3.3.1 Plant material and experimental design .......................................... 145
3.3.2 Phenotypic evaluations ...................................................................... 146
3.3.3 Data pre-processing, correlation coefficient, and statistical analyses 146
3.3.4 Data-driven modeling ......................................................................... 147
3.3.5 Optimization process via GA .............................................................. 148
3.3.6 Quantification of model performance and error estimations ............... 149
3.3.7 Visualizing and statistical analyzing ................................................... 150
3.4 Results ................................................................................................... 151
3.4.1 Pearson correlation analyses and individual ML evaluations ............. 151
3.4.2 Model performance and evaluation .................................................... 152
3.4.3 Optimization of the soybean seed yield using E-B-GA ....................... 153
3.5 Discussion ............................................................................................. 153
3.6 Acknowledgments .................................................................................. 158
Chapter 5: Identification of genomic regions underlying soybean yield and its components using conventional and machine learning mediated Genome-Wide Association Studies ......................................................................................................... 184
4.1 Abstract .................................................................................................. 184
4.2 Introduction ............................................................................................ 185
4.3 Materials and Methods........................................................................... 190
4.3.1 Population and experimental design ............................................... 190
4.3.2 Phenotyping .................................................................................... 190
4.3.3 Genotyping ..................................................................................... 191
4.3.4 Statistical analyses ......................................................................... 192
4.3.5 Analysis of population structure ...................................................... 193
4.3.6 Association studies ......................................................................... 193
x
4.3.7 Variable Importance measurement ................................................. 195
4.3.8 Data-driven model processes ......................................................... 195
4.3.9 Functional annotation of candidate SNPs ....................................... 196
4.3.10 Visualization ................................................................................... 196
4.4 Results ................................................................................................... 197
4.4.1 Phenotyping evaluations ................................................................. 197
4.4.2 Genotyping evaluations .................................................................. 197
4.4.3 Population structure and kinship ..................................................... 198
4.4.4 GWAS analysis ............................................................................... 198
4.4.5 Identification of candidate genes within QTL .................................. 201
4.5 Discussion ............................................................................................. 203
Chapter 6: General Discussion and Future Directions .......................................... 237
4.6 General Discussion ................................................................................ 237
4.7 Future Directions ................................................................................... 241
References .............................................................................................................. 244
xi
LIST OF TABLES
Table 1.1 The summary of the most recently published papers on the application of GWAS on different plant species. ............................................................................ 26
Table 1.2 The summary of all published papers on using Artificial Intelligence-mediated Genome-Wide Association Studies (GWAS) analysis in soybean. .......................... 37
Table 2.1. The ID, name, and pedigree of the tested genotypes. ................................. 66
Table 2.2 Combined analyses of variances for yield in the tested population across four environments. ......................................................................................................... .72
Table 2.3 Reflectance band ranking using the Recursive Feature Elimination (RFE) strategy at R4 soybean growth stage. ..................................................................... 74
Table 2.4 Reflectance band ranking using the Recursive Feature Elimination (RFE) strategy at R5 soybean growth stage. ..................................................................... 75
Table 2.5 Confusion matrix based on the performance of RF, MLP, SVM, and Ensemble-Stacking model in predicting the soybean yield using full and selected variables in R5 soybean growth stage.............................................................................................. 76
Table 3.1 Summary of the selected hyperspectral vegetation indices (HVI) used in this study. ..................................................................................................................... 113
Table 3.2 Analysis performance of random forest (RF), radial basis function (RBF), and support vector regression (SVR) algorithms for soybean yield prediction using yield component traits. ................................................................................................... 115
Table 3.3 Analysis performance of random forest (RF), radial basis function (RBF), and support vector regression (SVR) algorithms for soybean fresh biomass (FBIO) prediction using yield component traits. ................................................................. 122
Table 4.1 Analysis performance of Random Forest (RF), Multilayer Perceptron (MLP), and Radial Basis Function (RBF) algorithms, and the Ensemble-Bagging (E-B) strategy for soybean yield prediction using yield component traits such as The number of nodes per plant (NP), number of non-Reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods per Plant (PP), and ratio of number of pods to number of nodes per plant (P/N). ................................. 159
Table 4.2 Optimizing the number of nodes per plant (NP), number of non-Reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods per Plant (PP), and ratio of number of pods to number of nodes per plant (P/N) according to the E-B-GA for maximizing soybean seed yield. ............................... 177
xii
Table 5.1 The list of detected QTL for soybean maturity using different GWAS methods in the tested soybean population. .......................................................................... 211
Table 5.2 The list of detected QTL for soybean yield using different GWAS methods in the tested soybean population. .............................................................................. 214
Table 5.3 The list of detected QTL for soybean total number of nodes per plant (NP) using different GWAS methods in the tested soybean population. .................................. 217
Table 5.4 The list of detected QTL for soybean total number of non-reproductive nodes per plant (NRNP) using different GWAS methods in the tested soybean population. .............................................................................................................................. 221
Table 5.5 The list of detected QTL for soybean total number of reproductive nodes per plant (RNP) using different GWAS methods in the tested soybean population. .... 223
Table 5.6 The list of detected QTL for soybean total number of pods per plant (PP) using different GWAS methods in the tested soybean population. .................................. 224
xiii
LIST OF FIGURES
Figure 1.1 The genetic and environmental interaction of secondary traits that are related to the soybean yield. ................................................................................................. 4
Figure 1.2 The scheme of the relationship of yield components and their determining factors. .................................................................................................................... 15
Figure 1.3 The region of each reflectance bands used for constructing HVI. SRa1: Simple ratio (845 nm), SRa2: Simple ratio (905 nm), RARSa: Ratio analysis of reflectance spectra chlorophyll a, NDVI: Normalized difference vegetation index, GNDVI: Green normalized difference vegetation index, RARSc: Ratio analysis of reflectance spectra chlorophyll c, RARSb: Ratio analysis of reflectance spectra chlorophyll b, BNDVI: Blue normalized difference vegetation index, and NPQI: Normalized pheophytinization index ........................................................................................... 17
Figure 1.4 Schematic flowchart of implementing variable importance in AI-mediated GWAS analysis. ...................................................................................................... 39
Figure 2.1 The distribution of soybean genotypes in each yield class. ......................... 78
Figure 2.2 A schematic representation of the machine learning algorithms used in this study to classify the soybean yield using reflectance bands: A) Multilayer Perceptron, B) Support Vector Machine, and D) Random Forest. .............................................. 79
Figure 2.3 The scheme of data collection and machine learning algorithm development and validation. OP: Optimizing parameters; MLP: Multilayer Perceptron; SVM: Support Vector Machine; RF: Random Forest; E-S: Ensemble- stacking strategy. . 80
Figure 2.4 The variation of yield across four environments. ......................................... 81
Figure 2.5 The minimum, mean, and maximum values of each reflectance band were measured for 245 soybean genotypes evaluated at (A) R4 and (B) R5 growth stages at four different field environments. ......................................................................... 82
Figure 2.6 The importance value of selected variables based on the Recursive Feature Elimination (RFE) strategy for soybean reflectance bands measured at R4 (A) and R5 (B) soybean growth stages. ..................................................................................... 83
Figure 2.7 The soybean yield classes vs. the 395 nm reflectance band at R5 growth stages. ..................................................................................................................... 84
Figure 2.8 The accuracy of RF, MLP, SVM, and E-S algorithms for predicting soybean yield using full and RFE selected variables (-VS) measured at R4 (A) and R5 (B) soybean growth stages in four environments. The mean performance was shown as × in each figure. MLP: Multilayer Perceptron; SVM: Support Vector Machine; RF:
xiv
Random Forest; E-S: Ensemble- stacking strategy; RFE: Recursive Feature Elimination. .............................................................................................................. 85
Figure 2.9 Performance and error evaluation of RF, MLP, SVM and E-S model in soybean yield prediction using full and selected reflectance bands from R4 growth stage (VS: variable selection). .................................................................................................. 86
Figure 2.10 Performance of RF, MLP, SVM, and the E-S algorithms for soybean yield prediction using all and selected variables (-VS) from the R5 growth stage. The mean performance is indicated with as × in each figure. MLP: Multilayer Perceptron; SVM: Support Vector Machine; RF: Random Forest; E-S: Ensemble- stacking strategy; RFE: Recursive Feature Elimination. ...................................................................... 87
Figure 3.1 The schematic view of the Deep Neural Network (DNN) algorithm. .......... 129
Figure 3.2 The schematic diagram of the Improved version of the strength Pareto evolutionary algorithms-2 (SPEA2) as multi-objective optimization algorithm. ...... 130
Figure 3.3 The experimental workflow of algorithm selection and validation to predict the soybean fresh biomass (FBIO) and seed yield using hyperspectral vegetation indices (HVI). DNN; deep neural network, EB; ensemble bagging. ................................... 131
Figure 3.4 The variation of A) yield and B) fresh biomass (FBIO) across four environments. ........................................................................................................ 132
Figure 3.5 Pearson correlation analysis of hyperspectral vegetation indices (HVI), soybean seed yield, and fresh biomass (FBIO). The colour coded heatmap scale is provided................................................................................................................. 133
Figure 3.6 Violin plots representing the coefficient of determination (R2), mean absolute errors (MAE), and root mean square error (RMSE) of yield (A, B, and C) and fresh biomass (D, E, and F) performance of the deep neural network (DNN) algorithm and ensemble-bagging (EB) strategy for soybean yield and fresh biomass prediction using hyperspectral vegetation indices (HVI). ................................................................. 134
Figure 3.7 The importance of the tested hyperspectral vegetation indices (HVI) based on the recursive feature elimination (RFE) strategy for predicting soybean seed yield and fresh biomass (FBIO). ........................................................................................... 135
Figure 3.8 Optimized hyperspectral vegetation indices (HVI) values in soybeans with maximized seed yield and fresh biomass (FBIO). ................................................. 136
Figure 4.1 The schematic diagram of the genetic algorithm as the single objective evolutionary optimization algorithm. ...................................................................... 178
xv
Figure 4.2 The experimental workflow of algorithm selection and validation for predicting the soybean seed yield. The number of nodes per plant (NP), number of non-reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods per plant (PP), ratio of number of Pods to number of Nodes per plant (P/N), and genetic algorithm (GA). ........................................................................ 179
Figure 4.3 Pearson correlation analysis of soybean yield component traits at α=0.05. The number of nodes per plant (NP), number of non-reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods per plant (PP), and ratio of number of Pods to number of Nodes per plant (P.N). The colour coded heatmap scale is provided. .................................................................................... 180
Figure 4.4 Model training accuracy for Random Forest (RF), Multilayer Perceptron (MLP), and Radial Basis Function (RBF) algorithms by adding variables based on the correlation results. The number of nodes per plant (NP), number of non-reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods per plant (PP), and ratio of number of Pods to number of Nodes per plant (P/N). .............................................................................................................................. 181
Figure 4.5 (A) coefficient of determination (R2), (B) the Root Mean Square Error (RMSE) and (C) the Mean Absolute Errors (MAE) performance of Random Forest (RF), Multilayer Perceptron (MLP), and Radial Basis Function (RBF) algorithms, and the Ensemble-Bagging (E-B) strategy for soybean yield prediction using yield component traits. The mean performance is indicated with an + sign in each Figure. ............. 182
Figure 4.6 The schematic view of the Radial basis function (RBF) algorithm. ............ 183
Figure 5.1 LD decay distance in the tested 227 soybean genotypes. ........................ 228
Figure 5.2 The distribution of seed yield (A), maturity (B), NP (C), NRNP (D), RNP (E), and PP (F) in 227 soybean genotypes across four environments. The estimated heritability is provided for each of the six traits. RNP: Total number of reproductive nodes per plant, NRNP: The total number of non-reproductive nodes per plant, NP: The total nodes per plant, PP: The total number of pods per plant. ...................... 229
Figure 5.3 The distributions and Pearson correlations among the soybean seed yield, maturity, and yield component traits. RNP: Total number of reproductive nodes per plant, NRNP: The total number of non-reproductive nodes per plant, NP: The total nodes per plant, PP: The total number of pods per plant. The heat map scale for values is provided by colour for the panel. ............................................................ 230
Figure 5.4 Structure and kinship plots for the 227 soybean genotypes. The x-axis is the number of genotypes used in this GWAS panel, and the y axis is the membership of each subgroup. G1-G7 stands for the subpopulation. ........................................... 231
xvi
Figure 5.5 Genome-wide Manhattan plots for GWAS studies of A) maturity and B) seed yield in soybean using MLM, FarmCPU, RF, and SVR methods, from top to bottom, respectively. .......................................................................................................... 232
Figure 5.6 Genome-wide Manhattan plots for GWAS studies of A) the total number of nodes (NP) and B) the total number of non-reproductive nodes (NRNP) in soybean using MLM, FarmCPU, RF, and SVR methods, from top to bottom, respectively. 233
Figure 5.7 Genome-wide Manhattan plots for GWAS studies of A) The total number of reproductive nodes (RNP) and B) the total number of pods (PP) in soybean using MLM, FarmCPU, RF, and SVR methods, from top to bottom, respectively. .......... 234
Figure 5.8 Genome-wide Manhattan plots for GWAS studies of plant height in soybean using MLM, FarmCPU, RF, and SVR methods, from top to bottom, respectively. 235
Figure 5.9 The average effects of reference allele and alternative allele from the detected SNP’s peak for seed yield (A), maturity (B), NP (C), NRNP (D), RNP (E), and PP (F) in 227 soybean genotypes across four environments. RNP: Total number of reproductive nodes per plant, NRNP: The total number of non-reproductive nodes per plant, NP: The total nodes per plant, PP: The total number of pods per plant ....... 236
1
Chapter 1: Introduction and Literature Review
1.1 Introduction
The human population is growing roughly by 1.2% per year, creating a dire need
for increasing the yield production of major crops by more than 50% by 2050 to address
global demand for food security and supplying the needs of over 9 billion people
(Alexandratos and Bruinsma, 2012). Some of the suggested ways for increasing total
yield in major crops, such as soybean (Glycine max L.), are cultivating more lands, better
use of chemical fertilizers and pesticides, and increasing yield potential (Godfray et al.,
2010). Because of the limitation in the availability of farmlands and environmental
concerns associated with using chemical fertilizers and pest and disease controllers,
increasing yield potential is considered as the most reliable and sustainable way of
producing enough food for the fast-growing world’s population (Blaikie and Brookfield,
2015).
Soybean is the world’s most widely grown leguminous crop and an important source
of protein and oil for food and feed (Liu et al., 2008). Soybean is also an important source
of healthy plant-based food products in the human diet with significant health benefits due
mainly to its nutritional and pharmaceutical properties. As a result, global demand for
soybean is increasing significantly. However, the current annual genetic gains in yield
potential are not able to cope with the pace of increasing food demand. One of the main
2
reasons for this steady genetic gain can be inefficient selection criteria used in breeding
programs to select desirable genotypes with increased yield potential.
In general, soybean breeders have used classical phenotypic selection approaches
to evaluate and exploit total seed yield as the main selection criterion for identifying
higher-yielding genotypes. Therefore, over the past several decades, soybean yield has
increased by 21-32 kg per hectare per year (Wilcox 2001). However, genetic gains for
yield were reported to be low, in general, and inconsistent across different environments
(Aparicio et al., 2002). The final seed yield in soybean is the ultimate outcome of the
whole crop growth and development process, in which several morphological and
physiological traits and, therefore, many genes with major and minor effects play
important roles, directly or indirectly (Slafer, 2003). Therefore, one potential way to
increase the genetic gain for yield and facilitate the development of new cultivars with
higher yield potential is to use more efficient selection criteria through analytical breeding
strategies.
An analytical breeding strategy is an alternate breeding approach that focuses on
the improvement of components of complex traits such as plant growth and development
rates and yield components (Richards, 1982a). This strategy can be used for improving
yield potential in a crop by setting selection criteria up on morpho-physiological traits
related to yield and, therefore, make the selection more efficient (Reynolds, 2001). The
limited application of analytical approaches in plant breeding programs for improving yield
3
in major crops have been due partly to lack of or expensive technology to estimate
secondary related traits to yield or limited knowledge about the genetic controls of the
tested traits (Richards, 1982a).
Yield improvements in plants depend on phenology, yield components, adaptive
traits for abiotic and resistance to biotic stresses (Mondal et al., 2016). Yield potential in
soybean can be determined by the following main yield components: the number of nodes
per plant (NP), number of non-reproductive nodes per plant (NRNP), number of
reproductive nodes per plant (RNP), number of pods per plant (PP), and the ratio of
number of pods to number of nodes per plant (P/N) (Pedersen and Lauer, 2004). Seed
size is determined by the rate of seed growth and duration of seed filling, which are mainly
influenced by genetic factors with high heritability (Pedersen and Lauer, 2004). The
number of pods and number of seeds can be regulated by manipulating the crop growth
rate and can be affected by environmental stresses during critical growth stages (He et
al., 2017). Therefore, in order to increase the yield potential in developed cultivars,
improving crop growth rate and minimizing the negative effect of environmental factors
would be of paramount importance (Figure 1.1).
4
Figure 0.1 The genetic and environmental interaction of secondary traits that are related
to the soybean yield.
Measuring yield-related traits in large breeding populations are usually time and
labor-intensive. Therefore, creating and providing researchers with new tools and
techniques to estimate and measure some of these traits in easier and faster ways can
facilitate the breeding process. Spectral reflectance indices (SRI), as one of the high-
throughput phenotyping (HTP) tools, is a new area of research that can be exploited as
an efficient selection tool for improving yield potential and biomass (Araus et al., 2001) in
plant breeding programs. SRI includes a range of formulas, typically based on a sum,
difference, or ratio of two or more wavelengths, that are indicative of important functions
in plant species (Araus et al., 2001; Aparicio et al., 2002; Royo et al., 2003). Spectral
5
reflectance estimated from a canopy is based on this principle that specific traits of a
given plant are associated with the absorption of specific wavelengths of the
electromagnetic spectrum (Ferguson and Rundquist, 2018). Canopy spectral reflectance
is associated with overall leaf area and other photosynthetic components, pigment
composition, and physiological factors of plants (Merlier et al., 2015). Several
physiological parameters, such as environmental stresses, which are more prevailing in
the rainfed area, have also been assessed by spectral reflectance (Araus et al., 2001;
Reynolds, 2001).
The most commonly used hyperspectral vegetation indices (HVI) are the simple
ratio (SR) and normalized difference vegetative index (NDVI) (Araus et al., 2002). Green
biomass, leaf area index, green leaf area index, and a fraction of photosynthetically active
radiation have been reported to be positively correlated with SRI (Wiegand and
Richardson, 1990; Price and Bausch, 1995). Adequate discrimination between high and
low-yielding genotypes of soybeans was reported by using the NDVI (Ma et al., 2001).
According to other studies, Serrano et al. (2000) reported that SR could give a reliable
prediction of wheat yield under different nitrogen levels. Raun et al. (2001) reported that
50% of the yield variability was explained by NDVI in winter wheat under various nitrogen
treatments. Prediction accuracy for grain yield in wheat was increased significantly by
using spectral reflectance indices (Rutkoski et al., 2016b). To conclude, the use of high
throughput phenotyping tools such as HVI can increase the intensity and accuracy of
6
selection as well as response to selections while decreasing phenotyping time and costs
(Muhammed, 2005).
In the past three decades, molecular marker technologies have been developed
and used by breeders to identify desired genomic regions and genes associated with
important quantitative traits to create more efficient molecular-based breeding strategies
that can reduce the negative impact of non-genetic effects on various selections that is
usually involved in classical phenotypic selection methods. Molecular markers associated
with a given complex trait are less influenced by environmental conditions than the
phenotypic values of the trait. Therefore, using reliable molecular markers associated with
a quantitative trait in a breeding program can be more efficient than traditional phenotypic
selection, especially for selection in early generations. Molecular markers and genes
associated with a trait of interest can be identified through different genetic and genomic
approaches, including quantitative trait loci (QTL) mapping using linkage analyses or
genome-wide association study (GWAS). GWAS is one of the most common and exciting
genetic approaches that is mostly used for detecting QTL associated with quantitative
traits (Kaler et al., 2020). The identified QTL or genomic regions may subsequently be
used in marker-assisted selection (MAS) programs for screening large breeding
populations and selecting desirable genotypes in a cost-effective and timely manner
(Xavier et al., 2018). The rationale behind conducting GWAS is to detect QTL associated
with the trait of interest using linkage disequilibrium (LD) – the nonrandom association of
variants at different loci (Bush and Moore, 2012). Over the past two decades, diverse
7
statistical procedures have been developed and used in order to accelerate
computational analyses and improve statistical powers in GWAS in the face of testing
genomewide sets of multiple hypotheses (Tibbs Cortes et al.). These models include, but
are not limited to, the mixed linear model (MLM), multiple loci linear mixed model (MLMM),
compressed mixed linear model (CMLM), enriched compressed mixed linear model
(ECMLM), settlement of MLM under progressively exclusive relationship (SUPER), and
fixed and random model circulating probability unification (FarmCPU) (Tibbs Cortes et al.;
Bush and Moore, 2012). In addition, multiple methods have been proposed for
determining the genome-wide significance levels and thresholds, including the Bonferroni
correction and false discovery rate (FDR) (Bush and Moore, 2012; Kaler et al., 2020). The
application of GWAS was widely studied in different plant species such as: wheat (Eltaher
et al., 2021), maize (Li et al., 2021), soybean (Lin et al., 2020), and sorghum (Somegowda
et al., 2021), and the primary objective of all these studies was to accelerate the breeding
processes using GWAS-derived molecular markers for the indirect selection of superior
genotypes with improved phenotypic values. However, these studies showed that the
successful use of GWAS in uncovering genetic markers governing quantitative traits
relied on selecting appropriate GWAS approaches coupled with accurate experimental
design (Mohammadi et al., 2020).
There are several barriers on the application of conventional statistical methods in
GWAS, using for identifying genomic regions associated with complex traits (Szymczak
et al., 2009). One of the major challenges associated with using the conventional
8
statistical procedures is the “large p, small n” problem that happens in GWAS when these
methods are applied to datasets in which the number of markers (i.e., single nucleotide
polymorphisms (SNPs)) is significantly larger than the number of genotypes (Kaler et al.,
2020; Mohammadi et al., 2020; Xavier and Rainey, 2020). Also, as LD needs to be
considered between a large number of SNPs, conventional statistical methods such as
linear regression may not be well sophisticated for analyzing genome-wide datasets
(Szymczak et al., 2009). It is well documented that current conventional GWAS methods
are only powerful to detect common SNPs with large main effects that can reach the level
of significancy (Lee et al., 2020). Therefore, current conventional GWAS approaches are
underpowered for discovering minor effect SNPs associated with the target traits (Zhou
et al., 2019). Since the successful implementation of GWAS is highly dependent on the
population structure and covariates within the population, using conventional statistical
methods in GWAS requires large and diverse populations in order to detect the QTL with
minor effects (Zhou et al., 2019). Since using very large breeding populations is
expensive in breeding programs, the need for using advanced/sophisticated statistical
models integrated into GWAS is desirable for a better identification of reliable QTL in plant
crops with a narrow genetic basis.
The availability of a large number of plant genetic sequences and phenotyping can
be considered as big data, since they satisfy the following three V’s property: volume,
velocity, and variety. The term big data is referred to the presence of a large dataset that
requires an extensive computational cost to reveal associations, trends, and patterns that
9
may be present in a dataset (Gupta et al., 2020). With the exponential growth of data
using advanced omics-based approaches, efficient analysis of plant datasets using big
data techniques remains a challenge for plant science researchers (Fischer et al., 2020;
Kim et al., 2020b). Efficient analyses of large datasets is significantly dependent on
multiple processes involved in data collection, data processing, and different
management challenges that are identified in the context of big data (Labrinidis and
Jagadish, 2012; Nasser and Tariq, 2015). Data collection and processing challenges
include, but are not limited to, various factors such as volume, variety, velocity, veracity,
volatility, quality, discovery, and dogmatism (Nasser and Tariq, 2015; Al-Abassi et al.,
2020). Therefore, dealing with big datasets such as high-density SNP arrays in GWAS or
hyperspectral reflectance data requires intensive computation and the use of flexible and
efficient mathematical approaches such as artificial intelligence algorithms.
Artificial intelligence (AI) refers to the application of computer technologies to build
intelligent models with minimal human intervention (Hamet and Tremblay, 2017). AI is
broadly used in different areas such as medicine (Topol, 2020), engineering (Sha et al.,
2020), environmental science (Maganathan et al., 2020), and social science (Edelmann
et al., 2020). Recently, in the context of plant science, large investments have been made
for using AI technologies, such as machine learning (ML) algorithms (Correia et al., 2020).
ML is a subset of AI that can be defined as the development of mathematical models that
can learn and educate themselves using available datasets and make decisions
(Alpaydin, 2020). The application of ML algorithms can be considered as an alternative
10
approach to current conventional statistical procedures for analyzing SNP markers in a
data-driven manner. In general, ML algorithms are divided into three main categories:
supervised, unsupervised, and reinforcement learning (Zhang, 2020), which can be used
for analyzing different datasets. Supervised machine learning algorithms can be
employed with labeled datasets, while unsupervised learning algorithms are used to
observe patterns present in unlabeled datasets (Butler et al., 2018).
Regarding the application of different ML algorithms in the area of plant science,
supervised learning methods have drawn the most attention due to the availability of
labeled datasets such as phenotypic data for traits of interest (Chandra et al., 2020). The
development of the predictive classification models using supervised learning algorithms
involves the following five main steps: (a) the determination of the source, type, quality,
and quantity of input training data from an appropriate database (Zhang, 2020), (b) the
use of appropriate feature extraction methods to represent the input data, (c) the training
of the predictive models, (d) the evaluation of the predictive performance on the test
dataset, and (e) the investigation of the importance of features selected for the target trait
(Chandra et al., 2020; Zhang, 2020). The successful use of different ML algorithms for
analyzing high-throughput data was reported in different areas of plant sciences such as
breeding (Yoosefzadeh-Najafabadi et al., 2021b; Yoosefzadeh-Najafabadi et al., 2021c),
biotechnology (Niazian and Niedbała, 2020), genetic engineering (Hesami et al., 2020a),
physiology (Yamamoto, 2019), and precision agriculture (Chlingaryan et al., 2018). More
specifically, the effective use of ML algorithms was reported in omics-based technologies
11
such as genomics (Dumschott et al., 2020; Mahood et al., 2020), phenomics (Esposito et
al., 2020), metabolomics (Tugizimana et al., 2020), transcriptomics (Mahood et al., 2020),
and proteomics (Yadav and Singla, 2020).
Most of the studies on using HVI for predicting phenotypic performance have been
carried out with a small number of genotypes and very specific wavelengths of the
spectrum. As a result, investigation on the use of HVI in a breeding program as a potential
indirect selection criterion has been restricted. The practical use of spectral reflectance
as an indirect selection tool requires the identification of appropriate HVI and growth
stage(s) to detect genotypic differences for grain yield and biomass. Therefore, the
appropriate use of genotyping approaches combined with accurate phenotyping
approaches such as whole reflectance bands or HVI can help to increase prediction
accuracy and enhance genetic gains by improving the genetic background and shortening
breeding cycles.
12
1.2 Literature Review
1.2.1 Introduction of soybean
Soybean (Glycine max L.) is an annual, self-pollinated species that belongs to the
Leguminosae family (Schmutz et al., 2010). This species originated from wild soybean
Glycine soja, with 2n = 40, exhibit normal meiotic chromosome pairing, generate viable,
and fertile hybrids(Kim et al., 2010). Although the exact region of origin of soybean is still
unknown, southern and northeastern parts of China and several other Asian regions are
all candidate sources because G. soja grows naturally in far eastern China, Russia,
Japan, and Korea (Hyten et al., 2006).
1.2.2 Importance of soybean
Soybean consists of not only significant protein and oil contents but also several
phytochemicals such as isoflavones and phenolic compounds (Kim et al., 2012). Also,
soybean contains flavonoids (Wang and Murphy, 1994), which are suitable for reducing
the risk of osteoporosis, heart/cardiovascular diseases, and cancer (Crozier et al., 2009).
Furthermore, phenolic components are abundant in the seed of soybean (e.g., coumaric
acid, caffeic acid, and chlorogenic acid) (Kim et al., 2006), which are very useful for
human health because of their antioxidant activities (Tyug et al., 2010). According to the
fiber of the soybean, they limit high levels of blood sugar and maintain the level under the
threshold (Fabiyi, 2006). Therefore, soybean is known as a critical source for human
13
health benefits because of the pharmacological values and nutritional elements. Thus,
increasing the yield of soybean is of paramount importance.
1.2.3 Yield enhancement from cultivar development
Specht et al. (1999) reported that the genetic gain from soybean cultivar
development varies from region and the time period studied. In another study, Boerma
(1979) compared 18 southern cultivars released from 1942 to 1973 and indicated an
annual yield increase of 13.7 kg ha-1. Although Canadian researchers working with very
early maturing cultivars (0, 00, and 000) reported no genetic gain in yield from 1934 to
1976, yield gain from 1976-1992 was 30 kg ha-1 year-1 thanks to the introduction of cold-
tolerant cultivars in 1976 (Voldeng et al., 1997). In Canada, yields of short-season
soybean cultivars have increased by 0.5% year –1 (Morrison et al., 2000).
1.2.4 Yield components
Several types of research have tried to describe yield improvement in plants via
introducing specific yield components. There is a significant annual increase in seed size
averaging 0.1 g year –1 in soybean that is related to the increase in the yield of this
valuable plant (Parry et al., 2010). However, other studies indicated that the relative
importance of seed number and seed size in explaining greater yield in the cultivar
development process might depend on cultivar and their interaction with the environment
(Okamoto et al., 1995; Cui and Yu, 2005; Kim et al., 2006). Also, there is a positive
14
correlation between soybean yield and pod number (Fabiyi, 2006). Seed number per plant
was reported to be a significant contributor to yield gain in Northeast China (Huang et al.,
2010). More recent studies demonstrated that yield improvement was much more strongly
related to seed number per area than seed size (De Bruin and Pedersen, 2009). The
authors also reported that greater seed number per area appeared to be related to larger
seeds per pod, although other yield components were not examined. Comprehensive
research from China indicated that greater yield occurred via increases in nodes per plant,
pods per plant, and seed per pod (Cui and Yu, 2005). The seed number per plant is
determined by the sequential influences of node number per plant, reproductive node
number per plant, and pod number per plant, all of which being formed during the R1-R6
period (Board et al., 2003; Cui and Yu, 2005; Board et al., 2010; Connor et al., 2011).
Based on the controversy of results from different studies, it appears that yield
improvement can occur through various yield component mechanisms that most of them
are not clearly explored yet (Figure 1.2).
15
Figure 0.2 The scheme of the relationship of yield components and their determining
factors.
1.2.5 Spectral reflectance
Hyperspectral reflectance has been used by researchers to measure different plant
characteristics. Hyperspectral reflectance is based on the principles of light reflectance
and absorbance in the canopy (Kumar and Silva, 1973). The range of visible light from
400 to 700 nm is absorbed by the epidermis and palisade mesophyll due to the presence
of chloroplast and plant pigments that use the light to drive photosynthesis (Woolley,
1971; Kumar and Silva, 1973). Chlorophyll a, chlorophyll b, and plant pigments absorb
light to drive photosynthesis (Ji et al., 2006). Near-infrared (NIR) light is reflected in the
band 700-1300 nm, which is not mainly absorbed by the plant to drive photosynthesis, so
16
when it hits the spongy mesophyll in the plant, it is reflected out of the plant (Woolley,
1971). One of the most vital portions of the plant reflectance cure is the red edge because,
in this region, plant pigments stop absorbing light and begin to reflect it (Cabrera‐Bosquet
et al., 2012). Generally, with these measurements of the light that is reflected by the plant
canopy, chlorophyll content, biomass, yield, abiotic stress, and disease can be estimated
(Ford et al., 1983; Karmakar and Bhatnagar, 1996; Kandel et al., 2016).
1.2.6 Hyperspectral vegetation indices (HVI)
Most applications using HVI measurements rely on ratios of reflectance values
measured from various wavelengths (Figure 1.3). One of the most used reflectance
measurements to estimate total green biomass is NDVI (Deering, 1978; Tudcer et al.,
1978). NDVI estimates biomass by using a relationship between the NIR reflectance and
the reflectance of red light as follows; NDVI= (NIR-Red) / (NIR +Red). In order to achieve
a good perspective of using reflectance indices, Aparicio et al. (2000) used a simple ratio
SR = (NIR/ Red) and NDVI, to estimate grain yield. Based on that study, SR accounted
for 52% and NDVI 59% of the yield variation under dryland conditions. NDVI reflectance
was used to model total yield for the growing season for forage sorghum and accounting
for 80% of the yield variation (Tagarakis et al., 2017). In another study, Royo et al. (2007)
used multiple reflectance wavebands and indices, such as the 550nm, 680nm, water
index (WI), as well as the NDVI and SR to predict grain yield in wheat and indicated that
reflectance at 680 nm explained 24% to 47% of yield phenotypic variation, while water
17
index accounted for 17% to 32 % of the variation in yield. Also, the SR was more useful
in predicting yield accounting for 19% to 35% of the yield variation. However, Ray et al.
(2013) reported that both NDVI and SR failed to predict grain yield accurately in
environments with decreased biomass. Other studies such as Fitzgerald et al. (2010)
used NDVI to select up to 80% of the 20% highest yielding varieties in recombinant inbred
lines (RILs) under irrigation. This study, same as Weber et al. (2012), showed that yield
estimation might be useful for the estimation of yield at early growth stages. Regarding
using HVI in wheat, it can be understood that reflectance measurements were taken at
early grain fill and milk stage in wheat were most informative in estimating yield than those
taken closer to maturity (Aparicio et al., 2000; Brown-Guedira et al., 2000; Board et al.,
2010; Weber et al., 2012).
Figure 0.3 The region of each reflectance bands used for constructing HVI. SRa1: Simple
ratio (845 nm), SRa2: Simple ratio (905 nm), RARSa: Ratio analysis of reflectance
18
spectra chlorophyll a, NDVI: Normalized difference vegetation index, GNDVI: Green
normalized difference vegetation index, RARSc: Ratio analysis of reflectance spectra
chlorophyll c, RARSb: Ratio analysis of reflectance spectra chlorophyll b, BNDVI: Blue
normalized difference vegetation index, and NPQI: Normalized pheophytinization index
1.2.7 Application of HVI for predicting yield
According to the previous study, Marti et al. (2007) investigated wheat yield and
biomass at the milk stage by using NDVI that accounted for 77% of the variation in
biomass and 75% of the variation in wheat yield. Fitzgerald et al. (2010) used NDVI as a
selection tool in wheat breeding to select 20-80% of the top 20% best performing varieties
in a three-year study. In another study, Royo et al. (2007) reported that reflectance
measurements taken at the milk stage of wheat development were the most accurate
predictive for yield. Also, Aparicio et al. (2000) used reflectance measurements to
characterize the yield of durum wheat in both irrigated and dryland plots, and they found
that dryland plots explained more of the phenotypic difference in yield when compared to
irrigated treatments. Therefore, they used water stress indexes created from reflectance
measurements to estimate the effects of water stress on wheat yields and were able to
explain up to 87% of the phenotypic variation in yield. However, Weber et al. (2012)
applied a partial least squares model for estimating corn yield from reflectance spectra
but were only able to account for 40% of the phenotypic variation in yield, which would
limit the effectiveness of the tool for selection due to the large portion of unexplained
19
variation. On the contrary, Sakamoto et al. (2013) used reflectance data from satellites to
estimate corn yields and accounted for approximately 75% of the yield variation at a
regional level. Ma et al. (2001) were the first to use spectral reflectance measurements
to quantify grain yield in soybean, using historical lines with known yield differences and
found that by measuring NDVI taken at the R5 growth stage explained from 45% to 80%
of the phenotypic variation in soybean yield, depending on environment and year (Ma et
al., 2001). They suggested that measurements taken at R5 were the most informative
and reduced the effects of soil reflectance on the measurement (Ma et al., 2001). Also,
Christenson et al. (2016) obtained similar results when examining soybean for differences
in spectral reflectance as a predictor of grain yield. Their models explained 44% of the
variation in yield when derived from individual waveband measurements and 58% of the
yield variation when the model was based on reflectance indexes.
1.2.8 Measuring other traits through spectral reflectance
Light in the NIR region does not drive cellular function but has been shown to be
related to plant water content (Fukushima, 1981). Cellular structure in the plant leaf also
affects the path of light as it travels through the palisade mesophyll cells where most
photosynthesis occurs. When the light reaches the spongy mesophyll, the remaining
photosynthetic active light is absorbed, and the NIR is scattered and reflected due to the
air in the leaves (Kumar and Silva, 1973).
20
Spectral reflectance has been used to identify disease symptoms in crops (Adee,
2015). Muhammed (2005) used reflectance to estimate the fungal disease severity of
wheat, accounting for 95% of the variation between disease estimates and observed
ratings. Furbank and Tester (2011) used reflectance to quantify powdery mildew in wheat
and reported that this measurement could identify 95% of infected plants and 78% of
healthy plants. According to another study, Vigier et al. (2004) used spectral reflectance
to estimate Sclerotia stem rot in soybean. Nutter Jr et al. (2002) used spectral reflectance
to estimate the initial infection of soybean with soybean cyst nematode (SCN), yield,
protein, and oil. By using reflectance at 810nm, they could account for 48% of the initial
SCN population in the soil, which was shown to negatively influence the yield. Bajwa et
al. (2017) used spectral reflectance to characterize SCN severity in soybean, and they
used models with individual wavelengths and spectral indices to estimate disease
severity. Based on their results, reflectance models identify 94% of healthy plants but
were less successful at predicting diseased plants. Fletcher et al. (2014) examined
charcoal root rot in soybean relationship with spectral reflectance, and they reported an
overall decrease in reflectance with increased disease pressure across all wavebands,
with reflectance in the NIR most correlated to pathogen content in the plant.
1.2.9 Genome-wide association studies (GWAS)
Association mapping, also known as LD mapping, has been brought up to
investigate the traits down to the sequence level by exploiting the variation and extent of
21
LD within the population (Nordborg and Tavaré, 2002). Association mapping is divided
into two categories, which are candidate-gene association mapping and GWAS analysis.
Candidate-gene association is usually adopted if candidate genes for targeted traits are
available. On the other hand, GWAS, also known as a genome scan, searches the whole
genome comprehensively for causal genetic variation. Due to insufficient DNA markers
in early association studies, the candidate-gene association was used to identify SNPs
controlling the targeted traits (Bao et al., 2014). The occurrence of high throughput DNA
sequencing technology provides a platform for GWAS. GWAS was conducted in soybean
to identify markers linked to iron deficiency chlorosis (Mamidi et al., 2011), chlorophyll
(Hao et al., 2012), flowering time, maturity dates, and plant height (Zhang et al., 2015c).
An important factor to be considered when using GWAS is the extent of LD. LD
refers to the degree of non-random association of alleles at different loci. LD is affected
by numerous factors, including recombination events, drift, selection, reproduction mode,
and population admixture (Flint-Garcia et al., 2003; Gaut and Long, 2003). The factors
such as inbreeding, small population size, low recombination rate, population admixture,
natural and artificial selection lead to an increase in LD. Other factors, including
outcrossing, high recombination rate, and high mutation rate, lead to a decrease in LD
(Gupta et al., 2005). The two most commonly used statistics to describe LD are r2 and D’.
The r2 is considered as the square of the Pearson’s correlation coefficients between two
loci, and the D’ statistic is the partially normalized D value based on the observed
haplotype frequency (Hill and Robertson, 1968). The structure of LD across the genome
22
determines the resolution of association mapping. Soybean has a low level of genome-
wide LD decay rates and r2 decays to less than 0.10 at a genetic map distance of more
than 2.5 centimorgan (cM) (Hyten et al., 2007). The LD extends from 90 to 574 kb in G.
max due to domestication and increased self-fertilization (Hyten et al., 2007). A high
marker density is required in the regions with low LD for GWAS (Hwang et al., 2014). One
of the major uses of LD is to study marker-trait association in plant genomics research.
Compared to traditional linkage analysis, LD-based association mapping provides a more
precise location of QTLs. Mapping a QTL within a narrow chromosome region is possible
through LD, but not by linkage analysis because recombination within a narrow
chromosome region is not always available in a mapping population (Mackay, 2001).
Also, regression analysis is used to measure the LD between a marker and a QTL;
significant regressions indicate the association between the marker and phenotype
(Remington et al., 2001). The regression analysis can also be conducted by testing the
association between marker haplotypes and phenotype. A significant association
between the marker haplotypes and phenotypic effects provides more powerful evidence
for the presence of a QTL (Hayes and Goddard, 2001). However, GWAS are confronted
by the problem of spurious association due to population structure and kinship
relatedness. To control the false association error rate, some statistical models have been
proposed. General linear model (GLM)-based methods including structured association
(Pritchard and Donnelly, 2001), genomic control (Devlin and Roeder, 1999), family-based
tests of association (Thomson, 1995), and principal component analysis (Price, 2006),
23
were initially used. Mixed linear model (MLM)-based methods such as the unified mixed-
model method (Yu et al., 2005), compressed MLM (Zhang et al., 2010), efficient mixed-
model association (Zhou and Stephens, 2012), and multi-locus mixed model (Segura et
al., 2012) have been used to correct for genetic relatedness and population structure and
have successfully increased the computational speed and reliability.
1.2.10 Conventional Statistical methods used in GWAS
1.2.10.1 Single-SNP test
In order to facilitate comparisons between inheritance effects of genotypes, the
conventional single-SNP testes should be established. As an example, assume G and A
as the mutant wild-type alleles, respectively, from a diploid plant. A recessive model
makes a comparison between GG with AG+AA. An additive or co-dominant model will
consider GA in between AA and GG, while a dominant model reduces the comparisons
among AA, AG, and GG to between GG + AG and AA (Sun et al., 2020a). Therefore,
different models can make a significant difference in the result of GWAS (Emily, 2018).
As one of the common recommendations, it is necessary to screen the SNPs in the
population by using the co-dominant model and, then, conduct an GWAS test by
implementing all genetic models with the use of the SNPs with significant effects (Emily,
2018). After selecting an appropriate genetic model, different statistical tests such as
Fisher’s exact test, the odds ratio test, the analysis of variance (ANOVA), the chi-squared
test, the t-test, and Cochran–Armitage’s trend test can be used for the single-SNP tests
24
(Nakaoka and Inoue, 2009; Sun et al., 2020a). Each of the statistical tests is well
explained in Sun et al. (2020a). Linear models such as MLM, CMLM, ECMLM, and
SUPER are usually implemented as single-SNP tests in GWAS without the necessity of
determining the genetic models. It is possible to add systematic genetic differences and
covariates in the linear models for removing the spurious associations produced by
sampling errors and biases in study design. However, the statistical power of the test will
be decreased by allocating additional degrees of freedom (Sun et al., 2020a).
1.2.10.2 Multi-SNP test
Since complex traits are controlled by various QTL, single-SNP tests fail to extract
the joint effects of multiple alleles (Yang et al., 2012). Therefore, multi-SNP tests are
required to enhance the accuracy of GWAS analyses. However, the main problem of
multi-SNP tests arises from “large p, small n” datasets (Yang et al., 2012). One of the
probable solutions is the use of Bayesian, penalized, or stepwise approaches to provide
SNP tagging (Sun et al., 2020a). SNP tagging is the process of selecting a subset of
SNPs that store the maximum information from the full set of SNPs based on their LD
values (Sun et al., 2020a). LD values are used for detecting SNPs with the same
information and overlapping effects on the target traits (Yang et al., 2012). While different
methods have been developed based on the level of LD between two given SNPs, the
consistency of tagging between methods has low power and, therefore, those methods
need to be selected carefully for GWAS analyses in different plant species and from a
25
study to another (Ding and Kullo, 2007). The major drawback for stepwise regression in
GWAS is that this method is mostly overfitted, and the heritability estimations through this
method are not accurate (Harrell Jr, 2015; Sun et al., 2020a). Bayesian methods,
including Bayesian penalized regressions (Banerjee et al., 2018; Sun et al., 2020a) and
Bayesian variable selection (Banerjee et al., 2018; Zhao et al., 2019), have the advantage
of being sensitive to SNPs with small effect because they are free from the multiple testing
problem (Stephens and Balding, 2009; Zhao et al., 2019). However, Bayesian methods
also need prior knowledge, such as the distribution of the effect of the trait of interest and
probability of SNPs that are associated with it in the tested population (Stephens and
Balding, 2009).
1.2.10.3 Application of GWAS analysis in plants
Genome-wide association studies are currently considered as the common step to
discover QTL associated with the traits of interest. However, more and more research is
required to investigate the transferability and reproducibility of GWAS results across
different genetic backgrounds and environments (Mohammadi et al., 2020). The recent
application of GWAS in different plant species such as soybean, wheat, dry bean, and
maize is represented in Table 1.1. Many of the previous studies have reported
inconsistent QTL using GWAS for complex traits in different genetic backgrounds and
across different environments (Table 1.1). Over the past decade, thousands of QTL were
identified using different conventional GWAS methods (MacArthur et al., 2017). However,
26
a limited number of detected QTL are currently used for marker-assisted selection in plant
breeding programs due mainly to their inconsistent effects on the traits (Mohammadi et
al., 2020).
Table 0.1 The summary of the most recently published papers on the application of GWAS on different plant species.
Plant Species
Purpose Trait(s) studied
Methods
Marker types used
Number of markers used
Main findings
Reference
Soybean
Detection of genomic regions controlling storage protein subunits
Glycinin (11S), β-conglycinin (7S), sum of glycinin and β-conglycinin (SGC), the ratio of glycinin to β-conglycinin (RGC)
CMLM
SNP 207,608
Three stable QTL detected for glycinin (11S), one for β-conglycinin (7S), one for SGC, and one for RGC.
Zhang et al. (2021b)
Discovery of genomic regions underlying important diseases in soybean
Resistance to the root‐knot nematode
CMLM SNP 44,992
Four SNPs were associated with root-knot nematode and located in the same region of chromosome 13
Alekcevetch et al. (2021)
Identification of genomic regions related to
Sucrose CMLM SNP 33,149
Five quantitative trait nucleotides (QTNs)
Sui et al. (2020)
27
soluble sugar
associated with sucrose concentration were detected in two or more environments.
Detection of genomic regions associated with the soybean fatty acids
Oleic acid content
GLM MLM, CMLM FAST-LMM
SNP 2,311,337
There were not detected any environmentally stable QTL.
Liu et al. (2020)
A better understanding of the major determinants of soybean yield by using single and multi-locus GWAS methods
Seed weight
SLM MLM
SNP 61,166
Nine SNPs were detected through two MLMs in at least three environments or at least through MLM in at least three environments as well one of SLMs in one environment.
Karikari et al. (2020)
Identification of genomic regions associated with the seed sulfur amino acids content
Cysteine and methionine content
MLM
SNP 56,000 Nine QTL reported.
Malle et al. (2020)
Wheat
Selection of genetic loci conferring
Resistance to leaf rust
MLM SNP 12,931 Seven QTL for leaf rust and six QTL
Zhang et al. (2021a))
28
fungus resistance
and stripe rust
for stripe rust were identified in multiple environments.
Genetic improvement of root system architecture to improve water and nutrient use efficiency of crops to increase their productivity under stress soil conditions
Total root length (TRL), total root number (TRN), root growth angle (RGA), average root length (ARL), bulk root dry weight (RDW), individual root dry weight (IRW), bulk shoot dry weight (SDW), presence of six seminal roots per seedling (RT6) and root shoot ratio (RSR).
MLM SNP 10,789
Eight QTL for TRN, six for SDW, six for IRW, three for TRL, three for RT6, two for RDW, one QTL for RGA, ARL, and RSR were detected.
Alemu et al. (2021)
Facilitating the incorporation of nematode resistance sources into new high-
Resistance to cereal cyst nematode
MLM SNP 9,427
Seven additive QTL were identified, and these QTL explain a maximum of 9.42%
Dababat et al. (2021)
29
yielding cultivars
phenotypic variation.
Understanding the genetic basis of salt tolerance for breeding and selecting new salt-tolerant cultivars.
Plant height, spike number, spike length, grain number, thousand kernels weight, yield, and biological mass
MLM SNP 387,657
Eleven QTLs were associated with plant height, spike number, spike length, grain number, thousand kernels weight, yield, and biological mass under different salt treatments
Hu et al. (2021)
Identification of genetic loci associated with grain yield traits, and quality traits.
Protein content, Starch content, Moisture, and Zeleny sedimentation value
SLM SNP 9,290
Two SNPs on chromosome 6A and one SNP on chromosome 1B were significantly associated with quality trait moisture, as well as one SNP located on chromosome 5B associated with starch content in the seeds.
Tsai et al. (2020)
Identification of genomic
Resistance to Karnal
MLM SNP 6,382
Fifteen significant SNPs on
Singh et al. (2020)
30
regions conferring resistance to bunt and smut Diseases.
Bunt (KB) disease
chromosomes 2D, 3B, 4D, and 7B that were associated with KB resistance in an individual year
Dry Bean
Improving consumer acceptance and sustainability in cooked beans
Flavor intensity, beany, earthy, starchy, bitter, seed-coat perception, and cotyledon texture
BLINK MLM
SNP 31,273 Six SNPs across three chromosomes for flavor intensity, five SNPs associated with beany across four chromosomes, three SNPs across two chromosomes for earthy, one SNP for starchy, one SNP for bitter, three SNPs across two chromosomes for seed-coat perception, and two SNPs across two chromosomes for cotyledon texture were identified
Bassett et al. (2021)
Identifying genomic
Root surface
GLM CMLM
SNP 3,783,033
2542 and 408 significant
Wu et al. (2021)
31
regions associated with drought resistance on root traits
area, root average diameter, root volume, total root length, taproot length, lateral root number, root dry weight, lateral root length, special root weight/length, using seed germination pouches under drought conditions were evaluated for root traits
SNPs were detected by GLM and CMLM, respectively. 389 associated SNPs were detected by both models.
Genome rigon associated with resistance to anthracnose in climbing beans.
Resistance to anthracnose race 73
MLM SNP 78,754 Significant SNPs (S07_8,726,264 and S04_527,782) were found associated with resistance on chromosome four and seven
Maldonado-Mota et al. (2020)
Detection of genomic region controlling
White mold MLM FAST-MLM
SNP 9070 Fifteen QTL associated to white mold were identified
Compa et al. (2020)
32
white mold on dry bean.
Maize
Detection of genomic regions controlling the dry matter accumulation and partitioning.
Organ’s dry matter, dry matter, and organ’s dry matter partition coefficient
MLM SNP 779,855 Twenty-two, Twenty, and fourteen SNPs were detected for the leaf dry matter, total aboveground dry matter, and sheath dry matter, of the V3 stage, respectively. Moreover, 27, 25, 22, and 21 SNPs were identified for the leaf dry matter, total aboveground dry matter, sheath dry matter, and stalk dry matter of the V6 stage, respectively.
Lu et al. (2021)
Detection of genomic regions controlling Northern Maize Leaf Blight (NCLB).
NCLB resistance
MLM SNP 955,690 Twenty-two SNPs related to NCLB were detected which most of them (17 SNPs) were located in the same chromosome.
Rashid et al. (2020)
Detection of genomic regions
Fusarium ear rot resistance
MLM SNP 560,000 Thirty-four SNPs related to Fusarium
Yao et al. (2020)
33
controlling Fusarium ear rot.
ear rot were detected
Detection of genomic regions controlling stemborer.
Egg-induced parasitoid attraction
GLM ML
SNP 316,127 101 MTAs related to terpene biosynthesis, JA-defence pathway, benzoxazinone synthesis, and resistance genes were identified.
Tamiru et al. (2020)
Detection of genomic regions controlling grain drying rate (GDR)
GDR MLM SNP 217,933 Seven SNPs and six candidate genes related to the GDR were discovered
Jia et al. (2020)
Detection of genomic regions controlling odor and aroma
Odor and aroma components
MLM SNP 49,585 Eighty SNPs and 64 genomic regions related to the 15 ubiquitous volatiles were detected.
Alves et al. (2020)
Detection of genomic regions controlling the architecture of corn husk tightness
Husk layer number (HLN), husk length (HL), husk width (HW), husk thickness (HT), and husk tightness (HTI)
MLM SNP 1,253,814
Twenty-seven candidate genes related to HTI involved in husk morphogenesis, husk senescence, and cell signal transduction were detected.
Jiang et al. (2020a)
34
1.2.11 Artificial Intelligence Techniques
Recently, several studies recommended that there is a significant need to improve
the methodology of GWAS that would be suitable for a wide range of research disciplines
(Tibbs Cortes et al.; Lee et al., 2020; Sun et al., 2020a). Improving the GWAS
methodologies can be done by using any of the ML approaches in different plant species.
In these machine learning algorithms, SNPs and the phenotypic traits of interest will be
considered as input and output variables, respectively.
1.2.11.1 Machine Learning (ML) Algorithms
ML algorithms are emerging computational strategies in plant science studies with
an unprecedented capacity for analyzing complex nonlinear and multivariable systems,
which are frequently observed in plant biological processes (Hesami and Jones, 2020a).
The successful application of ML in different fields of plant science has been extensively
Detection of genomic regions controlling the white spot
White spot resistance
MLM SNP 355,972 Nine SNPs related to white spot, which of five of them co-localized on previously reported QTLs were identified.
Rossi et al. (2020)
BLINK: bayesian-information and Linkage-disequilibrium Iteratively nested keyway; MLM: mixed linear model; CMLM: Compressed mixed linear model; SLMs: Two single-locus models; SNP: Single-nucleotide polymorphism; FAST-LMM: Factored spectrally transformed linear mixed models; GWAS: Genome-wide association studies; QTL: Quantitative trait loci (QTL) : QTN: Quantitative trait nucleotides; MTAs: Marker-trait associations.
35
reviewed by Silva et al. (Silva et al., 2019). In general, ML can be considered as a group
of algorithms that are used to understand underlying patterns present in big datasets
(Dias and Torkamani, 2019). In different areas of plant science (e.g., phenomics,
genomics, and transcriptomics), ML concepts can be organized into four groups: (i)
identification, (ii) classification, (iii) quantification, and (iv) prediction (Wang et al., 2020).
These four groups are categorized as a sequence of feature extraction steps where the
dataset growingly provides more information. In GWAS, two approaches should be
satisfied to obtain optimum results: reliability and accuracy (Eraslan et al., 2019).
Reliability relates to the degree to which assessments of the same parameters achieved
under various conditions generate similar outcomes. Indeed, reliability refers to the
magnitude of the error assessment in the observed assessments to the inherent variation
in the ‘true’ sample (Eraslan et al., 2019). Accuracy refers to the closeness of a predicted
value to the actual data, which is greatly affected by the raters of experience levels
(Eraslan et al., 2019). Generally, by removing the raters fatigue, reducing bias, and using
appropriate feature extraction techniques, both the accuracy and reliability of the ML
algorithms can be greatly improved (Hesami and Jones, 2020a).
1.2.12 Optimization algorithm
Recent studies have used a genetic algorithm (GA) to reduce computational
volumes (Arab et al., 2018). GA, as the best-known optimization algorithm, causes to
achieve the optimal solutions with minimal computing. On the other hand, plant
36
phenomics challenges must satisfy various objective functions by considering different
constraints. However, GA, as a single-objective algorithm, cannot optimize multi-objective
functions, simultaneously (Bozorg-Haddad et al., 2016). Therefore, the multi-objective
algorithm has been required for the optimization of outputs. Classical optimization
methods consist of multi-criterion decision-making methods providing the converting
model of the multi-objective optimization problem to a single-objective optimization
problem by emphasizing one particular Pareto-optimal solution at a time. Considering this
method for multiple solutions, it has to be applied so many times for finding various
solutions at each simulation run (Bozorg-Haddad et al., 2016). The Non-dominated
Sorting Genetic Algorithm-II (NSGA-II) and SPEA2 have been known as the common
evolutionary multi-objective optimization algorithms, attempting to find the solution
domain for discovering Pareto-optimal solutions within a multi-objective centered scheme
(Wang et al., 2011). Based on our knowledge, there is a lack of comprehensive study in
linking ML algorithms to optimization algorithms such as genetic algorithm (GA) and
SPEA2 for modeling and optimizing a complex system of yield components and HVI in
order to predict maximum potential yield, as an optimal solution.
1.2.13 Application of AI-mediated GWAS analysis in plants
To the best of our knowledge, only a few studies were conducted regarding the use
of ML algorithms in GWAS for detecting the QTL associated with traits of interest (Table
1.2). These studies were focused on using the random forest algorithm to detect SNP
37
markers associated with some quantitative traits. However, the potential use of different
ML algorithms for improving the statistical power of GWAS in the area of plant science is
still unknown. Therefore, further research and study is required to select the optimal
GWAS method suitable for each plant species and the traits of interest based on their
genetic bases and environmental dependency.
1.2.14 Evaluating thresholds in AI-mediated GWAS analysis
In AI-mediated GWAS analyses, the significance levels or thresholds for identifying
SNPs-trait associations are estimated using variable importance methods, which are
different from statistical methods that are used for estimating p-values in conventional
GWAS (Szymczak et al., 2009). The main advantage of using variable importance, rather
than p-values, for individual SNP-trait GWAS tests, consists in the ability of these
Table 0.2 The summary of all published papers on using Artificial Intelligence-mediated Genome-Wide Association Studies (GWAS) analysis in soybean.
Purpose Trait(s) studied Methods Marker types used
Number of
markers used
Reference
Genetic architecture of yield-related traits
Grain yield; Number of pods; number of nodes; number of days to maturity; number of pods to nodes ratio
Random forest
SNP 4,240 Xavier and Rainey (2020)
Explore minor QTL by using
soybean branching Random Forest boosting
SNP 42,080 Zhou et al. (2019)
QTL: Quantitative trait loci; SNP: Single-nucleotide polymorphism.
38
approaches to consider the interaction effects between SNPs (Szymczak et al., 2009;
Asif et al., 2020). Conventional GWAS are appropriate approaches for detecting SNPs
with large main effects on complex traits; however, they are underpowered for the
simultaneous consideration of a wide range of interconnected biological processes and
mechanisms that shape the phenotype of complex traits (Lee et al., 2020). Recent studies
showed that SNPs with high importance scores are not necessarily the SNPs with
significant p-values resulted from single SNP analyses (Arshadi et al., 2009; Grömping,
2009; Szymczak et al., 2009; Ziliak, 2017; Di Leo and Sardanelli, 2020). Therefore, using
variable importance values in ML algorithms for identifying SNP-trait associations may
improve the power of AI-mediated GWAS for discovering variant-trait association with
higher resolution (Szymczak et al., 2009). The variable importance methods based on
linear and logistic regressions, support vector machines, and random forests are well
established in the literature (Grömping, 2009; Wu and Liu, 2009b; Chun and Keleş, 2010;
Williamson et al., 2020; Yoosefzadeh-Najafabadi et al., 2021b). The schematic use of
variable importance is depicted in Figure 2. Tang et al. (2009) have introduced a new
genetic-level importance measure based on scaled variable importance for SNPs, where
the importance value of each SNP was defined as the mean or maximum importance of
all SNPs in an GWAS panel (Tang et al., 2009). Here, we recommend the use of a
percentage scale for estimating variable importance. Since there is no confirmed way for
defining the significant threshold in using AI-mediated GWAS, the global empirical
threshold can be implemented for providing the empirical distribution of the null
39
hypothesis (Churchill and Doerge, 1994; Doerge and Churchill, 1996). In an AI-mediated
GWAS method, Xavier and Rainey (2020) estimated the global empirical threshold for
genetic dissection of soybean yield components based on fitting the algorithm, storing the
highest variable importance, repeating the process 1000 times, and eventually selecting
the SNPs based on a confidence level α=0.05. One of the limitations of this method is
that the number of repetitions and the level of significance (α) should be optimized based
on: the genetic background of the target plant species, the number of molecular markers,
the distribution of the phenotypic data for the target trait, and the number of data points.
Figure 0.4 Schematic flowchart of implementing variable importance in AI-
mediated GWAS analysis.
40
1.3 Thesis Hypothesis and Objectives
1.3.1 Hypotheses
1) The variation of yield in the population, among the soybean genotypes, can be partially
explained by the secondary yield-related traits.
2) Some of the secondary yield related traits can be employed to be used as inputs for
different machine learning algorithms to predict complex traits such as yield at early
growth stages.
3) Optimization algorithms can be used in soybean breeding programs to estimate the
optimum values of secondary yield related traits in cultivars with maximized yield and
fresh biomass.
4) The phenotypic variations of yield and its related traits are partially explained by genetic
variation among the soybean genotypes, which can be detected to some extent using
appropriate genetic and genomic tools.
1.3.2 Objectives
1) Study the relationship between seed yield and some of important secondary traits in a
panel of 250 soybeans in the field.
41
2) Investigate the potential use of soybean yield components and hyperspectral
reflectance for predicting the final seed yield using individual machine learning algorithms
as well as ensemble learning methods.
3) Link the best machine learning algorithm with optimization algorithm to estimate the
optimal values of the yield components and hyperspectral vegetative indices in soybean
genotypes with maximized yield.
4) Discover genomic regions associated with the target traits using genome wide
association study (GWAS) based on conventional statistical methods and machine
learning algorithms.
42
Chapter 2: Application of Machine Learning Algorithms in
Plant Breeding: Predicting Yield from Hyperspectral
Reflectance in Soybean
Published in Frontiers in Plant Science Journal
Mohsen Yoosefzadeh-Najafabadi,1 Hugh Earl,1 Dan Tulpan,2 John Sulik,1 Milad
Eskandari1
1. Department of Plant Agriculture, University of Guelph, Guelph, ON N1G 2W1, Canada
2. Department of Animal Bioscience, University of Guelph, Guelph, ON N1G 2W1,
Canada
*Corresponding author. Email: [email protected]
Yoosefzadeh-Najafabadi, M., Earl, H. J., Tulpan, D., Sulik, J., & Eskandari, M. (2021). Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. Frontiers in plant science, 11, 2169. https://doi.org/10.3389/fpls.2020.624273
43
1.4 Abstract
Recent substantial advances in high-throughput field phenotyping have provided
plant breeders with affordable and efficient tools for evaluating a large number of
genotypes for important agronomic traits at early growth stages. Nevertheless, the
implementation of large datasets generated by high-throughput phenotyping tools such
as hyperspectral reflectance in cultivar development programs is still challenging due to
the essential need for intensive knowledge in computational and statistical analyses. In
this study, the robustness of three common machine learning (ML) algorithms, Multilayer
Perceptron (MLP), Support Vector Machine (SVM), and Random Forest (RF), were
evaluated for predicting soybean (Glycine max) seed yield using hyperspectral
reflectance. For this aim, hyperspectral reflectance data for the whole spectra ranged
from 395 to 1005 nm, were collected at the R4 and R5 growth stages on 250 soybean
genotypes grown in four environments. The Recursive Feature Elimination (RFE)
approach was performed to reduce the dimensionality of the hyperspectral reflectance
data and select variables with the largest importance values. The results indicated that
R5 is more informative stage for measuring hyperspectral reflectance to predict seed
yields. The 395 nm reflectance band was also identified as the high ranked band in
predicting the soybean seed yield. By considering either full or selected variables as the
input variables, the ML algorithms were evaluated individually and combined-version
using the Ensemble-Stacking (E-S) method to predict the soybean yield. The RF
44
algorithm had the highest performance with a value of 84% yield classification accuracy
among all the individual tested algorithms. Therefore, by selecting RF as the
metaClassifier for E-S method, the prediction accuracy increased to 0.93, using all
variables, and 0.87, using selected variables showing the success of using E-S as one of
the ensemble techniques. This study demonstrated that soybean breeders could
implement E-S algorithm using either the full or selected spectra reflectance to select the
high-yielding soybean genotypes, among a large number of genotypes, at early growth
stages.
1.5 Introduction
The world population is projected to exceed nine billion individuals by 2050, which
will require significant improvements in the yield of major crops that contribute to global
food security (Tilman et al., 2009; Foley et al., 2011; Alexandratos and Bruinsma, 2012;
Dubey et al., 2019). Increasing the yield is the primary goal of most plant breeding
programs for major crops, such as soybean (Glycine max), which is the world’s most
widely grown leguminous crop and an important source of protein and oil for food and
feed (Hartman et al., 2011). In the area of plant breeding, however, measuring primary
traits, such as yield, which is under influenced by a combination of quantitative and
qualitative traits, in large breeding populations consisting of several thousand genotypes
is time and labor-consuming (Araus and Cairns, 2014; Cai et al., 2016; Xiong et al., 2018).
Breeding for yield is known as a highly complex and non-linear process due to the genetic
45
and environmental factors (Collins et al., 2008). Therefore, breeding approaches that are
established based on secondary traits (e.g., yield component traits and reflectance
bands), which are strongly correlated with the primary trait, enable plant breeders to
efficiently recognize promising lines at early growth stages (Ma et al., 2001; Jin et al.,
2010; Montesinos-López et al., 2017b).
The combination of high-throughput genotyping and phenotyping technologies have
enabled plant breeders to make their early growth stage selections more accurate while
it reduced the evaluation time and cost in their breeding programs (Rutkoski et al., 2016a).
Although there has been significant progress in high-throughput genotyping in recent
years with a direct impact on current plant breeding challenges (Araus and Cairns, 2014;
Tardieu et al., 2017; Araus et al., 2018), acquisition of high-throughput field phenotyping
is still a bottleneck in most breeding programs (Furbank and Tester, 2011; Araus et al.,
2018).
Remote sensing of spectral reflectance is considered as an efficient high-throughput
phenotyping tool (Araus and Cairns, 2014; Tardieu et al., 2017), which aims to measure
the spectral reflectance efficiently at several plant growth and development stages in
large breeding populations (Rutkoski et al., 2016a). It is well documented that the spectral
properties are genotype-specific and influenced by the anatomy, morphology, and
physiology of plants (Kycko et al., 2018; Schweiger et al., 2018) and, therefore, can be
used for screening plant genotypes with different agronomic potentials.
46
Analyzing large datasets consisting of spectral reflectance data requires intensive
computational and statistical analyses, which is still challenging in many plant breeding
programs (Lopez-Cruz et al., 2020). Nowadays, machine learning algorithms have drawn
attention from researchers to develop model-based breeding methods that can improve
the efficiency of breeding processes (Hesami et al., 2020b). Recently, one of the most
common artificial neural networks (ANNs), the Multilayer Perceptron (MLP) developed by
Pal and Mitra (1992a), has been broadly used for modeling and predicting complex traits,
such as yield, in different breeding programs (Geetha, 2020). MLP can be considered as
a non-linear computational method employed for various tasks such as classification and
regression of complex systems (Chen and Wang, 2020; Hesami et al., 2020d). This
algorithm is able to detect the connection and relationship between the input and output
(target) variables and quantify the inherent knowledge existing in the datasets (Ghorbani
et al., 2016; Hesami et al., 2020d). This algorithm includes several highly interconnected
processing neurons that can be used in parallel to detect a solution for a specific problem
(Ghorbani et al., 2016; Geetha, 2020). Support vector machines (SVMs), developed by
Vapnik (2000), are known as one of the powerful and easy to use machine learning
algorithms that can recognize patterns and behavior of non-linear relationships (Auria and
Moro, 2008; Su et al., 2017). Some of the advantages of SVMs over MLP are linked to
the complexity of the networks. SVMs usually use a large number of learning problem
formulations leading to solving a quadratic optimization problem (Feng et al., 2020b;
Hesami et al., 2020d). In theory, SVM has to be better performance because of using
47
structural risk minimization inductive principles rather than the empirical risk minimization
inductive principle (Belayneh et al., 2014). In addition to MLP and SVM, Random Forest
(RF - (Breiman, 2001)) is another method for data modeling with a computational efficient
training phase and very high generalization accuracy. RF has been broadly used in areas
such as object recognition (Lepetit et al., 2005), skin detection (Khan et al., 2010), plant
phenomics (Falk et al., 2020), and genomics (Mokry et al., 2013).
Machine learning algorithms are subject to overfitting, mainly because of limited
training data and dependent on single predictive models (Ali et al., 2014; Feng et al.,
2020b). Ensemble techniques, in which a group of algorithms are exploited to combine
all the possible predictions for the ultimate prediction used to address this shortage
(Dietterich, 2000). By using ensemble models, the predictive performances were
improved for yield prediction in Alfalfa (Feng et al., 2020b), Nicosia wastewater treatment
plant (Nourani et al., 2018), and plant lncRNAs (Simopoulos et al., 2018). Boosting,
bagging, and stacking are three of the most commonly used ensemble models (Dietterich,
2000; Feng et al., 2020b). The bagging method was first introduced by Breiman (1996)
as a variance reduction approach for different algorithms such as decision trees or other
algorithms that employed variable selection and fitting in a linear model (Galar et al.,
2011). Boosting algorithms have been introduced by Schapire (1999) to serve as the
alternative for the bagging method (Drucker and Cortes, 1996). Unlike bagging methods,
which are parallel ensemble techniques, boosting methods are known as sequential
ensemble techniques of base models by exploiting the dependencies of each algorithm
48
(Dietterich, 2000; Feng et al., 2020b). Many studies reported the successfulness of using
bagging-RF and stochastic gradient boosting in predicting the yield of different crops (Pal,
2007; Gandhi et al., 2016; Aghighi et al., 2018; Zhang et al., 2019b). Bagging and
boosting ensemble techniques commonly combine homogeneous algorithms for
interpretation, while stacking methods tend to use heterogeneous algorithms and adjust
the difference between them to increase precision (Dietterich, 2000; Zhou, 2012; Feng et
al., 2020b).
The successful use of machine learning algorithms for predicting the performance
of different agronomic traits, including yield, are reported in Alfalfa (Feng et al., 2020b),
Senecio species (Carvalho et al., 2013), grassland (Feilhauer et al., 2017; Rocha et al.,
2018), and chrysanthemum (Hesami et al., 2019b). However, the application of machine
learning algorithms for predicting soybean yield from hyperspectral reflectance data is still
unexplored and required comprehensive studies. Ensemble-based methods have
successfully been applied to improve the prediction accuracies in artificial intelligence and
computer vision (Ali et al., 2014; Feng et al., 2018; Ju et al., 2018; Feng et al., 2020b)
and, therefore, they may improve the accuracy of the yield prediction in this study. Thus,
the main objectives of this study are: (1) to investigate the potential use of hyperspectral
reflectance for predicting soybean yield, (2) to identify appropriate time-point of soybean
growth stages for collecting hyperspectral reflectance to maximize yield prediction
accuracy, and (3) to have a comparative study of individual and ensemble machine
learning algorithms to improve the accuracy of predicting yield. The results of this study
49
might help soybean breeders to increase the efficiency of selecting superior lines by
estimating the final yield at early growth stages using spectral reflectance combined with
machine learning approaches.
1.6 Materials and Methods
1.6.1 Experimental locations and plant materials
The research was conducted at the University of Guelph, Ridgetown Campus, in
2018 and 2019. A panel of 250 soybean genotypes was grown under field condition at
two locations: Ridgetown (42°27'14.8"N 81°52'48.0"W, 200m above sea level) and
Palmyra (42°25'50.1"N 81°45'06.9"W, 195 m above sea level), in Ontario, Canada during
two consecutive growing seasons in 2018 and 2019.
The soybean genotypes used in this study were the core germplasms of the
soybean breeding program at the University of Guelph, Ridgetown that have been
collected in the past 35 years and used for genetic studies and cultivar developments.
The ID, name and pedigree of the tested genotypes were presented in Table 2.1. The
experiments were conducted using randomized complete block designs (RCBD) with two
replications in four environments (two locations × two years). Overall, there were 500
soybean plots per environment and 1000 soybean plots per year. In order to reduce the
possible spatial variability in the field, each experiment was analyzed by nearest-neighbor
analysis (NNA) as one of the error control strategies by using double covariate analysis
50
(Stroup and Mulitze, 1991; Bowley, 1999; Katsileros et al., 2015a). Each plot consisted
of five rows, each 4.2 m long with a row spacing of 43 cm. The seeding rate was 57 seeds
per m2. At the end of the season, the three inside rows were machine harvested for
estimating total seed yields (T ha-1).
1.6.2 Phenotypic evaluations
1.6.2.1 Yield collection
Soybean seed yield (T ha-1) was measured using three out of five harvested rows
for each plot and adjusted to a 13% moisture level and maturity data. The best linear
unbiased prediction (BLUP) as a mixed model was used to calculate the average seed
yield production for each soybean genotype across different environments (Goldberger,
1962).
1.6.2.2 Hyperspectral reflectance data collection
In this study, the focus was on the spectral reflectance bands that are typically
classified as the visible (VIS) and near-infrared (NIR) spectral components (Albetis et al.,
2017). Canopy hyperspectral reflectance measurements were collected during the
soybean growth and development stages at R4, where pods are 1.91 cm long at one of
the four uppermost nodes, and R5, where seeds are 0.31 cm long in pods at one of the
four uppermost nodes (Pedersen et al., 2004). The data were collected was based on the
51
variations that were existed at R4 and R5 among genotypes across different
environments.
Each soybean genotype's hyperspectral reflectance properties were collected using
a UniSpec-DC Spectral Analysis System (PP Systems International, Inc. 110 Haverhill
Road, Suite 301 Amesbury, MA 01913, USA). The machine covers 250 reflectance bands
between 350 nm and 1100 nm with a bandwidth of 3 nm. The field-of-view of the sensor
was approximately 25° and covered a sample area of 0.25 m2. Dark reference was used
for calibrating the dual channels, and Spectralon panels were used to characterize
incoming solar radiation. For each plot, three measurements were recorded at the same
spot in order to reduce the noise, and their average, calculated by the BLUP model, was
used as the reflectance band datapoint. All of the measurements were performed as close
to solar noon as possible – the data for each stage were collected in one day, from 11:00
AM to 2:00 PM, to minimize the signal noise associated with the environment. The
hyperspectral reflectance of each location measured at one day with one device.
1.6.3 Data pre-processing and statistical analyses
The existence of noise during hyperspectral reflectance measurement is inevitable,
typically caused by sensors and electronic fluctuations (Ozaki et al., 2006). Therefore, it
would be critical to have a pre-processing step for the collected data in order to increase
the accuracy of the study. The hyperspectral data and yield of 250 soybean genotypes
were pre-processed such as normalizing, cleaning, and formatting using the R software
52
(version 3.6.1) to remove potential noises that randomly occur across the whole spectra
resulting in misinterpretations. After checking the quality of reflectance bands and
detecting outliers by using principal component analysis (PCA) for each genotype, 245
genotypes were selected for further analyses (Serneels and Verdonck, 2008). As a result
of sensor-specific artifacts, reflectance bands at the two edges of the hyperspectral
reflectance spectrum, 350-395 nm and 1005-1100 nm were removed from the original
data because of the missing data and noises. The collected contiguous hyperspectral
reflectance data was also reduced from 395-1005 nm with a 3 nm interval to a 10 nm
interval leading to a total of 62 variables. Data scaling and centering were applied in order
to improve reflectance properties in the pre-processing and the pre-treatment steps
(Rossel, 2008). For each reflectance band variable, the Savitzky–Golay filter was applied
for improving the signal-to-noise ratio (Savitzky and Golay, 1964).
As shown in Figure 2.1, the measured soybean yield was divided into four classes
with equal numbers (~) of data points in ascending order: Low (0–24.99% of total yield),
Medium-Low (25–49.99% of total yield), Medium-High (50-74.99% of total yield), and
High (75 –100% of total yield) yield. The way that the yield was classified was based on
ordering genotypes from low to high yield and select 25% of genotypes that had the lowest
yield, 25% medium-low, medium-high, and high. While 62 reflectance bands were
considered as input variables, the classified yield was chosen as the output variable.
Overall, the following statistical model was used in this study:
53
𝑌𝑖𝑗 = 𝜇 + 𝑓(𝑠) + 𝐺𝑖 + 𝐸𝑗 + 𝐺𝐸𝑖𝑗 + 𝜀𝑖𝑗 , 𝑖 = 1, … , 𝑘; 𝑗 = 1, … , 𝑛 (Eq 2.1)
Where Yij stands for the trait of interest (soybean seed yield and hyperspectral
reflectance) as a function of an intercept μ, f(s) stands for the spatial covariate, Gi is the
random genotype effect, Ej stands for the fixed environment effect, GEij is the
genotype×environment interaction effect, and εij stands for the residual effect.
1.6.4 Variable selection
Feature or variable selection is usually applied before developing the machine
learning algorithms for reducing the data dimensionality, specifically in the small training
datasets. One of the common approaches for variable selection is the Recursive Feature
Elimination (RFE) approach, which is easy to configure and effectively select the most
relevant variables that predict the output (Chen and Jeong, 2007). Therefore, the RFE
was run to indicate the initial variable importance scores and eliminate the reflectance
band variables with the lowest importance score. Afterward, the process was recursively
repeated until the ranking was indicated for all the reflectance bands. The package caret
(Kuhn, 2008) in R software version 3.6.1 was used for running RFE.
1.6.5 Data-driven modeling
Three of the most commonly used algorithms in the literature, Multilayer Perceptron
(MLP), the Support Vector Machine (SVM), and Random Forest (RF) (Filippi and Jensen,
2006; Chen et al., 2007; De Castro et al., 2012; Makantasis et al., 2015; Šestak et al.,
54
2019; Zhang et al., 2019a), were chosen and used for predicting the soybean yield. Figure
2.2-A shows the MLP algorithm including an input layer, one or more hidden layers, and
an output layer of completely interconnected neurons. Each neuron unit produces an
output based on a sigmoid function derived from a linear combination of outputs from a
previous layer (Wang et al., 2009). SVM (Figure 2.2-B) is a set of related supervised
learning methods that can recognize patterns used for classification analyses (Suykens
and Vandewalle, 1999; Shao et al., 2012). The objective of SVM is to use hyperplanes
for determining the optimal separation of yield classes. The Random Forest (RF)
approach generates a series of trees representing a subset n of independent
observations (Figure 2.2-C). A detailed description of these machine learning algorithms
can be found in Taillardat et al. (2016) and Meinshausen (2006). All of the relevant
parameters in each machine learning algorithm was optimized based on the input
variables.
We employed an ensemble method based on a stacking strategy (E-S) to improve
the prediction performance. The results from individual algorithms were collected and
combined together via the stacking procedure described in Dietterich (2000), where an
algorithm with the highest accuracy performance was selected as the metaClassifier for
this ensemble model. The Weka software version 3.9.4 (Hall et al., 2009) was used for
running all machine learning algorithms and the ensemble method.
55
1.6.6 Quantification of machine learning performance
In this study, the five-fold cross-validation strategy (Siegmann and Jarmer, 2015)
with ten repetitions was used to measure the classification quality of all the tested ML
algorithms (Figure 2.3). In order to evaluate each algorithm, the values of precision (Eq
2.2), recall (Eq 2.3) as a measure of sensitivity, F-Measure (Eq 2.4), and Matthews
Correlation Coefficient (MCC, Eq 2.5) for validation dataset were measured using the
following formulas:
Precision =𝑇𝑃
𝑇𝑃+𝐹𝑃 (Eq 2.2)
Recall =𝑇𝑃
𝑇𝑃+𝐹𝑁 (Eq 2.3)
F − Measure = 2 ×Precision×Recall
Precision+Recall (Eq 2.4)
MCC =𝑇𝑃×𝑇𝑁−𝐹𝑃×𝐹𝑁
√(𝑇𝑃+𝐹𝑃)(𝑇𝑃+𝐹𝑁)(𝑇𝑁+𝐹𝑃)(𝑇𝑁+𝐹𝑁) (Eq 2.5)
where TP stands for the number of true positives, TN is the number of true
negatives, FP stands for the number of false positives, and FN is the number of false
negatives.
56
1.6.7 Visualizing and analyzing
The Microsoft Excel software (2016), ggplot2 (Wickham, 2011), and ggvis (Dennis,
2016) packages in the R software version 3.6.1 were used to conduct statistical analyses
such as BLUP calculation, preprocessing data, and visualize the results.
1.7 Results
1.7.1 Yield Statistics and Spectral Profiles
In the current study, the average yield of 245 soybean genotypes, evaluated in four
environments, ranged from 2.58 to 5.71 t ha-1 with a mean and standard deviation of 4.22
and 0.57 t ha-1, respectively. The variation of yield among genotypes across different
environment was shown in Figure 2.4. Also, the ANOVA table for yield was presented in
Table 2.2. The heritability of yield was estimated as 0.24. The minimum, mean, and
maximum values of each reflectance bands evaluated for all the genotypes across the
four environments at the R4 and R5 growth stages are reported in Figure 2.5. At both
growth stages, while the reflectance values showed small ranges of variation among the
genotypes between 395 nm and 695 nm, the bands greater than 705 nm showed large
variations within the population.
57
1.7.2 Variable selection
The importance values of all 62 reflectance band variables for predicting yield were
estimated using the RFE strategy for the R4 (Table 2.3) and R5 (Table 2.4) growth stages.
For the R4 growth stage, the 1005 nm and the 605 nm bands had the highest and lowest
importance values (%) for classifying the soybean yield, respectively. Based on RFE
analysis, 56 of the reflectance bands were selected for training the algorithm, as shown
in Figures 2.6-A. At the R5 growth stage, the highest and lowest importance values (%)
were found at 395 nm and the 725 nm bands, respectively. Out of 62 reflectance bands,
21 reflectance bands were selected to train the algorithms based on RFE strategy, which
were considered selected variables (-VS) for further analyses. Among the 21 selected
reflectance bands, three bands were in the violet, six in the blue, two in the green, eight
in the red, and two were in the near-infrared (NIR) regions of the spectrum (Figure 2.6-
B). By using RFE for the R4 growth stage dataset, the top five high importance reflectance
bands were located in the violet and NIR regions of the spectra. However, for R5, the
violet and red regions had the top five high importance reflectance bands (Table 2.2 and
2.3). The violet region, specifically the 395 nm band, had the highest importance values
in both growth stages. The plotting of the soybean yield vs. reflectance values at 395 nm
(R5 growth stage) illustrated that the values for 395 nm in the high yielding class ranged
from 0.009 to 0.016 which lower than values for the low yielding class, ranged from 0.020
to 0.029 (Figure 2.7). The difference between the reflectance values of high and low
yielding classes was statistically significant at the significance level of 0.05 (data were not
58
shown). Among all the tested bands, the 395 nm band measured at R5 was considered
as the best solo reflectance band for discriminating soybeans for their yield potential.
1.7.3 Growth stage comparison
In order to investigate which of the growth stages is better for collecting reflectance
data and predicting the soybean yield, the reflectance bands collected at each soybean
growth stage were analyzed using the three machine learning algorithms. The average
classification accuracy for validation dataset ranged between 12% and 43% using the R4
data and between 12% and 99% using the R5 data, which indicated that the R5 soybean
growth stage is, in general, a better stage for collecting reflectance data if the goal is to
predict the yield (Figure 2.8). Therefore, R5 was selected for further machine learning
algorithm analyses. The results of individual and ensemble machine learning algorithms
using R4 data are available in Figure 2.8 and Figure 2.9.
1.7.4 Comparative analysis of the developed models
All three algorithms (RF, MLP, and SVM), as well as the E-S model, were trained
using both full (62 bands) and selected (21 bands) variables at R5, and the summaries of
the confusion matrices were presented in Table 2.5. Regarding the comparative analyses
of individual algorithms using all variables, RF, MLP, and SVM had the highest to lowest
MCC values equal to 0.84, 0.76, and 0.66, respectively (Figure 2.10-A). For the selected
variables, the MCC values for RF and MLP declined to 0.80 and 0.73, respectively, while
59
the value for SVM slightly increased to 0.73. The E-S method outperformed all the
individual algorithms obtaining an MCC value of 0.93, using all variables, and 0.87, using
selected variables (Figure 2.10-A). In general, among all the individual tested algorithms,
the RF algorithm had the highest performance with the values of 84% and 80% yield
classification accuracy using all variables and selected variables, respectively.
When using full variables, the precision values for RF, MLP, and SVM were 0.91,
0.83, and 0.82, respectively. However, by using selected variables, the precision values
for RF and MLP declined to 0.87 and 0.78, respectively, while the SVM performance was
improved (0.87) when compared against all variables (Figure 2.10-B). The E-S model had
a precision of 0.96 using all variables and 0.90 using selected variables. Using all
variables, the highest recall value was obtained for RF with a value of 0.84, followed by
MLP and SVM with the values of 0.83 and 0.68, respectively. However, the recall value
of SVM increased to 0.72 using selected variables. The recall values of RF and MLP
declined when selected variables were used (Figure 2.10-C). E-S had the highest recall
values, with 0.94 and 0.90 for full and selected variables, respectively.
To have a better interpretation of precision and recall values, the F-Measure was
evaluated for each and every algorithm. Using all variables, the F-Measures of RF, MLP,
and SVM were estimated to be 0.87, 0.81, and 0.71, respectively (Figure 2.10-D). F-
Measure values were decreased for RF (0.84) and MLP (0.80) using selected variables.
However, for SVM algorithm, the F-Measure value was increased to 0.77 by using
60
selected variables. The E-S algorithm outperformed all the individual machine learning
algorithms by having an F-Measure value of 0.94, using all variables, and 0.90, using
selected variables.
1.8 Discussion
One of the objectives of this study was to find the best growth stage for collecting
reflectance data suitable for predicting soybean yields. In this study, the hyperspectral
reflectance data were collected at the reproductive stages of R4 and R5, in which pods
and seeds are developed. R4 and R5 are known as critical growth and development
stages in soybean since stresses can impose significant impacts on the yield at these
stages and, therefore, soybean genotypes with different levels of tolerance to stresses
can be discriminated from one another at these stages (Sweeney et al., 2003). For
example, the results of a study by Eck (1987) showed that imposing soybeans to water
deficit stress during the R1 to R3 growth stages reduce seed yields up to 9-13%.
However, imposing the same soybeans to water deficit stress during R4 to R5 reduced
seed yields up to 46%. Water deficit stress can less influence the total seed yield when
occurring anytime beyond the R5 growth and development stage. Therefore, measuring
hyperspectral reflectance at R4 and R5 would be more informative for predicting the
soybean yield classes since the final yield production has been to some extent
established at these two stages for all the genotypes. Our results indicated that R5 is a
better stage to measure reflectance bands for predicting the yield.
61
Ma et al. (1996) reported significant correlations between leaf photosynthetic rates
and leaf greenness at R4 and R5, while this correlation was not significant at R6. In
soybean, the leaf photosynthetic rate can be changed significantly during the growth
stages that, in turn, can empower different genotypes to recover the yield losses caused
by temporary environmental stresses (Ferris et al., 1998; Siebers et al., 2015). Studies
showed that environmental stresses at R5 can damage the soybean yield greater than
that at R4 (Fehr et al., 1981) since the plants have less time to recover for yield before
physiological maturity. It can be hypothesized that predicting yield of genotypes with
different genetic potential by using reflectance bands that are measured at R5 is more
reliable since the final yield productions have already been established, to some extent,
for all the genotypes. The current study showed that the reflectance bands collected at
R5 are more reliable and informative for predicting yield than the data collected at R4
because of the level of accuracy.
Several studies reported the strong correlation between reflectance bands and yield
in different crop plants such as alfalfa (Kayad et al., 2016; Feng et al., 2020b), wheat
(Prey and Schmidhalter, 2019), maize (Lane et al., 2020), rice (Wan et al., 2020), and
sugarcane (Verma et al., 2020). The visible reflectance bands can be split into three main
categories, red (650-700 nm), green (550 nm), and violate-blue (390 – 499 nm)
(Hennessy et al., 2020). Most studies were emphasized the importance of red spectral
bands, or combined used of red and red edge bands as one solid index in predicting the
total yield (Jolly et al., 2005; Filippa et al., 2018a; Lykhovyd, 2020; Phan et al., 2020;
62
Tiwari and Shukla, 2020). In this study, we identified the 395 nm as the high ranked
reflectance band in classifying the soybean seed yield. The waveband centered at 395
nm belongs to the blue region of the spectral which is correlated with carotenoid,
anthocyanins, and chlorophyll absorptions (Merzlyak et al., 2003; Richter et al., 2016).
Carotenoid plays a pivotal role in discrimination of senescent leaves (Richter et al., 2016;
Hennessy et al., 2020). The importance of 395 nm band in soybean yield prediction at R5
growth stage can refer to the fact that soybean at R5 growth stage initiates the
senescence and decay of chlorophyll resulting in better discrimination of the genotypes
with different capacities. However, there is no report on the solid effect of the 395 nm
reflectance band in the physiological process of soybean or any other plants.
In order to have accurate yield prediction and avoid model overfitting, machine
learning algorithms may benefit from using a variable selection process to reduce the
dimensionality of the data to an appropriate level (Hennessy et al., 2020). Existing
variable selection methods can be categorized based on their applications, complexities,
and accuracy (Zheng et al., 2020). One of the most commonly used variable selection
methods is the RFE approach that provides an acceptable performance with moderate
computational exertions (Guyon et al., 2002; Granitto et al., 2006). The successful use of
RFE to reduce the number of input variables has been reported in many studies (Granitto
et al., 2006; Chen and Jeong, 2007; Feng et al., 2020b). The efficiency of using selected
variables for predicting classified soybean yield over full reflectance band variables was
evaluated using the RFE method. Using RFE method might decrease the value of the
63
parameters such as precision, recall, MCC, and F-Measure to avoid overfitting (Loughrey
and Cunningham, 2004). This is a small price to pay, especially if the decrease in
performance is not significant. Among all the tested individual machine learning
algorithms, RF had the highest performance using either full or selected variables. This
high performance may come from the nature of the RF algorithm, in which trees are
trained using multiple random subsamples of the original dataset. This feature gives RF
this ability to generate better and more stable predictions for new instances not
necessarily included in the training dataset (Liaw and Wiener, 2002).
MLP was another machine learning algorithm that was exploited in this study. MLP
was previously applied in different areas such as weed science (Tamouridou et al., 2017)
or drought tolerance (Etminan et al., 2019), but not in soybean for yield prediction. This
study found MLP to be the second-best machine learning algorithm for predicting the
soybean yield. Previous studies reported a high likelihood of overfitting for neural network
algorithms (Lawrence and Giles, 2000; Murakoshi, 2005). For MLP, common parameters
such as the number of hidden layers, the number of neurons in each layer, or training
time can be used to control overfitting; however, the degree of overfitting would vary
throughout the input variables (Lawrence and Giles, 2000).
SVM is also one of the most common machine learning algorithms that have been
broadly used in different areas such as plant tissue culture (Hesami and Jones, 2020b),
image classification (Lin et al., 2011), genes classification (Duan et al., 2005), and drug
64
disambiguation (Björne et al., 2013). SVM has been used when scientists had to deal with
large numbers of features and high sparsity (Nguyen and De la Torre, 2010). Although
the prediction accuracy of the SVM algorithm was lower than the values for RF and MLP
in this study, its performance was slightly increased when the selected variables were
used. An increase in SVM performance using selected variables was also reported in
previous studies (Su and Yang, 2008; Tan et al., 2010; Alirezanejad et al., 2020). It might
be due to this fact that selecting relevant variables can improve the performance of SVM
through ameliorating its feature interpretability, computational efficiency, and
generalization performance (Nguyen and De la Torre, 2010; Roy et al., 2015).
In order to see we can improve the prediction accuracy in this study through the
combined use of the machine learning algorithms, RF, MLP, and SVM were used in
constructing E-S, and RF was chosen as the metaClassifier for this ensemble algorithm.
By using the E-S approach, we improved the prediction accuracy using either full or
selected variables. A successful use of the E-S method has recently been reported for
predicting the yield in alfalfa (Feng et al., 2020b). When using the E-S approach, it is
necessary to include self-sufficient, independent, and diverse ML algorithms in the
analyses (Araya et al., 2017; Feng et al., 2020b), which have a minimum dependency
from one another and sufficient powers to predict the dependent variable, soybean yield
classes in this study (Araya et al., 2017; Feng et al., 2020b). The above criteria are
important to be considered when individual ML algorithms are selected to combine in a
given E-S analysis. In this study, RF, MLP, and SVM are selected as individual algorithms
65
to be used in the E-S analyses because of their independent prediction methods as well
as having different training approaches. By using the E-S approach, the prediction
accuracy increased to 0.93, using all variables, and 0.87, using selected variables
showing the success of using E-S as one of the ensemble techniques.
1.9 Acknowledgments
The authors would like to acknowledge the technical assistance of Dr. Sepideh
Torabi, Mr. Bryan Stirling, Mr. John Kobler, Mr. Robert Brandt, and all the soybean
breeding crew at the University of Guelph, Ridgetown Campus.
66
Table 0.1. The ID, name, and pedigree of the tested genotypes.
ID Name Pedigree
G1 AR 03 JewelxB152
G2 AR 04 Elignxa81-151014
G3 AR 05 Elgin87 xA3127
G4 AR 06 KG 60xA2943
G5 AR 07 (Conrad )xRact8502
G6 AR 08 T8508xOac86-07
G7 AR 09 M83-91x(t8508)
G8 AR 10 T8508 x9292
G9 AR 11 Rcat 8802 xS26-06
G10 AR 12 Rcat8703xs26-06
G11 AR 13 RC8802xANGORA
G12 AR 15 TABBYxSTURDY
G13 AR 02 T8112x(B152)
G14 1983.40.001 B152x(Hw8039)
G15 1990.101.001 9303xT8505
G16 1990.119.001 TABBYxSTURDY
G17 1992.103.001 HS88-4907xRCAT9108
G18 1992.84.001 AP1989xCALICO
G19 1994.49.123 A2615xS2492
G20 1994.52.834 S2492xBOBCAT
G21 1994.53.164 S2492xOACBAYFIELD
G22 1994.68.102 U91-2527xS2492
G23 1994.76.336 BL-JACK21xCALICO
G24 1995.74.256 S2492xYB30J
G25 1995.94.653 J251xRcat9404
G26 1996.113.656 S14H4xIvory
G27 1997.132.120 Rcat9507xLegacy 90a.03
G28 1997.64.87 sw3308xs1520
G29 1998.102.86 Ivoryx92B 91
G30 2000.144.02 IA1008 x LN955414
G31 2000.144.175 ph9718xivory
G32 2001.138.2.13 IA1008xRcat9910(Pro3002)
G33 2001.178.12 RG9xRcat9905
G34 2001.196.1.18 IA1008xRcat9910(pro3002
G35 2002.188.14.01 OackentxPS73
G36 2002.196.31.1 Pro30-05xOackent
67
G37 2002.218.24.01 pro3170xpro3005
G38 2003.165.14.26 Pro30-02xRcat0204
G39 2003.168.19.05 harwichxrcat2004
G40 2004.173.001 colbyxKatrina
G41 2004.181.156 IA2068xKatrina
G42 2004.193.3.02 ColbyxPinehurst
G43 2004.199.18.01 PinhurstxA01-4509031
G44 2004.202.19.04 OackentxPinehurst
G45 2004.230.15.01 PinehurstxRuthven
G46 2005.163.23.4 dh410xpinehurst
G47 2005.164.001 T0142xrc0412scn
G48 2005.167.9.06 OacKentxDH410
G49 2005.170.001 Katrinaxuo1290300
G50 2005.173.16.90 Rcat9507xKatrina
G51 2005.183.9.06 RuthvenxA03-741001
G52 2005.185.6.01 Rcat0412xA03-741001
G53 2005.207.36.06 Pinehurstxpin*Champion*Palme
G54 2006.167.47 A04-54307xM98-227065
G55 2006.169.001 M98227065xcobly
G56 2006.177.001 s12a5xm9822065
G57 2006.188.662 LD01-7323scnxCobly
G58 2006.190.792 PinhurstxLD01-7323scn
G59 2006.197.1062 LD01-7323scnxLG00-6925
G60 2006.198.1143 LD01-7323scnxMn0902scn
G61 2006.201.1381 BelmontxLD01-7323scn
G62 2007.200.566 katrinaxa04543037
G63 2007.201.14 m00378032xkatrina
G64 2007.206.40.01 a05112034xkatrina
G65 2007.208.595 hdgoshenxa04543037
G66 2007.211.867 sv9003mfa/6xhdgoshen
G67 2008.160.145 StarfieldxSc2307
G68 2008.160.159 StarfieldxSc2307
G69 2008.160.85 StarfieldxSc2307
G70 2008.169.519 Dh410xRcat0702
G71 2009.131.368 GoshenxDh410
G72 2009.139.398 DH410x100Rc13iso
G73 2009.140.687 S23T5xrcat0802
G74 2009.142.713 S23T5xDH410
68
G75 2009.142.736 S23T5xDh410
G76 2009.144.776 S23t5xOacmarvel
G77 2009.144.784 s23t5xMarvel
G78 2009.144.791 s23t5xmarvel
G79 2009.153.10.01 DH410x100rc13iso
G80 2009.153.10.02 DH410x100rc13iso
G81 2013.122.3283 rcat0908xs15-c2
G82 2009.176.3.01 s23t5x100rc13isoflavon
G83 2010.121.173 Rcat0905nxMerion
G84 2010.122.313 MerionxSc4009n
G85 2010.122.394 MerionxS23T5
G86 2010.125.621 OacMerionXS23T5
G87 2010.126.508 MerionxS18R6
G88 2010.127.568 Rcat0905nxS18R6
G89 2010.128.5.01 Sc4009xSc2407
G90 2010.137.1185 NaturexS23T5
G91 2010.141.1727 Sc4009x0x902
G92 2010.145.2020 s18R6xOx902
G93 2010.146.13.04 Rcat0905nxOx802
G94 2010.146.22.02 Rcat0905nxOx802
G95 2010.149.2191 S23T5xOx802
G96 2010.149.2193 S23T5xOx802
G97 2010.155.771 S23T5xOac08-22C
G98 2010.162.07.02 S18R6xRcat0902
G99 2010.162.16.02 S18R6xRcat0902
G100 2010.162.2.02 S18R6xRcat0902
G101 2010.162.21.03 S18R6xRcat0902
G102 2010.162.8.05 S18R6xRcat0902
G103 2011.121.44 ThamesxMfa0931L
G104 2011.123.04 ThamesDH4202
G105 2011.123.105 ThamesxDh4202
G106 2011.123.109 ThamesxDh4202
G107 2011.132.608 ThamesxRc0901n
G108 2011.133.633 rc1002nxDFh4173
G109 2011.134.10.03 Rc0901nxSc4110
G110 2011.134.3.01 Rcv0901nxSc4110light
G111 2011.139.828 S18r6xOacHuron
G112 2011.140.5.02 HdgoshenxS12r6
69
G113 2011.140.5.03 hdcGoshenxs18r6
G114 2011.140.696 OacMerionxS18R6
G115 2011.140.700 HDCGoshen xS18R6
G116 2011.147.1183 RC1002xDh4173
G117 2011.151.801 rc1002nxs18r6
G118 2011.157.20.07 RCAT 1004 x DH 4202
G119 2011.157.20.19 RCAT 1004 x DH 4202
G120 2011.157.22.11 RCAT 1004 x DH 4202
G121 2011.157.22.22 RCAT 1004 x DH 4202
G122 2011.157.22.19 RCAT 1004 x DH 4202
G123 2011.157.22.23 RCAT 1004 x DH 4202
G124 2011.157.23.02 RCAT 1004 x DH 4202
G125 2011.157.23.09 RCAT 1004 x DH 4202
G126 2011.157.23.18 RCAT 1004 x DH 4202
G127 2011.161.1885 Rc1004nxOacHuron
G128 2011.185.3.01 DH410xAr0616508
G129 2012.121.71 RCAT0908 x XC2211
G130 2012.121.84 Thames x XC 2211
G131 2012.131.1227 MERSEA x RCAT 1003
G132 2012.139.2053 OAC BROOKE x XC2211
G133 2012.140.02 SC 3809 x SC 4110N
G134 2012.142.4 SC 3809 x S18-R6
G135 2012.142.2289 OAC BROOKE x RCAT 1003
G136 2012.142.2300 OAC BROOKE x S18-R6
G137 2012.142.2352 OAC BROOKE x S18-R6
G138 2012.142.2362 OAC BROOKE x RCAT 1003
G139 2012.144.2579 brookexu09129007
G140 2012.145.2700 OAC BROOKE x AR09191003
G141 2012.145.2889 brookexar09191003
G142 2012.150.3179 rcat1005xsc4110n
G143 2012.150.3207 rcat1005xsc4110n
G144 2012.152.2490 RCAT 1005 x S23-T5
G145 2012.152.3313 RCAT 1005 x S23-T5
G146 2012.155.3478 rcat1005xrcat1002
G147 2012.160.18 RCAT 1001 x S18-R6
G148 2012.157.3617 RCAT 1001 x XC2211
G149 2012.160.3833 RCAT 1001 x S18-R6
G150 2012.160.3848 RCAT 1001 x S18-R6
70
G151 2012.161.3906 RCAT 1001 x S23-T5
G152 2012.173.5556 kenjiangdou43 x XC2211
G153 2013.13.139.1302 SC 7512 x DH 4175
G154 2013.123.134 Rcat0908 x Rcat Dover
G155 2013.135.1015 Sc7512n x OX111
G156 2013.135.1039 Sc7512n x OX111
G157 2013.136.1078 Sc7512n xS23-T5
G158 2013.136.1105 Sc7512n xS23-T5
G159 2013.136.1108 Sc7512n xS23-T5
G160 2013.145.1704 SC8412n x DH4175
G161 2013.174.2655 S21c-3 x SC3809
G162 RG10 Ems1640
G163 RG14 mapleGlenxEms1640
G164 RG2 clipxEllp2
G165 RG23 Clp-1xEllp-3
G166 rg25 Clp-1xEllp-3
G167 RG26 (an145-66x4475)x
G168 rg27 Rcat9203x(an145-70x
G169 rg2a Clp-1xEllp-2
G170 e4457 EmsC1640
G171 e4475 Ems-C1640
G172 RG7 Ems-Eglin87
G173 RG8 Ems-C1640
G174 RG9 ems-Elgin87
G175 c1640 EmsCentury
G176 RG-ellp2 ems-Elgin87
G177 RG-ellp3
G178 P01
G179 P02
G180 P03
G181 P04
G182 P05
G183 P06
G184 P07
G185 P08
G186 P09
G187 P10
G188 P11
71
G189 P12
G190 P13
G191 P14
G192 P15
G193 12.172.5413 Kemnjiangdoou43xIa2022
G194 P16 yel n.l
G195 P17
G196 P18
G197 P19
G198 P20
G199 P21
G200 LG15-1913
G201 LG15-2951
G202 LG15-4124
G203 LG15-2959
G204 LG15-4075
G205 LG15-1578
G206 LG12-18131
G207 LG14-13101
G208 14.143.883 LD 073419 x RCAT 1004N
G209 13.139.1321 SC7512N x DH4175
G210 13.137.1183 SC7512N x S18-R6
G211 13.142.1431 SC8412N x SC23-T5
G212 13.137.1174 SC7512N x S18-R6
G213 13.137.1177 SC7512N x S18-R6
G214 13.136.1127 SC7512N x S23-T5
G215 13.139.1264 SC7512N x DH4175
G216 13.135.1038 SC7512N x OX111
G217 13.139.1270 SC7512N x DH4175
G218 13.131.790 RCAT5311N x SC3809
G219 13.135.1035 SC7512N x OX111
G220 13.137.1155 SC7512N x S18-R6
G221 13.136.1119 SC7512N x S23-T5
G222 13.130.670 RCAT5311N x S18-R6
G223 13.137.1160 SC7512N x S18-R6
G224 13.139.1302 SC7512N x DH4175
G225 13.174.2711 S21-C3 x SC3809
G226 13.125.390 RCAT0908 x DH4175
72
G227 13.141.1384 SC7512N x SC6011N
G228 13.132.833 RCAT5311N x DH4175
G229 13.178.3049 S21-C3 x RCAT1106N
G230 13.132.827 RCAT5311N x DH4175
G231 13.145.1708 SC8412N x DH4175
G232 13.170.2359 M05-357149 x RCAT1106N
G233 13.131.742 RCAT5311N x SC3809
G234 13.135.1016 SC7512N x OX111
G235 13.148.1910 SC8412N x RCAT1106N
G236 13.129.540 RCAT5311N x S23-T5
G237 13.123.145 RCAT0908 x S18-R6
G238 13.136.1091 SC7512N x S23-T5
G239 13.135.1077 SC7512N x 0X111
G240 13.137.1153 SC7512N x S18-R6
G241 13.129.576 RCAT5311N x S23-T5
G242 13.123.132 RCAT0908 x S18-R6
G243 13.134.938 RCAT5311N x RCAT1106N
G244 13.131.713 RCAT5311N x SC3809
G245 13.148.1875 SC8412N x RCAT1106N
G246 13.136.1113 SC7512N x S23-T5
G247 13.148.1913 SC8412N x RCAT1106N
G248 13.141.1341 SC7512N x SC6011N
G249 13.136.1081 SC7512N x S23-T5
G250 13.136.1117 SC7512N x S23-T5
73
Table 0.2 Combined analyses of variances for yield in the 250 soybean genotypes across four tested environments (Ridgetown2018, Ridgetown2019, Palmyra2018, and Palmyra2019).
Fixed
Effect Num DF Adj MS F-Value P-Value
Maturity 1 95069527 238.23 0.000 Environment 3 380200802 777.64 0.000
Random
Effect Adj MS P-Value
Replication 146345973 0.000 Genotype 1850778 0.000
Genotype × Environment 490786 0.001
74
Table 0.3 Reflectance band ranking using the Recursive Feature Elimination (RFE)
strategy at R4 soybean growth stage.
Reflectance band (nm) Ranking Reflectance band (nm) Ranking
1005 1 775 32
395 2 655 33
945 3 675 34
755 4 665 35
985 5 825 36
995 6 485 37
705 7 695 38
745 8 475 39
965 9 615 40
955 10 465 41
715 11 515 42
975 12 625 43
905 13 685 44
725 14 565 45
885 15 575 46
875 16 535 47
895 17 645 48
765 18 545 49
845 19 635 50
915 20 555 51
865 21 415 52
735 22 505 53
925 23 595 54
855 24 455 55
805 25 585 56
795 26 445 57
935 27 425 58
835 28 435 59
815 29 525 60
405 30 495 61
785 31 605 62
75
Table 0.4 Reflectance band ranking using the Recursive Feature Elimination (RFE)
strategy at R5 soybean growth stage.
Reflectance band (nm) Ranking Reflectance band (nm) Ranking
395 1 965 32
665 2 845 33
675 3 865 34
655 4 905 35
405 5 915 36
685 6 515 37
645 7 695 38
435 8 885 39
445 9 875 40
635 10 895 41
475 11 585 42
485 12 925 43
495 13 935 44
415 14 575 45
625 15 955 46
425 16 715 47
455 17 755 48
465 18 535 49
765 19 975 50
615 20 555 51
775 21 985 52
795 22 995 53
815 23 525 54
805 24 545 55
785 25 565 56
505 26 945 57
605 27 705 58
825 28 1005 59
855 29 745 60
595 30 735 61
835 31 725 62
76
Table 0.5 Confusion matrix based on the performance of RF, MLP, SVM, and Ensemble-Stacking model in predicting the
soybean yield using full and selected variables in R5 soybean growth stage.
Full Variable Selected Variable
Algorithm
Low Medium-Low
Medium-High
High Algorithm
Low Medium-Low
Medium-High
High
RF Low 58 3 0 0 RF Low 57 4 0 0 Medium-Low
6 53 2 0
Medium-Low
5 54 2 0
Medium-High
0 3 47 12
Medium-High
0 3 46 13
High 0 0 5 56
High 0 0 9 52
MLP Low 54 7 0 0 MLP Low 52 9 0 0 Medium-Low
2 56 3 0
Medium-Low
5 55 1 0
Medium-High
0 3 56 3
Medium-High
0 2 54 6
High 0 0 6 55
High 0 0 7 54
SVM Low 33 24 4 1 SVM Low 37 21 2 0 Medium-Low
7 39 15 0
Medium-Low
7 40 14 0
Medium-High
0 13 37 12
Medium-High
0 10 38 14
High 0 2 15 44
High 0 2 14 47
E-S Low 57 4 0 0 E-S Low 53 8 0 0 Medium-Low
4 57 0 0
Medium-Low
7 49 5 0
Medium-High
0 0 54 8
Medium-High
0 5 51 6
High 0 0 6 55
High 0 0 8 53
77
RF: Random Forest, MLP: Multilayer perceptron, SVM: Support Vector Machine, and E-S: Ensemble-stacking strategy. Number of true and false instances were provided for each yield class.
78
Figure 0.1 The distribution of soybean genotypes in each yield class. All classes had the
same number of datapoints.
79
Figure 0.2 A schematic representation of the machine learning algorithms used in this
study to classify the soybean yield using reflectance bands: A) Multilayer Perceptron, B)
Support Vector Machine, and D) Random Forest. In all the tested machine learning
algorithms, hyperspectral reflectance was considered as input and soybean yield was
considered as output variables. All the algorithms were optimized based on the used
parameters. Each tested machine learning algorithm has its own parameters and
objective functions that all were considered in the analysis.
80
Figure 0.3 The scheme of data collection and machine learning algorithm development
and validation. OP: Optimizing parameters; MLP: Multilayer Perceptron; SVM: Support
Vector Machine; RF: Random Forest; E-S: Ensemble- stacking strategy. Hyperspectral
reflectance was considered as input and yield was considered as output variable. Then,
the dataset divided into training and testing based on 5-fold cross validation strategy. All
the tested machine learning algorithms were trained based on training dataset and their
efficiencies were tested using testing dataset. The appropriate algorithm was selected
based on the comparative analysis of different machine learning algorithms in predicting
soybean seed yield using hyperspectral reflectance data.
81
Figure 0.4 The boxplot variation of yield across four environments. The current boxplot
is based on yield raw data of each genotype collected from the four tested environments.
82
Figure 0.5 The minimum, mean, and maximum values of each reflectance band from 390
nm to 1005 nm were measured for 245 soybean genotypes evaluated at (A) R4 and (B)
R5 growth stages at four different field environments. R4 growth stage is known as the
pod elongation stage and R5 is the beginning of the seed development stage.
83
Figure 0.6 The importance value of selected variables based on the Recursive Feature
Elimination (RFE) strategy for soybean reflectance bands measured at R4 (A) and R5 (B)
soybean growth stages.
85
Figure 0.8 The accuracy of RF, MLP, SVM, and E-S algorithms for predicting soybean
yield using full and RFE selected variables (-VS) measured at R4 (A) and R5 (B) soybean
growth stages in four environments. The mean performance was shown as × in each
figure. MLP: Multilayer Perceptron; SVM: Support Vector Machine; RF: Random Forest;
E-S: Ensemble- stacking strategy; RFE: Recursive Feature Elimination.
86
Figure 0.9 Performance and error evaluation of RF, MLP, SVM and E-S model in soybean
yield prediction using full and selected reflectance bands from R4 growth stage (VS:
variable selection).
87
Figure 0.10 Performance of RF, MLP, SVM, and the E-S algorithms for soybean yield
prediction using all and selected variables (-VS) from the R5 growth stage. The mean
performance is indicated with as × in each figure. MLP: Multilayer Perceptron; SVM:
Support Vector Machine; RF: Random Forest; E-S: Ensemble- stacking strategy; RFE:
Recursive Feature Elimination.
88
2 Chapter 3: Using Ensemble bagging and Depp Neural
Network Algorithms and Evolutionary Optimization
Algorithms for Estimating Soybean Yield and Fresh
Biomass Using Hyperspectral Vegetation Indices
Published in Remote Sensing Journal
Mohsen Yoosefzadeh-Najafabadi,1 Dan Tulpan,2 and Milad Eskandari1
1. Department of Plant Agriculture, University of Guelph, Guelph, ON N1G 2W1,
Canada
2. Department of Animal Bioscience, University of Guelph, Guelph, ON N1G 2W1,
Canada
*Corresponding author. Email: [email protected]
Yoosefzadeh-Najafabadi, M., Tulpan, D., & Eskandari, M. (2021). Using Hybrid Artificial Intelligence and Evolutionary Optimization Algorithms for Estimating Soybean Yield and Fresh Biomass Using Hyperspectral Vegetation Indices. Remote Sensing, 13(13), 2555.
89
2.1 Abstract
Recent advances in high-throughput field phenotyping combined with sophisticated big
data analysis methods have provided plant breeders with unprecedented tools for a better
pre-diction of important agronomic traits, such as yield and fresh biomass (FBIO), at early
growth stages. This study is aimed to demonstrate the potential use of 35 selected
hyperspectral vegetation indices (HVI), collected at the R5 growth stage, for predicting
soybean seed yield and FBIO. Two artificial intelligence algorithms, ensemble-bagging
(EB) and deep neural network (DNN), were used to predict soybean seed yield and FBIO
using HVI. Considering HVI as input variables, the coefficient of determination (R2) of
0.76 and 0.77 for yield and 0.91 and 0.89 for FBIO was obtained using DNN and EB,
respectively. In this study, we also used hybrid DNN-SPEA2 to estimate the optimum HVI
values in soybeans with maximized yield and FBIO productions. In addition, to identify
the most informative HVI in predicting yield and FBIO, the feature recursive elimination
method was used, and the top ranking HVI were determined to be associated with red,
670 nm, and near-infrared, 800 nm, regions. Overall, this study introduced hybrid DNN-
SPEA2 as a robust mathematical tool for optimizing and using informative HVI for
estimating soybean seed yield and FBIO at early growth stages, which can be employed
by soybean breeders for dis-criminating superior genotypes in large breeding populations.
90
2.2 Introduction
Soybean [Glycine max (L.) Merr.] is one of the most economically important crops in the
world that is used for food, feed, and industrial products (Colletti et al., 2020). Increasing
soybean yield has always been the main priority in breeding programs in order to keep
pace with the needs of a fast-growing global population (Dubey et al., 2019). In addition,
high attention has been paid to increase soybean fresh biomass (FBIO) due to its
biorefinery properties (De Pretto et al., 2018). It would be a significant investment for
farmers to make a profit not only from yield but also from FBIO. Therefore, the
simultaneous improvement in both yield and FBIO production in soybean seems to be
necessary to meet the various demands in the near future. Improving yield and biomass
that are considered as complex quantitative traits controlled by several genetic and
environmental factors (Collins et al., 2008) requires significant time and financial
investment in breeding programs. Pre-harvest prediction of soybean yield and FBIO will
enabled plant breeders to accurately select promising genotypes in large breeding
populations at early growth stage while reducing the cost and time in their cultivar
development programs (Rutkoski et al., 2016b).
Due to recent advances in the high throughput, non-destructive phenotyping tools such
as hyperspectral reflectance, the prediction of complex traits is now available at low cost
in early growth stages (Araus et al., 2018). Pre-harvest soybean yield and FBIO prediction
91
are not only important for global food security and industrial policy making but also for
field management practices (Pantazi et al., 2016; Maimaitijiang et al., 2020). Different
approaches have been applied for crop yield and FBIO prediction, such as crop modeling
and hyperspectral reflectance (Schut et al., 2018; Maimaitijiang et al., 2020).
Proximal and remote sensing of spectral reflectance are known as non-destructive, non-
invasive phenotyping approaches for measuring the spectral properties of plants at low
cost in large breeding populations (Araus and Cairns, 2014; Rutkoski et al., 2016b;
Montesinos-López et al., 2017b; Tardieu et al., 2017). These methods are considered
proximal when the spectral properties are collected in crops’ proximities (Araus and
Cairns, 2014). The use of proximal/remote sensing for discriminating among genotypes
with different potential is well documented in different crops such as wheat (Montesinos-
López et al., 2017b), alfalfa (Feng et al., 2020b), rice (Zha et al., 2020), sorghum (Lobell
et al., 2020), and soybean (Yoosefzadeh-Najafabadi et al., 2021b). The use of spectral
properties of crops for predicting complex traits is genotype-specific and depends on the
nature of the complex trait (Kycko et al., 2018; Schweiger et al., 2018). Therefore, it is
necessary to evaluate and specify the possibilities of using spectral properties to predict
complex traits for each plant species (Yoosefzadeh-Najafabadi et al., 2021b).
Several vegetation indices are generated from the spectral reflectance and commonly
used for estimating important agronomic traits such as plant height, biomass, resistance
to different diseases, and yield in different plant species, including wheat (Hassan et al.,
92
2018), barley (Bendig et al., 2014), cotton (Duan et al., 2017), and sorghum (Watanabe
et al., 2017). The normalized difference vegetation index (NDVI), for example, is known
as an important spectral vegetation index, which is broadly used for monitoring seasonal
growth developments in plants and estimating canopy biomass, photosynthesis rate, leaf
area index, and yield formation (Jolly et al., 2005; Filippa et al., 2018b). Depending on
bands used for calculating NDVI, different types of NDVI have been established (Feng et
al., 2020b). The spectral vegetation indices are generally species-specific and site-
specific. Therefore, one of the major obstacles in their applications is their low prediction
powers on morphological and physiological processes if they are not calibrated for the
target plant species (Araus et al., 2018). In addition, analyzing large hyperspectral
reflectance data sets requires intensive computational and statistical analyses, which is
still challenging in many plant breeding programs (Lopez-Cruz et al., 2020). To address
the latest issue, machine learning (ML) algorithms are considered as reliable and efficient
computational approaches for predicting complex traits (Nezami-Alanagh et al., 2018;
Feng et al., 2020b; Hesami et al., 2020d).
The functionality of ML algorithms is based upon developing empirical relationships
between the target and input variables (Zhang et al., 2019b; Feng et al., 2020b).
Therefore, several ML algorithms were implemented for predicting yield in many crops,
including artificial neural networks (Rodriguez-Galiano et al., 2015), support vector
regression (Gandhi et al., 2016), and random forests (Yoosefzadeh-Najafabadi et al.,
93
2021b). The successful use of ML algorithms has been documented in several plant
species such as chrysanthemum (Hesami et al., 2020d), alfalfa (Feng et al., 2020b),
wheat (Bhojani and Bhatt, 2020), soybean (Yoosefzadeh-Najafabadi et al., 2021b;
Yoosefzadeh-Najafabadi et al., 2021c), and maize (Jin et al., 2016).
As one of the important subsets of ML, deep learning (DL) can automatically learn from
a large hierarchical representation of the data by using complex non-linear functions that
are trained from the outputs of the previous layers (You et al., 2017; Maimaitijiang et al.,
2020). Convolutional neural networks (CNN), deep neural networks (DNN), and long-
short term memory (LSTM) are the three most commonly used DL methods that are
broadly used in remote sensing (Benedetti et al., 2018), medical applications (Shen et al.,
2017), and human activity recognition (Ravi et al., 2016). Although few studies have
exploited DL for crop yield prediction (Yu et al., 2016; Maimaitijiang et al., 2020), the
implication of hyperspectral vegetation indices (HVI) for predicting soybean yield and
FBIO has not been exploited.
Using only one ML algorithm, especially when the training dataset is small and limited,
may result in some level of overfitting in the ultimate prediction (Ali et al., 2014; Feng et
al., 2020b). In order to address this problem and improve the prediction performance of
individual ML algorithms, different methods have been proposed (Ribeiro and dos Santos
Coelho, 2020). One of these effective methods is the ensemble method, in which the
prediction performance of different ML algorithms is combined to estimate the final
94
prediction (Dietterich, 2000). The principle of ensemble methods is built upon the divide-
and-conquer paradigm (Ribeiro and dos Santos Coelho, 2020). This idea is usually
implemented to find an optimal solution to the proposed problem (Divina et al., 2018). The
successful application of ensemble methods to predict complex traits using related traits
is well documented in maize (Jin et al., 2016), alfalfa (Feng et al., 2020b), and sugarcane
(Fernandes et al., 2017). The three most commonly used ensemble algorithms are
boosting, bagging, and stacking (Dietterich, 2000; Feng et al., 2020b). Bagging was first
introduced by Breiman (1996) as a variance reduction method for several ML algorithms
such as radial basis function and random forest (Galar et al., 2011). The bagging method
is based on the parallel ensemble algorithm, while boosting was introduced by Schapire
(1999) as a sequential ensemble model of base algorithms by using the dependencies of
each ML algorithm (Dietterich, 2000; Feng et al., 2020b). Although boosting and bagging
methods are commonly used for the combined homogeneous ML algorithms, stacking
methods are mostly exploited for heterogeneous ML algorithms and adjust the difference
between algorithms to increase the ultimate prediction accuracy (Dietterich, 2000; Zhou,
2012; Feng et al., 2020b).
In addition to the use of ensemble and DL methods for predicting yield and FBIO at early
growth stages using HVI, it would be valuable to determine the optimum values of the
HVI in genotypes with maximized yield and FBIO production. Therefore, the
implementation of optimization algorithms can be beneficial to cultivar development
95
programs. The successful use of optimization algorithms was explored in some areas of
plant science such as plant tissue cultures (Hesami et al., 2019b; Hesami and Jones,
2020a) but not yet in the field of plant breeding. In order to concurrently maximize two or
more traits, multi-objective evolutionary optimization algorithms can be employed for
finding the optimal level of inputs for different objective functions (Dao and Guo, 2020b).
One of the most commonly used multi-objective optimization algorithms is the improved
version of the strength Pareto evolutionary algorithm 2 (SPEA2) (Zhao et al., 2016). To
the best of our knowledge, this study is the first report of using the SPEA2 algorithm for
optimizing HVI that can be used for selecting soybean genotypes with high yield and FBIO
production potential within a breeding program. The main objectives of this study were:
(1) investigating the potential benefit of using HVI for predicting soybean yield and FBIO;
(2) the comparative analysis of the EB strategy and DNN algorithm to improve yield and
FBIO prediction accuracies; and (3) introducing DNN-SPEA2 algorithms as a reliable tool
for optimizing the HVI values associated with soybean genotypes with maximized yield
and FBIO production. The outcome of this study can provide soybean breeders with new
effective methods for selecting genotypes with high yield potential and FBIO production
at early growth stages in large breeding populations.
96
2.3 Materials and Methods
2.3.1 Plant Material
In this study, phenotypic data were collected using 250 soybean genotypes, which are
the core germplasms of the soybean breeding program at the University of Guelph,
Ridgetown, and are used for sustainable cultivar development and genetic studies. The
ID, name and pedigree of the tested genotypes were presented in Table 2.1.
2.3.2 Test sites and experimental designs
The soybean genotypes were cultivated under the field condition at Ridgetown
(42°27'14.8"N 81°52'48.0"W, 200m above sea level) and Palmyra (42°25'50.1"N
81°45'06.9"W, 195 m above sea level) in 2018 and 2019 in Ontario, Canada. The
experiments were conducted using randomized complete block designs (RCBD) with two
replications in four environments (two locations and two years), consisting of 2000
research plots in total. Each plot consisted of five rows, 4.2 m long each and 40 cm
spacing between each row, and the seedling rate was 50-57 seeds per m2. All phenotypic
data were adjusted using nearest neighbor analysis (NNA) to remove the possible spatial
variation in the field (Stroup and Mulitze, 1991; Bowley, 1999; Katsileros et al., 2015a).
97
2.3.3 Data acquisition
2.3.3.1 Field data collection
Seed yield (kg ha-1) for each plot was measured by harvesting three middle rows and
adjusted to 13% moisture and maturity data. The above-ground biomass samples (g m-2)
for each plot were collected from the first and last rows at the beginning seed (R5) growth
stage (Fehr et al., 1971) in 1.0 m sections per row that were randomly selected among
designated sections. The best linear unbiased prediction (BLUP) was implemented in
order to estimate the average phenotyping plots of maturity and yield for each soybean
genotype (Goldberger, 1962).
2.3.3.2 Hyperspectral reflectance data collection
In each plot, the hyperspectral reflectance properties were acquired at the R5 growth
stage using UniSpec-DC Spectral Analysis System (PP Systems International, Inc. 110
Haverhill Road, Suite 301 Amesbury, MA 01913 USA). The device covers 250 reflectance
bands ranging between 350–1100 nm with a bandwidth of 3 nm. The sensors were able
to cover a sample area of 0.25 m2 due to their 25° field-of-view properties. The “proximal
sensing” method was used for measuring hyperspectral reflectance due to the proximity
of the sensor to the canopy. For calibrating and reducing the incoming solar radiation, the
dark reference and the Spectralon panels were used for the dual and basal channels,
respectively. Also, all hyperspectral reflectance data were collected as close to solar noon
98
as possible for minimizing the signal noise associated with the environment. The
hyperspectral reflectance of each location measured at one day with one device.
2.3.4 Data pre-processing and statistical analyses
Several random noises occurring across the whole spectra can lead to misinterpretation
of the results. This is mainly caused by electronic and sensor fluctuations (Ozaki et al.,
2006). Therefore, three hyperspectral reflectance measurements were conducted for
each plot at one spot in order to reduce the possible noises, and the BLUP values were
calculated for hyperspectral reflectance bands. Also, hyperspectral reflectance data at
the upper (1005-1100 nm) and lower (350-395 nm) edges of the spectra were eliminated
from the original data because of the missing data and noises. Data centering, scaling,
and a Savitzky–Golay filter were applied for improving spectral properties and signal-to-
noise ratio (Savitzky and Golay, 1964; Rossel, 2008). All the pre-processing procedures
were conducted using the R software, version 3.6.1.
Overall, the following statistical model was used in this study:
𝑌𝑖𝑗 = 𝜇 + 𝑓(𝑠) + 𝐺𝑖 + 𝐸𝑗 + 𝐺𝐸𝑖𝑗 + 𝜀𝑖𝑗 , 𝑖 = 1, … , 𝑘; 𝑗 = 1, … , 𝑛 (Eq 3.1)
Where Yij stands for the trait of interest (soybean seed yield, maturity, and hyperspectral
vegetation indices) as a function of an intercept μ, f(s) stands for the spatial covariate, Gi
99
is the random genotype effect, Ej stands for the fixed environment effect, GEij is the
genotype×environment interaction effect, and εij stands for the residual effect
2.3.5 Hyperspectral vegetation index (HVI) extraction
In order to reduce the hyperspectral data dependency, the vegetation indices were used
to predict the soybean yield and FBIO instead of using the original reflectance bands.
Therefore, 34 vegetation indices were selected based on the published papers (Table
3.1), each characterizing two spectral bands. The used vegetation indices were
categorized into the two most well-known indices, the normalized difference vegetation
index (NDVI) and the simple ratio index (SRI). In this study, the 34 vegetation indices
were considered as the input variables and yield or FBIO was tagged as the target
variable. The Pearson correlation coefficients among input and output variables was
evaluated in order to estimate the relevancy of each HVI to the soybean yield and FBIO.
2.3.6 Variable selection
One of the important approaches to determine the relationship between input and output
variables is sensitivity analysis (Huang et al., 2016). Several methods were developed
and used as sensitivity analysis in different field of studies such as predicting crash
frequency (Huang et al., 2016), crash injury severity prediction (Zeng and Huang, 2014),
computer science (Gal and Greenberg, 2012), and plant science (Yoosefzadeh-
Najafabadi et al., 2021a). Recently developed sensitivity analysis method of variable or
100
feature selection techniques are usually used to determine important input variables in
predicting given target traits. In this study, recursive feature elimination (RFE), as one of
the most common wrapper methods, was applied for estimating the importance of
selected HVI for predicting yield and FBIO. Due to its easy to configure property (Chen
and Jeong, 2007), RFE has the ability to effectively extract important variables in a short
time. More detail about the RFE can be found in Yoosefzadeh-Najafabadi et al. (2021b).
All the RFE analyses were carried out through the caret package (Kuhn, 2008) in R
software, version 3.6.1.
2.3.7 Yield and FBIO prediction model calibration and validation
2.3.7.1 Ensemble method (Ensemble-Bagging algorithm)
Three of the most common used ML algorithms, i.e., the radial basis function (RBF),
support vector regression (SVR), and random forest (RF), were selected as the base
learners for the ensemble method. RBF is a neural network algorithm, which applies
approximate multivariate functions to predict the output variable using multi-dimensional
input variables (Orr, 1996; Wilamowski and Jaeger, 1996). SVR is a regression version
of the support vector machine algorithm that uses hyperplanes to discover the optimal
separation of values in output variables (Vapnik, 1995). In SVR, the input and output
variables are transformed based on kernel functions from the original space to a support
vectors space, and the linear function is used to eliminate errors between the insensitive
101
loss function and the training data (Ye et al., 2009; Feng et al., 2020b). The third ML
algorithm used in this study was RF, which is based on generating a series of trees
representing a subset of independent observations (Liaw and Wiener, 2002), and
estimates the final prediction by averaging the production of all determined trees (Hesami
and Jones, 2020a). We implemented an EB based on the bagging strategy to improve
the prediction performance of individual algorithms as described by Ye et al. (2009). For
this aim, we combined all the prediction performance of the tested ML algorithms and
selected the one with the highest performance as the MetaClassifier for the EB. Each
parameter in the tested ML algorithms was tuned and adjusted based on the input/output
variables. All ML and EB analyses were conducted using the workbench for machine
learning (Weka) software version 3.9.4 (Hall et al., 2009).
2.3.7.2 Deep Neural Network (DNN)
Thanks to recent advances in the performance of complex computational procedures,
several DNN approaches have been introduced with multiple processing hidden layers
(Du and Xu, 2017). The DNN algorithm has the ability to learn the multivariate mapping
function between the output and input variables. The scheme of the used DNN algorithm
is illustrated in Figure 3.1. In this study, the multilayer perceptron (MLP) as one of the
feed-forward neural networks was used to predict the soybean yield and FBIO at an early
growth stage using HVI. The structure and function of MLP were explained in detail by
Pal and Mitra (1992b). In brief, MLP contains input, output, and hidden layers of M
102
interconnected neurons. Each layer in the MLP algorithm has full connections to the next
layer, so the output of layer M would be the input of the M +1 layer (Dao and Guo, 2020a).
In this study, optimizing MLP parameters was conducted using GUI editors in Weka
software version 3.9.4. In general, there were four hidden layers with 17 nodes in each
layer, a learning rate of 0.3 and a momentum of 0.2 (Figure 3.1).
2.3.8 Optimization process (SPEA2 algorithm)
Multi-objective optimization algorithms are usually implemented for two or more than two
outputs that sometimes are in conflict with each other. SPEA2 is a sophisticated
optimization algorithm that can be successfully implemented for finding the optimum
Pareto front solutions (Figure 3.2). In this study, based on the best algorithms for soybean
yield and FBIO predictions, we used SPEA2 to find the optimum levels of HVI for
maximizing yield and FBIO as objective functions. The main steps of SPEA2 were
summarized in Figure 3.2. To achieve the best results, population size along with three
of the most important parametric mechanisms in the SPEA2 algorithm (i.e., crossover,
selection, and mutation operations) were determined (Maheta and Dabhi, 2014). The
overall population size, generation number, uniform crossover, and mutation rate were
respectively set to 400, 1000, 0.8, and 0.6. The two-point crossover method was used to
select elite populations for crossover and obtain the appropriate fitness. More details
about the SPEA2 environmental selection methods and fitness assignment can be found
in (Zitzler et al., 2001).
103
2.3.9 Quantification of model performance and error estimations
In order to quantify the performance of EB and DNN for predicting soybean yield and
FBIO using HVI, the 250 soybean genotypes were randomly split into testing and training
sets by using five k-fold cross-validations (80-20% splits) with ten repetitions (Figure 3.3),
and root mean square error (RMSE, Eq 3.2), mean absolute errors (MAE, Eq 3.3), and
coefficient of determination (R2, Eq 3.4) were measured as follows:
𝑅𝑀𝑆𝐸 = √∑(Y′−𝑌)2
𝑛 (Eq 3.2)
𝑀𝐴𝐸 =∑ |Y′
𝑖−𝑌𝑖|𝑛𝑖=1
𝑛 (Eq 3.3)
𝑅2 =𝑆𝑆𝑇−𝑆𝑆𝐸
𝑆𝑆𝑇 (Eq 3.4)
where, Y′ and Y are the predicted and measured values, respectively, n is the
number of observations, SST stands for the sum of squares for total, and SSE stands for
the sum of the squares for error.
2.4 Results
2.4.1 Yield, FBIO, and HVI properties
In this study, the average FBIO of the 250 soybean genotypes grown in four environments
had the range of 1366.9 to 2548.5 g m-2 with a mean and standard deviation of 1948.9
104
and 336.36 g m-2, respectively. The average yield of the 250 soybean genotypes ranged
from 2.38 to 5.71 t ha-1 with a mean of 4.19 t ha-1 and a standard deviation of 0.58 t ha-1.
The distributions of soybean yield and FBIO production for each genotype across the four
environments are shown in Figure 3.4. In general, the variation in the yield was larger
than for the FBIO among soybean genotypes across the environments.
2.4.2 Correlation analysis of HVI vs. soybean yield and FBIO
The linear associations of HVI with yield and FBIO was measured using Pearson
correlation coefficients (Figure 3.5). The highest positive correlation was between yield
and maturity with a value of 0.94. All HVI showed significant correlations with both
soybean yield and fresh biomass. Among all the tested HVI, three of them had positive
correlations, and 31 HVI showed negative correlations with both soybean yield and FBIO.
While S14 showed the highest positive correlation (0.71) with yield, the highest negative
correlation was found to be between N5 and yield (-0.84). S1 showed the lowest positive
correlation with the soybean yield (0.60), and the lowest negative correlation was with N1
(-0.15). Regarding the correlation of HVI with soybean FBIO, S9 had the highest positive
correlation (0.75), and N5 had the highest negative correlation (-0.90). The lowest positive
and negative correlations with FBIO were found with S1 (0.62) and N1 (-0.18). Since all
the tested HVI were significantly correlated with soybean yield and FBIO, all HVI was
used as the input variables in EB and DNN algorithms.
105
2.4.3 Comparative analysis of the EB and DNN algorithms
One of the objectives of this study was to compare the efficacy of EB and DNN algorithms
for predicting soybean yield and FBIO from HVI data collected at the R5 growth stage.
Among all the tested ML algorithms used for establishing the EB method, the SVR
algorithm outperformed the other algorithms (Table 3.2 and 3.3) and, therefore, was
selected as the MetaClassifier for the EB strategy. The results of this study showed that
the EB method had a slightly, but not significantly, higher R2 values (0.77) over DNN
(0.76) in predicting yield from HVI (Figure 3.6-A), with the lowest RMSE and MAE values
of 224.97 and 149.28 kg ha-1, respectively (Figure 3.6-B, C). Regarding the FBIO
prediction results, the highest R2 was obtained using DNN with the value of 0.91 (Figure
3.6-D). The DNN algorithm also had the lowest RMSE and MAE values of 102.66 and
80.09 g m-2, respectively (Figure 3.6-E, F).
2.4.4 Variable Selection
In this study, RFE as one of the common variable selection methods was used in order
to select the top 10 important HVI in predicting the yield and FBIO. The results (Figure
3.7) showed that N5 had the highest importance value in predicting both yield (11.24) and
FBIO (12.49), followed by the RNDVI, S8, N2, and S6 indices as the most important HVI
in explaining soybean yield and FBIO. While the least important HVI for predicting yield
was S1 (0.014), N10 (0.70) was the least important HVI for predicting FBIO. The
106
prediction accuracy of both EB and DNN for predicting the soybean seed yield and FBIO
using the selected 10 HVI was found not to be significantly lower than what were obtained
using all the 35 HVI (data was not shown).
2.4.5 The DNN-SPEA2 optimization algorithm
The DNN algorithm, with the highest prediction accuracies for predicting both soybean
yield and FBIO from HVI, was selected and linked to the SPEA2 optimization algorithm
for estimating the optimized values of HVI in soybean genotypes with maximized yield,
4445.2 kg ha-1, and FBIO, 2254.8 g m-2 (Figure 3.8).
2.5 Discussion
Several studies have reported significant associations between HVI and the two important
agronomic traits of yield and FBIO production in several crop species such as sorghum
(Galli et al., 2020), corn (Mladenova et al., 2017), rice (Zhou et al., 2017), potato (Bala
and Islam, 2009), and wheat (Aparicio et al., 2000). In this study, we also confirmed
significant correlations between HVI and both yield and FBIO by evaluating 250 soybean
genotypes across four environments. All the tested HVI was extracted based on the
previous studies and their correlations with the tested variables (yield and FBIO). Also,
we attempted to select the tested HVI from different spectral regions (e.g., blue, green,
red, red-edge, and near-infrared regions) to investigate the potential use of different
107
spectral regions in predicting soybean seed yield and FBIO. Among all the tested HVI,
N5 had the highest negative correlation with both yield and FBIO. The N5 index consists
of two specific wavelengths, 800 nm and 670 nm (Tucker, 1979). The 670 nm band
located in the red region of the spectral was previously introduced as one of the most
important reflectance bands in agricultural studies (Ollinger, 2011; Mariotto et al., 2013;
Thenkabail et al., 2013; Hennessy et al., 2020). This reflectance band has shown strong
correlation with the chlorophyll contents in plants, which is considered as one of the most
pivotal components in determining the ultimate yield and FBIO production (Blackburn,
2007). The 800 nm band is placed within the near-infrared region of the spectrum (G.
Jones et al., 2010; Hennessy et al., 2020) and has been previously reported to be involved
in characterizing the structure of the leaves in plants. More specifically, this band is
associated with different levels of liquid water in the inter-cellular spaces of leaves
(Knipling, 1970). Changes in the balance of water and air in the inter-cellular space in
leaves have significant impact on ultimate yield and FBIO production (CABRERA‐
BOSQUET et al., 2011). Based on our previous report (Yoosefzadeh-Najafabadi et al.,
2021b), the red region, which is ranged mostly from 655 to 675 nm, had the highest
association with the soybean yield formation. In this study, we found the importance of
red wavelength in predicting FBIO as well. Therefore, N5 can be considered as one of
the most informative indices for predicting the final yield and FBIO production in soybean,
hence, for discriminating genotypes with different genetic potential for the two traits.
108
Among all the tested HVI, the S14 and S9 indices had the highest positive correlations
with yield and FBIO, respectively. The S14 index, introduced by Mutanga and Skidmore
(2004), has consisted of only red-edge reflectance bands. The S9 index, which was
proposed by Chappelle et al. (1992), is built upon both red and red-edge reflectance
regions. The red-edge reflectance bands have been reported to have strong correlations
with the chlorophyll contents and used as an explanatory indicator for senescence and
stress in plants (Gholizadeh et al., 2016; Hennessy et al., 2020). Several studies reported
the efficiency of using red and red-edge regions in discriminating plant characteristics
such as yield (Mehdaoui and Anane, 2020), biomass (Guerini Filho et al., 2020), and
disease (Fernández et al., 2020).
The potential yield and FBIO production of a given genotype can be estimated by the
level of chlorophyll and the senescence rate, especially at the later development stages
(Babar et al., 2006). In this study, the hyperspectral reflectance data were collected at the
R5 growth and development stage, in which pods and seeds are developing, and plants
have initiated the senescence from the lower leaves. These special properties of the two
indices can explain their high positive correlations with yield and FBIO. Also, RFE analysis
indicated the high importance score for N5 in predicting soybean yield and FBIO. N5
consisted of two reflectance bands, 670 and 800 nm, which were placed in red and near-
infrared regions, respectively. A previous study also obtained the high importance score
for the near-infrared regions and red regions in predicting soybean yield classes,
109
specifically the high yielding class (Yoosefzadeh-Najafabadi et al., 2021a). Here, the
efficiency of using these two reflectance bands in predicting soybean yield and FBIO was
verified as single hyperspectral vegetation index.
In general, qualitative traits can be analyzed using conventional statistical methods
(Homack, 2001; Rutherford, 2001); however, estimating complex quantitative traits such
as yield and FBIO from HVI data sets in large breeding populations require more
sophisticated statistical methods such as machine or deep learning algorithms (Vapnik,
2013). Recently, the use of ML algorithms was successfully reported in many crop
species such as soybean (Yoosefzadeh-Najafabadi et al., 2021b), wheat (Montesinos-
López et al., 2017a), chrysanthemum (Hesami et al., 2019b), alfalfa (Feng et al., 2020b),
and rice (Gandhi et al., 2016). However, in some cases, the potential issue of overfitting
through using individual ML algorithms is inevitable (Yeom et al., 2020). To overcome this
shortcoming, different ensemble algorithms have been introduced, in which the outcomes
of different ML algorithms are combined for better predictive performance (Yang et al.,
2010). In this study, the RBF, SVR, and RF algorithms were used as individual ML
algorithms to construct the ensemble strategy based on the bagging principles, and SVR
with the highest accuracy was selected as the MetaClassifier.
The superior accuracy performance of SVR, compared to other ML algorithms tested in
this study, can be due to the use of structural than empirical risk minimization inductive
principles (Belayneh et al., 2014; Hesami et al., 2020d).
110
In addition to EB, the DNN algorithm was also used to predict soybean yield and FBIO
production using HVI. Recently, DNN algorithms have attracted the attention from several
researchers in different areas such as object detection (Mahalingam and Subramoniam,
2020), image processing (Nigam et al., 2020), disease detection (Kukana, 2020), and
various engineering areas (Alam and Mukhopadhyay, 2019). The DNN algorithms are
commonly used in non-linear problems because they are easy to implement and can
overcome the adverse effect of data dimensionality (Goodfellow et al., 2016). Although
there are some ambiguities around considering MLP as a type of DNN, several studies
reported MLP as a subset of DNN (Bisong, 2019; Ge et al., 2019; Dao and Guo, 2020a).
In this study, the DNN algorithm outperformed the EB algorithm since DNN has the ability
to learn the weights between nodes through unsupervised learning instead of generating
weights randomly in the initialization stage (Makantasis et al., 2015). After that, DNN tries
to adjust weights during the learning process. This property can improve the prediction
accuracy of DNN in compared to other ML algorithms.
One of the most important objectives for plant breeders is to improve the discrimination
ability among plant genotypes with different yield and FBIO potentials in large breeding
populations, directly or indirectly (Slinkard et al., 2000). Direct selection methods are
focused on selecting the superior genotypes based on the average performance of the
target trait across many locations and over several years, while indirect selections are
based upon selecting for high heritable secondary traits that are strongly associated with
111
the target trait (Jin et al., 2010). Therefore, efficient plant breeding programs are
established based on combined direct and indirect selection strategies in order to
maximize the accuracy of the selections (Ma et al., 2001; Jin et al., 2010; Montesinos-
López et al., 2017b).
In order to move toward efficient breeding, it is important to have an estimation of the
optimized level of each secondary trait in superior genotypes. This goal can be
accomplished by using optimization algorithms such as the improved strength Pareto
evolutionary algorithm (SPEA2). SPEA2 is used for multi-objective optimization problems
to figure out the Pareto-optimal set (Zitzler et al., 2001). In this study, the DNN as the
superior algorithm in predicting the soybean yield and FBIO production was linked to
SPEA2 to optimize the HVI for maximizing yield and FBIO production (Figure 3.8). The
successful use of SPEA2 was reported in different areas such as engineering (Shigang
and Qian, 2007), material science (Zaloga et al., 2020), and economics (King et al., 2010),
but not in plant science.
To the best of our knowledge, this is the first report of using SPEA2 in the plant science
area. The advantages of SPEA2, over other multi-objective optimization algorithms, is
based on its improved fitness assignment scheme that results in understanding of the
number of individuals that dominant or are dominated by others in the data sets (Zitzler
et al., 2001). This technique empowers SPEA2 to optimize all HVI, precisely, in soybean
genotypes with maximized yield and FBIO production. By having this information,
112
soybean breeders may be able to design the “ideotype” genotypes with optimum values
for hyperspectral vegetation indices, which are potentially high seed yield and FBIO
genotypes. This approach may be used in large breeding populations, whereas selection
strategies are based on the latter criteria.
113
Table 2.1 Summary of the selected hyperspectral vegetation indices (HVI) used in this study.
Index Category Abbreviation Formula Reference
Normalized difference vegetation index (NDVI)
N1 [584nm – 471nm]/[584nm + 471nm]
(Van Der Meij et al., 2017)
N2 [689nm − 521nm]/[689nm + 521nm]
(Van Der Meij et al., 2017)
N3 [760nm − 550nm]/[760nm + 550nm]
(Zhao et al., 2015)
N4 [740nm − 667nm]/[740nm + 667nm]
(Yu et al., 2014)
N5 [800nm − 670nm]/[800nm + 670nm]
(Tucker, 1979)
N6 [750nm − 705nm]/[750nm + 705nm]
(Gitelson and Merzlyak, 1994)
N7 [750nm − 710nm]/[750nm + 710nm]
(Wu et al., 2009)
N8 [780nm − 710nm]/[780nm + 710nm]
(Kooistra et al., 2013)
N9 [750nm − 710nm]/[750nm + 710nm]
(Mutanga and Skidmore, 2004)
N10 [732nm − 717nm]/[732nm + 717nm]
(Mutanga and Skidmore, 2004)
N11 [820nm − 720nm]/[820nm + 720nm]
(Thenkabail et al., 2000)
N12 [750nm − 735nm]/[750nm + 734nm]
(Mutanga and Skidmore, 2004)
Normalized difference red edge (NDRE)
NDRE [790nm − 720nm]/[790nm + 720nm]
(Rodriguez et al., 2006)
Green normalized difference vegetation index (GNDVI)
GNDVI [750nm − 550nm]/[750nm + 550nm]
(Gitelson et al., 1996)
Renormalized difference vegetation index (RDVI)
RDVI [800nm – 670nm]/[800nm + 670nm]
(Roujean and Breon, 1995)
Simple ratio index (SRI)
S1 [565nm/533nm] (Tian et al., 2011a) S2 [750nm/550nm] (Datt, 1999) S3 [760nm/550nm] (Zhao et al., 2015) S4 [810nm/560nm] (Xue et al., 2004) S5 [734nm/629nm] (Yu et al., 2014) S6 [810nm/660nm] (Zhu et al., 2008) S7 [700nm/670nm] (McMurtrey Iii et al.,
1994) S8 [800nm/670nm] (Chen et al., 2010) S9 [675nm/700nm] (Chappelle et al.,
1992) S10 [800nm/680nm] (Sims and Gamon,
2002) S11 [752nm/690nm] (Datt, 1999)
114
S12 [750nm/700nm] (Datt, 1999) S13 [750nm/705nm] (Gitelson and
Merzlyak, 1994) S14 [706nm/755nm] (Mutanga and
Skidmore, 2004) S15 [747nm/708nm] (Gupta et al., 2003) S16 [750nm/710nm] (Zarco‐Tejada and
Miller, 1999) S17 [741nm/717nm] (Gupta et al., 2003) S18 [735nm/720nm] (Gupta et al., 2003) S19 [738nm/720nm] (Gupta et al., 2003)
115
Table 2.2 Analysis performance of random forest (RF), radial basis function (RBF), and support vector regression (SVR) algorithms for soybean yield prediction using yield component traits.
Algorithm MAE (Kg ha-1) RMSE (Kg ha-1) Coefficient of determination (R2)
RF 194.4164392 264.3694044 0.851041562
RF 153.6659487 225.8010471 0.868714008
RF 196.0085792 255.6093528 0.825276419
RF 128.2046264 162.9308531 0.898963613
RF 190.9728488 292.831912 0.771887352
RF 161.2443396 206.7116935 0.880600038
RF 237.3928162 378.5903199 0.786487711
RF 174.5036521 230.900251 0.870352877
RF 167.7280337 228.6508868 0.821401288
RF 156.1771348 219.7076022 0.8658249
RF 152.5430389 228.1215753 0.852017326
RF 195.607481 266.2181885 0.821569354
RF 196.9190055 324.9220828 0.81987753
RF 199.3637476 255.9968455 0.820151912
RF 176.6348179 249.9933007 0.807212658
RF 164.6068529 233.8554832 0.841945295
RF 197.7039414 253.489436 0.801026804
RF 176.9438476 297.3525796 0.795019998
RF 162.9480293 262.3442671 0.860342859
RF 149.6396158 189.241463 0.884526328
RF 165.5872226 210.4387478 0.841994732
116
RF 214.2586298 331.5895186 0.777967825
RF 140.8431341 193.2645868 0.917992395
RF 177.3601749 246.9138108 0.833648229
RF 187.8457237 258.4256551 0.820404094
RF 173.4347837 225.0137528 0.885411206
RF 172.26188 246.0583511 0.853459052
RF 137.8932001 181.087718 0.865382502
RF 144.5719989 216.6240212 0.878340108
RF 230.0856937 345.8972946 0.767554612
RF 203.8437002 288.4935979 0.820017696
RF 159.8103711 204.2963386 0.882240097
RF 147.2445154 208.7225649 0.868152667
RF 190.816143 316.930276 0.801753073
RF 159.0294133 199.275656 0.851906349
RF 187.0878757 254.7363732 0.791581404
RF 169.7205721 236.6627781 0.80833428
RF 150.2662134 186.5224665 0.889374882
RF 169.5110898 234.2206669 0.871887243
RF 211.0291434 321.450775 0.795012935
RF 147.178802 208.1800056 0.846933853
RF 160.1442493 213.4685974 0.881260499
RF 158.9773195 216.8433078 0.88888212
RF 176.2874827 254.0229866 0.798749809
RF 212.6379021 309.14105 0.816370719
117
RF 191.7389196 258.0168978 0.84423525
RF 185.521718 298.2415012 0.769271068
RF 163.5115468 237.5566316 0.827595832
RF 149.2701387 195.7500659 0.869894049
RF 159.155492 230.8599248 0.878819171
RBF 166.482403 228.1225591 0.895936303
RBF 141.8539799 209.7292324 0.88755794
RBF 177.4770866 244.498738 0.82952697
RBF 124.4508702 159.3585312 0.89595357
RBF 171.7851253 284.0133451 0.786188535
RBF 151.7472256 199.2448359 0.889020262
RBF 191.3819275 332.3332549 0.843649759
RBF 156.3124725 217.4039771 0.875875273
RBF 151.2263493 194.6464191 0.870931493
RBF 161.1456794 250.2351497 0.819443301
RBF 140.899325 199.494637 0.891653504
RBF 166.5171916 237.9066547 0.861616795
RBF 174.9630045 298.9875977 0.865127154
RBF 161.806343 204.3550293 0.894761192
RBF 179.6671577 265.7072244 0.773685629
RBF 134.2890418 185.6078136 0.898262776
RBF 168.361031 213.8865503 0.855724035
RBF 158.4137791 264.0797234 0.841063207
RBF 184.1080335 319.0615869 0.781863737
118
RBF 138.1787432 172.1319603 0.906035473
RBF 152.0514538 193.1760534 0.867119796
RBF 194.512439 322.1013565 0.794374653
RBF 120.8858254 177.1896796 0.939188454
RBF 162.7073572 219.3045294 0.867315056
RBF 148.6239503 193.105992 0.899124255
RBF 173.313671 258.5462006 0.846279147
RBF 158.3618457 216.1436186 0.863422465
RBF 127.727618 171.1289184 0.880682997
RBF 138.5087609 204.2425557 0.892393716
RBF 196.2541738 313.6464364 0.811598164
RBF 205.8598625 302.3590182 0.788980354
RBF 145.7107933 184.4297027 0.891108606
RBF 136.1508379 183.6820335 0.899823637
RBF 160.0521048 285.9294767 0.855288111
RBF 156.3711753 205.2928041 0.846959883
RBF 170.555398 232.3567345 0.818863266
RBF 147.6561911 193.3014557 0.876408845
RBF 137.9913619 184.7867974 0.898873668
RBF 155.1976974 236.2798182 0.87229602
RBF 213.0368675 323.5050865 0.792917032
RBF 144.3900282 232.0029736 0.809041754
RBF 156.9787925 207.3934681 0.885506922
RBF 152.2256617 215.5828163 0.886302241
119
RBF 162.1911342 232.7485089 0.833034657
RBF 183.7689879 284.8704314 0.858339365
RBF 170.7591561 266.4940022 0.844392614
RBF 177.8848472 294.979878 0.776937776
RBF 152.0130877 203.8784441 0.867692657
RBF 130.3557598 184.9497011 0.883528797
RBF 149.8517638 207.0613214 0.907388731
SVR 169.3193869 254.1831811 0.885709051
SVR 143.898444 240.3086238 0.864607456
SVR 142.4505225 206.9095853 0.87389928
SVR 114.4171545 144.973302 0.924872394
SVR 166.8015806 284.2797896 0.797220479
SVR 153.5666945 216.5995898 0.876691842
SVR 193.1655022 370.7254871 0.827227033
SVR 145.0688545 195.9714909 0.88310517
SVR 136.4520851 156.2586138 0.900808865
SVR 121.9551875 188.203529 0.922553602
SVR 124.3239525 167.8495522 0.90322418
SVR 156.5492272 248.6340724 0.863947409
SVR 195.5975833 337.4621718 0.822087889
SVR 143.1263626 189.1334368 0.919921051
SVR 135.6829301 199.0177349 0.870698029
SVR 136.1574023 198.3776151 0.891903584
SVR 141.7515528 187.6667512 0.889041454
120
SVR 158.7693144 280.4770581 0.827082577
SVR 177.7295967 305.6477374 0.817503223
SVR 133.7576242 173.0237627 0.909123355
SVR 139.1392941 184.7319337 0.853130677
SVR 176.753526 322.1107671 0.820168626
SVR 147.404754 223.3131383 0.920919747
SVR 131.2783698 207.6305114 0.885816755
SVR 142.6487192 200.9875299 0.904211457
SVR 162.7484641 234.0043982 0.889078699
SVR 132.0442351 173.6694425 0.893865214
SVR 108.1486884 133.1067669 0.919326768
SVR 146.1375347 243.512773 0.857081408
SVR 183.566601 330.771371 0.820460837
SVR 170.727708 270.4687548 0.846223367
SVR 139.5506019 171.9184397 0.902174143
SVR 137.2469409 189.3853443 0.911677959
SVR 164.9731378 312.6570646 0.836422846
SVR 126.275367 189.2158338 0.866987442
SVR 155.070386 232.0514709 0.801329204
SVR 111.3052048 164.7640341 0.918301307
SVR 127.1567767 169.4266582 0.912551468
SVR 145.8161745 214.4679096 0.924900433
SVR 194.4827625 333.1412181 0.803457764
SVR 131.0953209 193.3740054 0.870683683
121
SVR 142.4606988 189.964346 0.902140386
SVR 160.7865427 251.4286331 0.869249999
SVR 131.2447093 212.147326 0.842390416
SVR 175.7521504 300.192103 0.861305686
SVR 152.4749077 232.9133883 0.902423577
SVR 158.2806367 279.7614942 0.808935637
SVR 145.2068802 198.8333128 0.877670785
SVR 134.5474211 210.7963702 0.853422246
SVR 140.6517059 225.3313258 0.903992522
Coefficient of determination (R2), the Root Mean Square Error (RMSE) and the Mean Absolute Errors
(MAE)
122
Table 2.3 Analysis performance of random forest (RF), radial basis function (RBF), and support vector regression (SVR) algorithms for soybean fresh biomass (FBIO) prediction using yield component traits.
Algorithm MAE (g m-2) RMSE (g m-2) Coefficient of determination (R2)
RF 119.7975 148.1414 0.916044
RF 110.2672 143.6928 0.898237
RF 106.1645 128.2284 0.921452
RF 100.1349 122.8048 0.921412
RF 114.0916 149.9142 0.892765
RF 107.7502 133.5622 0.919392
RF 136.5886 169.8279 0.876284
RF 100.7597 119.4719 0.942066
RF 117.6463 151.569 0.884445
RF 111.8181 141.0908 0.912701
RF 111.1065 139.9799 0.909079
RF 136.6951 171.3069 0.855634
RF 104.9303 133.872 0.926967
RF 118.5584 143.4742 0.906125
RF 111.8948 147.6642 0.899771
RF 102.8167 133.2284 0.916527
RF 136.285 164.3982 0.871049
RF 110.7273 155.2275 0.897836
RF 105.2721 125.9429 0.930583
RF 99.7785 125.3235 0.9298
RF 107.0931 137.8202 0.902705
123
RF 108.8778 123.8896 0.929514
RF 100.4996 125.6673 0.935775
RF 111.8781 145.5287 0.905015
RF 127.3078 167.4374 0.876435
RF 116.4748 142.3661 0.915979
RF 110.5773 147.9834 0.901971
RF 107.1008 135.1511 0.900116
RF 95.47278 130.8587 0.921422
RF 119.3775 140.1423 0.922016
RF 115.9131 146.6371 0.905605
RF 99.82979 131.4242 0.922448
RF 103.365 131.1631 0.928199
RF 134.2205 162.0426 0.879752
RF 110.6957 140.9505 0.895337
RF 115.0112 151.9779 0.87949
RF 117.3989 158.8426 0.87588
RF 114.9733 148.3849 0.902559
RF 110.7049 129.3245 0.939484
RF 107.2785 123.5176 0.928689
RF 93.96014 116.238 0.92401
RF 107.2756 137.4207 0.925565
RF 100.7429 125.7191 0.934332
RF 106.1249 127.1983 0.906393
RF 139.2655 178.2483 0.874327
124
RF 119.7131 161.2743 0.893249
RF 111.7995 145.2406 0.902386
RF 113.734 146.0043 0.896196
RF 107.7184 132.2833 0.900077
RF 105.6949 122.2875 0.947621
RBF 93.41103 111.7797 0.955225
RBF 85.39414 105.082 0.944757
RBF 90.18155 117.7106 0.934235
RBF 96.75184 113.9804 0.932339
RBF 101.2754 124.0534 0.928459
RBF 92.50441 107.5343 0.94863
RBF 92.11284 112.781 0.944444
RBF 91.78491 120.0431 0.940552
RBF 111.0718 129.6541 0.917957
RBF 91.79651 116.6371 0.944356
RBF 89.14871 109.7135 0.945941
RBF 103.2617 126.511 0.930062
RBF 85.02968 102.755 0.959404
RBF 94.90637 109.8189 0.951969
RBF 108.4197 137.7227 0.908208
RBF 91.10199 105.8473 0.948014
RBF 102.8248 121.6717 0.929701
RBF 92.63952 115.423 0.942842
RBF 101.0645 135.147 0.919571
125
RBF 89.66994 108.9033 0.94932
RBF 102.3873 126.6338 0.917349
RBF 99.13121 123.9217 0.930152
RBF 82.53234 96.3069 0.968263
RBF 88.80952 113.4791 0.942513
RBF 100.7627 120.4183 0.93747
RBF 107.001 132.0483 0.929261
RBF 96.08942 120.468 0.931921
RBF 91.81762 107.4533 0.936132
RBF 90.18858 115.8784 0.938099
RBF 82.88174 99.21514 0.962078
RBF 101.9572 132.5091 0.918576
RBF 92.25789 112.8294 0.940787
RBF 83.43907 98.93 0.959095
RBF 91.12766 107.1789 0.950698
RBF 100.9256 124.7097 0.919113
RBF 112.9048 134.8489 0.905831
RBF 91.7331 114.089 0.936687
RBF 86.9929 101.501 0.95771
RBF 96.0831 119.6358 0.948123
RBF 94.80073 113.9383 0.940324
RBF 94.31312 118.9435 0.919788
RBF 90.85541 109.2606 0.955173
RBF 93.25251 119.6226 0.940368
126
RBF 92.96258 110.9872 0.928293
RBF 106.8059 128.2234 0.941275
RBF 97.1596 131.7042 0.931538
RBF 91.41567 111.6411 0.942668
RBF 98.10604 118.6544 0.931507
RBF 94.0315 111.2507 0.928674
RBF 86.83321 101.8235 0.963055
SVR 94.51651 111.0089 0.956204
SVR 85.07574 102.711 0.948514
SVR 94.43239 114.0867 0.944061
SVR 93.3968 106.9398 0.945375
SVR 103.4971 119.4833 0.934489
SVR 101.4538 116.8256 0.939164
SVR 86.59942 103.3951 0.960111
SVR 98.85259 121.1884 0.939179
SVR 112.0184 129.8826 0.919976
SVR 84.17188 98.64961 0.962204
SVR 92.37943 114.9125 0.941388
SVR 97.06402 111.2485 0.943611
SVR 100.243 117.3712 0.945837
SVR 91.51115 107.2708 0.954769
SVR 98.83634 115.3385 0.9366
SVR 94.36231 107.5964 0.947994
SVR 104.8689 121.624 0.927531
127
SVR 92.11001 108.4468 0.948885
SVR 109.8608 129.485 0.924026
SVR 89.07842 106.8196 0.950056
SVR 103.1966 122.565 0.921943
SVR 99.622 112.3747 0.943951
SVR 92.36352 110.1389 0.957031
SVR 90.81369 107.6235 0.948475
SVR 105.0717 127.2379 0.931516
SVR 100.7682 115.0609 0.947694
SVR 91.94235 112.9972 0.94265
SVR 95.84098 108.0575 0.934759
SVR 95.55339 112.6115 0.942497
SVR 91.08859 109.7267 0.952543
SVR 92.61422 107.3912 0.947799
SVR 97.09481 115.4407 0.938544
SVR 90.89481 108.9887 0.949821
SVR 96.53723 112.7428 0.947326
SVR 94.41556 112.9802 0.934302
SVR 101.7897 121.7234 0.919102
SVR 87.64503 108.7332 0.94324
SVR 100.2869 113.4994 0.94792
SVR 89.21886 102.1397 0.963119
SVR 104.0221 118.7827 0.933476
SVR 93.67412 107.9619 0.934338
128
SVR 98.18517 115.7734 0.949085
SVR 99.43682 119.8233 0.947145
SVR 100.7466 119.1365 0.920744
SVR 110.8865 130.4639 0.934299
SVR 84.74723 99.74755 0.963307
SVR 100.5082 120.4548 0.934576
SVR 101.529 119.2003 0.935642
SVR 100.1418 110.516 0.931083
SVR 85.23452 103.0854 0.960316
Coefficient of determination (R2), the Root Mean Square Error (RMSE) and the Mean Absolute Errors
(MAE)
129
Figure 2.1 The schematic view of the Deep Neural Network (DNN) algorithm. the green
boxes indicated the HVI as input variables and the orange box indicated the yield and
fresh biomass as output variables. The red circles indicated thee designed neurons for
DNN algorithms.
130
Figure 2.2 The schematic diagram of the Improved version of the strength Pareto
evolutionary algorithms-2 (SPEA2) as multi-objective optimization algorithm.
131
Figure 2.3 The experimental workflow of algorithm selection and validation to predict the
soybean fresh biomass (FBIO) and seed yield using hyperspectral vegetation indices
(HVI). DNN; deep neural network, EB; ensemble bagging. HVI were considered as input
and yield as well as FBIO were considered as output variables. Then, the dataset divided
into training and testing based on 5-fold cross validation strategy. All the tested machine
learning algorithms were trained based on training dataset and their efficiencies were
tested using testing dataset. The appropriate algorithm was selected based on the
comparative analysis of different machine learning algorithms in predicting soybean seed
yield using hyperspectral reflectance data.
132
Figure 2.4 The boxplot variation of A) yield and B) fresh biomass (FBIO) across four
environments. The current boxplot is based on yield and FBIO data of each genotype
collected from the four tested environments.
133
Figure 2.5 Pearson correlation analysis of hyperspectral vegetation indices (HVI),
soybean seed yield, and fresh biomass (FBIO). The colour coded heatmap scale is
provided.
134
Figure 2.6 Violin plots representing the coefficient of determination (R2), mean absolute
errors (MAE), and root mean square error (RMSE) of yield (A, B, and C) and fresh
biomass (D, E, and F) performance of the deep neural network (DNN) algorithm and
ensemble-bagging (EB) strategy for soybean yield and fresh biomass prediction using
hyperspectral vegetation indices (HVI).
135
Figure 2.7 The importance of the tested hyperspectral vegetation indices (HVI) based on
the recursive feature elimination (RFE) strategy for predicting soybean seed yield and
fresh biomass (FBIO).
136
Figure 2.8 Optimized hyperspectral vegetation indices (HVI) values in soybeans with
maximized seed yield and fresh biomass (FBIO).
137
3 Chapter 4: Application of Machine Learning and Genetic
Optimization Algorithms for Modeling and Optimizing
Soybean Yield Using its Component Traits
Published in PLOS ONE Journal
Mohsen Yoosefzadeh-Najafabadi,1 Dan Tulpan,2 Milad Eskandari1
1. Department of Plant Agriculture, University of Guelph, Guelph, ON N1G 2W1,
Canada
2. Department of Animal Bioscience, University of Guelph, Guelph, ON N1G 2W1,
Canada
*Corresponding author. Email: [email protected]
Yoosefzadeh-Najafabadi, M., Tulpan, D., & Eskandari, M. (2021). Application of machine learning and genetic optimization algorithms for modeling and optimizing soybean yield using its component traits. Plos one, 16(4), e0250665. https://doi.org/10.1371/journal.pone.0250665
138
3.1 Abstract
Improving genetic yield potential in major food grade crops such as soybean (Glycine
max L.) is the most sustainable way to address the growing global food demand and its
security concerns. Yield is a complex trait and reliant on various related variables called
yield components. In this study, the five most important yield component traits in soybean
were measured using a panel of 250 genotypes grown in four environments. These traits
were the number of nodes per plant (NP), number of non-reproductive nodes per plant
(NRNP), number of reproductive nodes per plant (RNP), number of pods per plant (PP),
and the ratio of number of pods to number of nodes per plant (P/N). These data were
used for predicting the total soybean seed yield using the Multilayer Perceptron (MLP),
Radial Basis Function (RBF), and Random Forest (RF), machine learning (ML)
algorithms, individually and collectively through an ensemble method based on bagging
strategy (E-B). The RBF algorithm with highest Coefficient of Determination (R2) value of
0.81 and the lowest Mean Absolute Errors (MAE) and Root Mean Square Error (RMSE)
values of 148.61 kg ha-1, and 185.31 kg ha-1, respectively, was the most accurate
algorithm and, therefore, selected as the metaClassifier for the E-B algorithm. Using the
E-B algorithm, we were able to increase the prediction accuracy by improving the values
of R2, MAE, and RMSE by 0.1, 0.24 kg ha-1, and 0.96 kg ha-1, respectively. Furthermore,
for the first time in this study, we allied the E-B with the genetic algorithm (GA) to model
the optimum values of yield components in an ideotype genotype in which the yield is
139
maximized. The results revealed a better understanding of the relationships between
soybean yield and its components, which can be used for selecting parental lines and
designing promising crosses for developing cultivars with improved genetic yield
potential.
3.2 Introduction
Soybean (Glycine max L. Merrill) is the world’s most widely grown leguminous crop and
an important oil and protein source for food and feed (Hashiguchi and Komatsu, 2017).
Because of its nutritional and pharmaceutical properties, soybean is also considered as
an important source of healthy plant-based food products in the human diet. While the
global demand for soybean is increasing significantly (Miransari, 2016), the current
average annual genetic gain for yield seem not to be able to cope with the growing
demand (Wilson, 2012). One of the probable main reasons for this low genetic gain is the
inefficient selection criteria that are currently used in breeding programs for selecting
genotypes with desirable genetic yield potentials (Ramasubramanian and Beavis, 2020).
Despite the major advances in molecular technologies and their potential
implications in breeding programs, generating reliable phenotypic data and analyzing big
datasets have been remained the major bottlenecks (Rebetzke et al., 2019). Plant
breeding programs, including soybeans, are continuously relying on the evaluation of
yield and important agronomic traits for making selections and defining commercial
140
products (Yuan et al., 2002). If a trait is under controlled by a few major genes, and so
with limited environmental effects, designing molecular marker tools would be sufficient
for selecting desirable genotypes in a given breeding population (Tester and Langridge,
2010). However, for complex traits such as yield, which are highly influenced by the
environment and controlled by numerous genes with minor effects, the dissection of traits
underlying the yield can be beneficial for selecting genotypes with improved genetic
potentials (Araus and Cairns, 2014). In general, soybean breeders have made extensive
use of the classical phenotypic selection approaches to evaluate and exploit total seed
yield as the main selection criterion in their cultivar development programs (Araus and
Cairns, 2014; Qiu et al., 2018). However, genetic gains for yield have generally been low
and inconsistent across different environments (Araus and Cairns, 2014). The general
low genetic gains for yield can be to a large extent because of the nature of this trait, in
which several secondary traits drive the final production directly or indirectly (Wilson,
2012). Thus, one possible way to increase the yield genetic gains in new cultivars is to
improve their yield component traits (Tester and Langridge, 2010; Araus and Cairns,
2014; Rebetzke et al., 2019).
Yield formation in plant species is mostly governed by yield component traits
(Richards, 2000; Kenga et al., 2006; Robbins and Staub, 2009). Traits such as the
number of nodes per plant (NP), number of non-reproductive nodes per plant (NRNP),
number of reproductive nodes per plant (RNP), and number of pods per plant (PP) are
141
considered as the major yield components in soybean (Specht et al., 1999). Therefore,
the increase in yield production can be regulated by selecting the soybeans with greater
performance in their yield components (Specht et al., 1999; Kumudini et al., 2001).
Several studies have been conducted to describe yield improvement in plants via
improving yield components (Jiang et al., 2020b; Majhi et al., 2020; Sah et al., 2020;
Xavier and Rainey, 2020). For example, positive correlations between soybean yield and
the number of pods (Jin et al., 2010) were reported to be the most significant contributor
to yield gain in Northeast China (Liu et al., 2005). By considering yield components,
soybean breeders can model and predict optimum conditions in which the highest yield
production can achieve (Ma et al., 2001). However, developing reliable prediction models
built upon several yield component traits requires dealing with large datasets that are
generated from the evaluation of large breeding populations across multi-environments.
Due to the essential need for advanced skills in computational and mathematical
analyses, exploiting large datasets in many public breeding programs is still a bottleneck.
Machine Learning (ML) algorithms, as one of the reliable and efficient computational
approaches, were successfully implemented in different fields of study, such as traffic
crash frequency modeling (Zeng et al., 2016a; Zeng et al., 2016b), environmental science
(Maganathan et al., 2020), engineering (Sha et al., 2020), and medicine (Lee et al., 2020).
Previously, Zeng et al. (2016b) developed neural network, as one of the ML algorithms
for exploring the non-linear relationship between risk factors and crash frequency. They
142
reported the successfulness of using neural network algorithms to eliminate the overfitting
and deal with lack-box characteristic (Zeng et al., 2016b). Also, several studies reported
the efficiency of ML algorithms in better detection of genomic regions associated with a
trait of interest (Duan et al., 2005; Szymczak et al., 2009; Lee et al., 2020).
One of the recent agriculture trends is the use of ML algorithms for analyzing big
data (Liakos et al., 2018; Yoosefzadeh-Najafabadi et al., 2021b). Emerging ML algorithms
in agriculture have created new opportunities to quantify and understand the intensive
data process in agriculture (Crane-Droesch, 2018; Liakos et al., 2018). In a simple form,
ML algorithms can be defined as machines with the ability to learn without explicitly
programmed (McQueen et al., 1995; Liakos et al., 2018). Theoretically, each algorithm is
involved in a specific learning process from training data to perform a task of clustering,
predicting, and classifying new datasets using the knowledge attained during the learning
process (Crane-Droesch, 2018). Various ML algorithms have been developed and can
be implemented for complex interactions between features (McQueen et al., 1995; Araus
and Cairns, 2014; Liakos et al., 2018; Niazian and Niedbała, 2020). Multilayer Perceptron
(MLP), for example, is known as the common neural network algorithm that is widely used
in different areas such as plant sciences (Yoosefzadeh-Najafabadi et al., 2021b), remote
sensing (Zhang et al., 2018a), environmental sciences (Wang and Gao, 2018), and
engineering (Yilmaz and Kaynar, 2011). Like other neural network algorithms, MLP is built
upon many neurons in which each neuron has its own specific weight (Bhojani and Bhatt,
143
2020). In any case that one neuron is insufficient to explain the algorithm, MLP will be
useful by providing multi-neurons (Araghinejad, 2013; Deore et al., 2020). Radial Basis
Function (RBF) is another ML algorithm commonly used in plant sciences (Heddam et
al., 2011; Chouhan et al., 2018; Hesami et al., 2019a). RBF is reported to be successful
for predictions wherever relevant features are used (Hesami et al., 2019a). However, its
performance for predicting soybean yield from its components is still unknown. Random
Forest (RF) is another algorithm that its performance was evaluated in this study. RF has
drawn many researchers’ attention because of adequate performances in various fields,
including plant science (De Castro et al., 2018; Yoosefzadeh-Najafabadi et al., 2021b),
animal science (Alsahaf et al., 2018; Tulpan, 2020), human science (Acharjee et al.,
2017), and remote sensing (Pal, 2005).
One of the major impediments of using individual ML algorithms is the high
probability of overfitting in single predictive algorithms (Feng et al., 2020b). To overcome
this obstacle, the ensemble techniques can be employed (Dietterich, 2000). Ensemble
techniques are known as the most influential development in the application of ML
algorithms (Seni and Elder, 2010), in which combined algorithms are exploited to improve
prediction accuracies by reducing overfitting rates (Dietterich, 2000; Seni and Elder, 2010;
Feng et al., 2020b). Three commonly used ensemble algorithms are stacking, boosting,
and bagging methods that are used according to the nature of the dataset and the
individual ML algorithms that are used (Seni and Elder, 2010). The success of using
144
ensemble techniques was reported in different areas such as plant science
(Yoosefzadeh-Najafabadi et al., 2021b), engineering (Aanonsen et al., 2009), and
computer sciences (Wang et al., 2012).
In soybean cultivar development programs, an optimum selection among important
yield components can significantly improve the yield genetic gain. Therefore, the
implementation of optimization methods in this field is of particular interest. Optimization
algorithms for improving important traits are becoming more and more attractive in plant
science (Hesami et al., 2019a; Hesami and Jones, 2020c). Genetic Algorithm (GA) is
known as one of the most well-known single objective optimization methods designed
and developed by Holland (1992), as a searching algorithm based on natural selection.
GA searching algorithm is based on Darwin’s notion that more stable organisms across
different environments survive better than the others (Yun et al., 2020).
Although the successful uses of ML ensemble methods have been reported in
different agriculture-related fields (Feng et al., 2020b; Hesami and Jones, 2020c; Hesami
et al., 2020c; Hesami et al., 2020d; Yoosefzadeh-Najafabadi et al., 2021b), the potential
use of these algorithms to predict soybean yield using yield components remains
unknown. Therefore, this study aimed to investigate the potential use of soybean yield
components for predicting the final seed yield using individual ML algorithms as well as
ensemble learning methods. In addition, linking the best machine learning algorithm with
GA for estimating the optimized values of the yield components for maximizing the
145
soybean yield was investigated. The outcomes of this study can pave the way to
understand the importance of soybean yield components in determining the total seed
yield and implement the proposed pipeline for making the genotypic selection more
accurate.
3.3 Material and methods
3.3.1 Plant material and experimental design
Two hundred and fifty soybean genotypes were grown under field conditions at two
locations: Palmyra (42°25'50.1"N 81°45'06.9"W, 195 m above sea level) and Ridgetown
(42°27'14.8"N 81°52'48.0"W, 200m above sea level), in Ontario, Canada in 2018-2019.
The population used in this study was selected from the core germplasms of soybean
breeding programs at the University of Guelph, Ridgetown campus that have been
created over 35 years and used for cultivar development and genetic studies. The full
description of the tested genotypes was presented in Table 2.1. The experiment was
conducted using randomized complete block designs (RCBD) with two replications in four
environments (two locations × two years), consisting of 2000 phenotypic plots in total.
Each plot consisted of five rows, each 4.2 m long and 40 cm spacing between each row.
The seeding rates used in this study were 50-57 seeds per m2.
146
3.3.2 Phenotypic evaluations
Seed yield (t ha-1) of each plot was measured by harvesting three middle rows and
adjusting to a 13% moisture level and maturity. At the R8 full maturity stage, soybean
yield components, including NP, NRNP, RNP, and PP, were hand-measured by randomly
selected ten plants from each phenotypic plot for each genotype in each environment.
Also, the PP to NP ratio (P/N) for each genotype was calculated using the following
equation:
𝑃/𝑁 =PP
NP (Eq 4.1)
where PP indicates the number of pods per plant, and NP indicates the total number of
nodes per plant.
3.3.3 Data pre-processing, correlation coefficient, and statistical analyses
All the phenotypic plot-based data were adjusted for spatial variations within the fields
using nearest neighbor analysis (NNA) as one of the spatial error control methods to
reduce and minimize the possible error in the field data (Stroup and Mulitze, 1991;
Katsileros et al., 2015b). The yield components were collected for 250 soybean
genotypes from 2000 plots across four environments. The average value of each yield
component for each genotype was estimated through the best linear unbiased prediction
(BLUP) as a mixed model (Robinson, 1991). For BLUP analysis, the environment factor
147
was selected as a fixed effect, and the genotype factor was considered as a random
effect. Afterward, all 250 data-points were used for constructing training and testing
datasets. In this study, all the yield component traits such as NP, NRNP, RNP, PP, and
P/N were considered as input variables for predicting the soybean yield as the output
variable. In order to improve the prediction accuracy of input variables, data scaling and
centering were applied for the pre-processing and pre-treatment steps (Rossel, 2008).
Before performing ML algorithms, the principal component (PC) analysis was applied to
identify outliers; however, no outlier was detected. The Pearson coefficient of correlations
between seed yield and yield components were estimated using the R software version
3.6.1.
3.3.4 Data-driven modeling
MLP, RBF, and RF are the most commonly used ML algorithms with distinct functions
and abilities that are used for predicting yield in plant crop species (Bhojani and Bhatt,
2020; Feng et al., 2020b; Geetha, 2020). The MLP, as one of the most common feed-
forward neural networks, consists of input, hidden, and output layers of interconnected
neurons (Gardner and Dorling, 1998). RBF is another type of neural network that used
approximate multivariate functions (Orr, 1996). RBF has the same functionality as that of
MLP but in an effective way for using in more than one dimension (Wilamowski and
Jaeger, 1996). RF is also another commonly used ML algorithm that generates combined
trees representing n number of independent observations (Liaw and Wiener, 2002). The
148
final prediction in RF is determined based on the average predictions of all possible
independent trees (Feng et al., 2020b). In order to improve the prediction performance of
individual ML algorithm, an ensemble method was proposed based on the bagging
strategy (E-B). The steps for the E-B analyses are as follows: (1) applying and training
MLP, RBF, and RF, independently; (2) selecting the best ML algorithm, based on the
validation set, as the metaClassifier for the E-B; and (3) combining the prediction results
in order to improve the prediction accuracy of the soybean yield (Feng et al., 2020a). All
the used parameters in each algorithm were optimized based on the training dataset. The
Weka software version 3.9.4 (Hall et al., 2009) was used to run all the tested ML
algorithms and the E-B method for analyzing training and validation sets.
3.3.5 Optimization process via GA
For optimizing the values of NP, PP, RNP, NRNP, and P/N in a theoretical maximized
yield ideotype genotype, the best ML algorithm was linked to the genetic algorithm (GA).
In order to have the best implement of GA, some parameters such as mutation rate,
crossover fraction, and the number of chromosomes should be determined. For that,
optimum values of the mentioned parameters are estimated by the trial and error
methods. In brief, a chromosome is known as a set of variables that define as a possible
solution to the problem. In this study, all the possible combinations of the yield component
traits were considered as different chromosomes to form the maximum soybean yield
production. All proposed chromosomes are constructed from the initial population, which
149
is the first step in the GA optimization process (Hesami and Jones, 2020c). Crossover in
GA is the process of creating a new generation of chromosomes by mixing the existing
chromosomes (Wang, 2003). In this study, the possible crossover between two
chromosomes, each containing a combination pattern of the yield components, was
evaluated by using a two-point crossover, which is known as one of the common
crossover fractions. Mutation is another parameter in GA, which is used to control the
local minima in the population (Wang, 2003). By using a mutation rate, the possibility of
having similar chromosomes will be decreased, and therefore, the possible local minima
are decreased (Hesami and Jones, 2020c). In this study, the mutation and crossover
rates as well as the number of chromosomes were set to 0.1, 0.7, and 100, respectively
(Figure 4.1). Also, the Roulette wheel was used for selecting elite populations for
crossover to obtain the appropriate fitness. In order to achieve the generation number,
the generational practice was iteratively implemented. The lower and upper bounds of the
dataset were considered as the constraints in the optimization process (Figure 4.1).
3.3.6 Quantification of model performance and error estimations
The original dataset consists of 250 observations was randomly divided into training and
validation sets based on the five k-fold cross-validation method (Siegmann and Jarmer,
2015) with ten repetitions (Figure 4.2). To quantify the performance of the ML algorithms
for predicting soybean seed yield from the yield components, the following statistical
measurements between independent reference values (Y) and estimated values (Y′)
150
were applied: The Root Mean Square Error (RMSE, Eq 4.2) and the Mean Absolute Errors
(MAE, Eq 4.3) of the validation set.
𝑅𝑀𝑆𝐸 = √∑(Y′−𝑌)2
𝑛 (Eq 4.2)
𝑀𝐴𝐸 =∑ |Y′
𝑖−𝑌𝐼|𝑛𝑖=1
𝑛 (Eq 4.3)
where Y′ stands for predicted value, Y is the field measured value, and n is the number
of observations for a given genotype.
The analyses of the goodness of fit between observed and the predicted values
were performed using the coefficient of determination (R2, Eq 4.4). While we provide their
definitions below, more comprehensive descriptions and definitions can be found in
(Taylor, 1997; Cacuci et al., 2005; Farifteh et al., 2007):
𝑅2 =𝑆𝑆𝑇−𝑆𝑆𝐸
𝑆𝑆𝑇 (Eq 4.4)
where SST stands for the sum of squares for total, and SSE stands for the sum of the
squares for error.
3.3.7 Visualizing and statistical analyzing
The results were visualized using the Microsoft Excel software (2019), ggvis (Chang and
Wickham, 2016), and ggplot2 (Wickham et al., 2016) packages in the R software version
151
3.6.1. All statistical computational procedures such as calculating BLUP and
preprocessing data were also conducted using MASS (Ripley et al., 2013) package in R
software.
3.4 Results
3.4.1 Pearson correlation analyses and individual ML evaluations
Based on the field data analyses, the average yield of 250 soybean genotypes was
estimated to be between 2.58 to 5.71 t ha-1 with a mean and standard deviation of 4.22
and 0.57 t ha-1, respectively.
The potential benefits of using each soybean yield component for predicting the
soybean seed yield was quantified using the Pearson coefficients of correlation among
all the measured traits. Based on the correlation coefficients, all the yield components,
except NRNP, were positively correlated with soybean seed yield. The linear correlation
between soybean seed yield and PP (r = 0.71) was found to be the strongest followed by
its correlation with NP (r =0.68), RNP (r =0.67), and P/N (r =0.64). The negative
correlation between soybean seed yield and NRNP was estimated as r = -0.29 (Figure
4.3).
Based on the results of correlation analyses, the top correlated variables were
iteratively added to the algorithms and updated the algorithm training performances until
152
all variables were included. The R2 values of each model were then calculated (Figure
4.4). Among all the tested ML algorithms, the R2 reached its maximum value of 0.81 in
RBF taking into account PP, NP, and RNP. No changes in R2 value were detected after
adding P/N and NRNP in the RBF algorithm. The maximum R2 value of 0.80 was obtained
for both RF and MLP when all the yield components are added into the algorithms (Figure
4.4).
3.4.2 Model performance and evaluation
The full analysis of ML algorithms are presented in the Table 4.1. Among the three tested
ML algorithms, RBF had the highest value for R2 (0.81) and the lowest values for MAE
(148.61 kg ha-1) and RMSE (185.31 kg ha-1) (Figure 4.5 A-C). The R2 values for MLP and
RF were the same (0.80); however, they had different values for MAE and RMSE. In
comparison with the MLP algorithm, RF had the lower MAE and RMSE with 156.28 kg
ha-1 and 194.75 kg ha-1, respectively (Figure 4.5 A-C). MLP had the highest MAE (172.98
kg ha-1), and RMSE (211.57 kg ha-1) among the ML algorithms. In addition to individual
evaluations of the three ML algorithms, an ensemble learning was also developed, which
outperformed all the individual machine learning algorithms, attaining an R2 value of 0.82.
The E-B method had the acceptable RMSE and MAE with a value of 184.35 kg ha-1 and
148.37 kg ha-1, respectively (Figure 4.5 A-C). In general,the R2 value of E-B increased
by 0.1, while the values of MAE and RMAS decreased by 0.24 kg ha-1 and 0.96 kg ha-1,
153
respectively, in comparison with RBF as the most accurate individual ML algorithm
identified in this study.
3.4.3 Optimization of the soybean seed yield using E-B-GA
The aim of the current study, not only was to predict soybean seed yield using yield
components, but also to estimate the optimum values of these traits, i.e., NP, PP, RNP,
NRNP, and P/N, to maximize the final yield production in a given genotype. The results
of the optimization process using GA, as the single objective evolutionary optimization
algorithm, are presented in Table 4.2. Theoretically, the highest soybean seed yield
production of 5.64 t ha-1 can be achieved in an ideotype soybean genotype in which the
values of NP, NRNP, RNP, PP, and P/N are 17.32, 3.07, 14.25, 48.98, and 2.83,
respectively.
3.5 Discussion
One of the objectives of this study was to investigate the potential use of soybean yield
components such as NP, PP, RNP, NRNP, and P/N for predicting the final seed yield
production. Many studies reported the reliance of the final yield production on the
performance of several yield-related traits (Ciampitti and Vyn, 2012; Dutamo et al., 2015;
O’Connor et al., 2018; Cao et al., 2020; Xavier and Rainey, 2020). In soybean, PP and
NP are considered as major players in determining the final seed yield (Egli, 2005; 2013;
154
Xavier and Rainey, 2020). In the current study, PP showed the highest linear correlation
with the total seed yield. The direct impact of the number of pods per plant on the final
soybean yield is also reported by Bastidas et al. (2008)]. Among all the tested yield
component traits in this study, NP showed the second-highest linear correlation with total
seed yield and showed a positive correlation with PP. Many studies reported that the
variations in the number of nodes per plant is usually accounted for the changes in the
number of pods per plant (Egli, 2005; Bastidas et al., 2008; Egli, 2013; Wei and Molin,
2020). A negative correlation between the total soybean seed yield and NRNP was found
in this study. It is in agreement with previous studies claimed that increasing the number
of non-reproductive nodes decreases the reproductive potential of soybean seed yield
(Egli, 2013; Du et al., 2019). The results of linear correlation analyses in this study
illustrated the importance of individual yield component traits in determining the total
soybean seed yield.
Conventional statistical methods such as ANOVA and simple regression methods
are typically recommended for small datasets with limited dimensions (Homack, 2001;
Rutherford, 2001). However, soybean yield is a complex trait under controlled by different
continuous variables called yield component traits. Therefore, more sophisticated
methods are required for analyzing large data sets with high dimensions (Vapnik, 2013).
The successful uses of ML algorithms for analyzing big data with multi-collinearity among
the variables have recently been reported in many plant species such as soybean
155
(Yoosefzadeh-Najafabadi et al., 2021b), alfalfa (Feng et al., 2020b), chrysanthemum
(Hesami et al., 2019b), wheat (Montesinos-López et al., 2017a), and lime (Jafari and
Shahsavar, 2020). The prediction performance of a given machine learning algorithm
refers to the power of the model in predicting the values of a dependent variable when
non-representative samples, or samples from a different population, are used as the test
population (Rocha et al., 2018). The prediction performance of an ML algorithm is
estimated using its R2, RMSE, and MAE values (James et al., 2013; Kuhn and Johnson,
2013; Rocha et al., 2018). In this study, the three common ML algorithms, RBF, MLP,
and RF, were used to predict the soybean seed yield using its components and their
prediction performance were estimated. RBF was found to be the most accurate ML
algorithm for predicting the soybean seed yield from its component traits. In general, yield
components in soybean are traits with low heritability that are influenced by environmental
factors. The environmental factors can bring some levels of instability/noise in the results
of all the ML analyses (Xavier et al., 2016a). However, the structure of RBF (Figure 4.6)
gives some level of robustness against the adversarial noises, compared to other tested
ML algorithms (Chenou et al., 2019; Langenberg et al., 2019). The specific structure of
RBF is the use of the transfer function of input variables to hidden layer name radial basis
function (Orr, 1996; Bawazeer et al., 2021). This function plays an important role in
reducing noises of input variables resulted in more accurate prediction performance
(Jiang et al., 2016).
156
Although RF and MLP had the same R2 values, the MAE and RMSE values were
lower in RF. MLP, as one of the neural network algorithms, is highly susceptible to
possible instabilities/noises caused by non-heritable factors. The MAE and RMSE values
of this algorithm were the highest among all the tested ML algorithms that may indicate
the sensitivity of the algorithm to noises. As a result, using this algorithm may not
recommend for analyzing phenotypic data that are largely affected by environmental
factors. RF with the R2, MAE, and RMSE values of 0.80, 156.28 kg ha-1, and 194.75 kg
ha-1, respectively, was selected as the second-best ML algorithm for predicting soybean
seed yield in this study. Although the difference between RF and RBF in terms of R2
values was not statistically significant (data are not shown), they were statistically different
for their MAE and RMSE values. The performance of the RF algorithm relies on
processing large dimensional data based on generalized error estimation (Rodriguez-
Galiano et al., 2015; Zhang et al., 2020). Also, there is no assumption requirement for RF
about the distribution of data (Melesse et al., 2020), and this algorithm can isolate outliers
in a small region of the variable space resulted in acceptable performance against
nonlinear environmental effects (De'ath and Fabricius, 2000; Melesse et al., 2020).
In addition to individual comparison of the three tested ML algorithms, we developed
a bagging ensemble model by combining the results of RBF, MLP, and RF in this study.
Since the RBF had the highest performance in predicting the soybean seed yield, this
algorithm was chosen as the metaClassifier for developing the E-B algorithm. Using the
157
E-B model, a slight improvement was obtained in predicting the total soybean seed yield
from its component traits. Diversity and sufficiency are two of the most important
principles in selecting base learners for an ensemble model (Feng et al., 2020a). It means
that the dependency among the used ML algorithms in the ensemble model should be
minimized, while each based learner should have an acceptable predicting capability as
well (Araya et al., 2017; Zhang et al., 2019b). Therefore, we selected different ML
algorithms as the base learners for the E-B with different training data mechanisms. Also,
the performance of the E-B was optimized by implementing RBF as the metaClassifier
for this model. The effectiveness of using ensemble models was reported in different plant
species such as chrysanthemum (Hesami et al., 2020a), sorghum (Kosmowski and
Worku, 2018), wheat (Tian et al., 2011b), alfalfa (Feng et al., 2020a), and brassicas (Qi,
2012). This study demonstrates the benefit of using the RBF-based E-B approach to
improve the soybean yield prediction accuracy using yield components.
Selecting high-yielding lines has always been one of the major goals in plant
breeding programs that can be performed using either direct or indirect selection
approaches (Slinkard et al., 2000). Selecting superior genotypes based on the yield
component traits can be considered as an indirect method. An analytical breeding
strategy is an alternate breeding approach that is focused on the improvement of
components of complex traits such as plant growth and development rates or yield
components (Acevedo and Aleppo, 1991) rather than the traits per se. This strategy can
158
improve genetic yield potential in varieties by setting up selection criteria on yield
components and making the selection more efficient (Reynolds, 2001). In order to move
toward analytical breeding, it would be important to have the optimized level of each yield
component traits in target populations. Genetic algorithm is commonly used in finding
optimized solutions by searching problems through biological parameters such as
selection, crossover, and mutation (Dasgupta et al., 2013; Hesami and Jones, 2020c).
After selecting E-B as the combined algorithm with the highest prediction ability in this
study, GA was linked into this algorithm to estimate the optimum values of the yield
component traits (Table 1). The effectiveness of using the GA algorithm for estimating
optimized solutions was reported previously in plant tissue culture (Hesami et al., 2019b),
plant physiology (Halim and Ismail, 2013), and remote sensing (Wu et al., 2017).
3.6 Acknowledgments
The authors are grateful to the past and current members of Eskandari laboratory at the
University of Guelph, Ridgetown Campus, including Dr. Sepideh Torabi, Mr. Bryan
Stirling, Mr. John Kobler, and Mr. Robert Brandt for their technical support. We would also
like to thank Mrs. Maryam Vazin for her assistance with the field data collection.
159
Table 3.1 Analysis performance of Random Forest (RF), Multilayer Perceptron (MLP), and Radial Basis Function (RBF) algorithms, and the Ensemble-Bagging (E-B) strategy for soybean yield prediction using yield component traits such as The number of nodes per plant (NP), number of non-Reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods per Plant (PP), and ratio of number of pods to number of nodes per plant (P/N).
Algorithm MAE (Kg ha-1) RMSE (Kg ha-1) Coefficient of determination (R2)
RF 178.7628 221.3154 0.766184
RF 149.9257 190.1973 0.824873
RF 195.2948 244.9381 0.783472
RF 158.4371 186.9633 0.545133
RF 162.9635 197.5527 0.820895
RF 141.4723 177.8491 0.892779
RF 136.1359 175.7995 0.87027
RF 190.0622 232.3531 0.692505
RF 138.5809 176.5637 0.804284
RF 159.3969 182.2693 0.812292
RF 148.2689 182.6683 0.900656
RF 167.9388 207.3602 0.800588
RF 154.2294 187.3587 0.714045
RF 168.8059 206.1508 0.836931
RF 143.3354 185.6513 0.825753
RF 164.4261 193.7816 0.726271
RF 145.82 168.8407 0.930298
160
RF 148.9438 205.5292 0.75625
RF 157.872 203.142 0.736331
RF 146.8995 196.8535 0.724496
RF 177.6075 222.6089 0.87656
RF 143.2936 185.0717 0.747739
RF 130.328 159.4461 0.889543
RF 133.714 171.697 0.890703
RF 188.8504 218.2928 0.708521
RF 161.8679 196.3354 0.825907
RF 153.3627 191.2781 0.772181
RF 160.157 205.1084 0.857359
RF 200.2354 236.3703 0.61441
RF 127.769 169.0819 0.995297
RF 182.285 220.6546 0.856008
RF 149.9078 177.8497 0.749269
RF 148.4835 184.1446 0.738688
RF 142.302 187.7328 0.871173
RF 163.3174 214.4299 0.909112
RF 196.0316 242.842 0.709007
RF 150.1975 188.1803 0.807753
RF 150.36 186.1467 0.867454
RF 142.3568 173.0309 0.768005
161
RF 140.9918 177.5659 0.818197
RF 148.6688 186.3253 0.786482
RF 148.4959 186.2586 0.863586
RF 138.4413 164.3384 0.81034
RF 170.5698 214.3142 0.829237
RF 165.9738 204.0655 0.76936
RF 175.2973 212.9283 0.732842
RF 154.5922 199.6483 0.779811
RF 164.71 213.779 0.868679
RF 141.0413 191.7986 0.829673
RF 163.0976 191.7426 0.738629
RF 148.0027 178.9045 0.836692
RF 181.4252 223.3686 0.784269
RF 152.5579 189.5512 0.82343
RF 170.461 207.7199 0.734849
RF 141.7312 176.5751 0.735335
RF 157.3465 200.8962 0.869518
RF 145.9078 187.7716 0.875962
RF 181.4088 213.8097 0.688733
RF 136.9581 170.6061 0.817263
RF 128.4578 167.8907 0.833361
RF 134.8502 184.0278 0.898587
162
RF 153.402 200.3877 0.797845
RF 141.3659 173.3029 0.876889
RF 135.4279 167.1626 0.757907
RF 158.5733 183.4167 0.813396
RF 182.4223 227.7454 0.929062
RF 156.8169 202.0108 0.752552
RF 189.7572 230.0217 0.610582
RF 147.8972 184.1388 0.844133
RF 150.5905 210.643 0.790229
RF 170.2709 212.5736 0.735781
RF 166.4571 199.5631 0.883256
RF 139.4238 178.0832 0.800048
RF 157.0009 196.2237 0.860542
RF 135.3621 169.3442 0.844932
RF 178.0886 222.583 0.77273
RF 144.9649 185.4615 0.691444
RF 174.6506 210.4431 0.764172
RF 163.1946 200.6054 0.807037
RF 115.3843 153.158 0.890578
RF 168.763 193.9537 0.838
RF 159.3417 214.9731 0.646471
RF 160.0949 202.2082 0.783883
163
RF 181.7538 218.347 0.737263
RF 151.244 185.6695 0.820413
RF 166.8557 192.8271 0.893896
RF 132.159 169.5062 0.70914
RF 131.7293 160.3757 0.893691
RF 133.2249 174.1565 0.762251
RF 193.9929 236.8308 0.816991
RF 165.0995 197.0843 0.769782
RF 140.4964 183.8125 0.936378
RF 154.0391 182.249 0.869982
RF 109.9387 158.5995 0.917128
RF 163.8293 201.346 0.657122
RF 139.5893 181.1232 0.857129
RF 154.5968 191.6562 0.713304
RF 155.9699 193.7374 0.783702
RF 203.071 239.3553 0.486035
RF 150.6367 193.4698 0.698959
MLP 172.7595 217.5933 0.780432
MLP 200.6261 252.9527 0.801682
MLP 169.3347 218.3947 0.857704
MLP 166.7933 203.4841 0.602116
MLP 144.2937 190.6019 0.848869
164
MLP 143.857 170.3943 0.753572
MLP 126.8736 164.798 0.705991
MLP 231.2917 284.9384 0.828591
MLP 226.1779 268.8094 0.831738
MLP 231.1072 269.5651 0.791386
MLP 201.0398 246.1417 0.881213
MLP 167.2376 197.22 0.866975
MLP 158.2354 201.3958 0.737549
MLP 190.92 218.587 0.86469
MLP 148.1788 186.9506 0.842144
MLP 146.7581 191.6821 0.756363
MLP 137.9938 157.613 0.755116
MLP 208.8802 241.2391 0.786219
MLP 159.0894 209.0648 0.815781
MLP 145.0941 203.2909 0.744231
MLP 187.699 229.2773 0.86999
MLP 216.459 261.1971 0.756014
MLP 135.9898 177.9439 0.864208
MLP 235.367 262.5487 0.804906
MLP 164.5036 187.6228 0.774705
MLP 169.4086 217.1101 0.850425
MLP 135.6038 160.3304 0.853235
165
MLP 174.78 196.1875 0.716394
MLP 161.6382 196.0674 0.706674
MLP 98.44332 122.9621 0.842102
MLP 186.9764 245.0146 0.822978
MLP 109.0222 137.5278 0.770514
MLP 166.0958 207.0647 0.796704
MLP 202.1948 229.1342 0.828122
MLP 201.4795 242.5588 0.786087
MLP 235.1983 288.2606 0.73038
MLP 153.1835 184.1012 0.82238
MLP 155.4043 194.6755 0.769831
MLP 261.0962 304.2983 0.783093
MLP 152.1932 183.7509 0.828472
MLP 161.1597 212.5492 0.778382
MLP 118.8771 164.4517 0.768988
MLP 132.8556 157.3079 0.77531
MLP 190.1359 233.156 0.862115
MLP 167.8941 207.3835 0.783019
MLP 168.872 205.7792 0.74932
MLP 151.1987 195.7471 0.780283
MLP 170.4522 198.8213 0.914687
MLP 146.5039 196.3535 0.828794
166
MLP 130.429 157.1896 0.831685
MLP 148.6512 188.2507 0.825356
MLP 173.4913 212.3744 0.806248
MLP 168.5609 211.869 0.793968
MLP 166.3879 196.8307 0.774842
MLP 145.2993 188.9879 0.716266
MLP 149.7417 188.7923 0.889262
MLP 132.4694 158.6575 0.902521
MLP 169.8295 205.7533 0.7742
MLP 195.2361 240.853 0.809339
MLP 165.2936 188.3967 0.720568
MLP 137.1391 179.4708 0.711557
MLP 343.7492 391.8631 0.813609
MLP 133.5503 162.2748 0.866505
MLP 131.7657 164.7783 0.748885
MLP 136.6483 178.058 0.847794
MLP 177.7942 216.2273 0.755567
MLP 144.0417 175.1276 0.844891
MLP 195.2948 251.2979 0.622138
MLP 143.8909 180.552 0.887983
MLP 156.2171 211.643 0.807022
MLP 197.7098 233.4665 0.766644
167
MLP 168.9159 202.061 0.905567
MLP 202.618 246.1472 0.824288
MLP 209.1927 254.7927 0.846966
MLP 171.9806 214.3512 0.841538
MLP 156.5163 195.6712 0.827064
MLP 189.4516 219.3236 0.68501
MLP 202.7257 239.2442 0.807838
MLP 222.0311 249.4734 0.86223
MLP 164.8277 187.0467 0.892593
MLP 163.8334 198.9956 0.856719
MLP 142.352 190.0295 0.739342
MLP 185.9694 234.6922 0.834398
MLP 207.9813 253.4011 0.761064
MLP 242.326 292.8608 0.854342
MLP 142.567 174.978 0.707284
MLP 175.0444 203.6102 0.753441
MLP 154.5905 180.7022 0.88873
MLP 175.7215 230.6778 0.810041
MLP 181.9961 221.9387 0.834332
MLP 204.4311 253.3968 0.780615
MLP 121.9059 150.4393 0.959347
MLP 176.2471 203.9698 0.838498
168
MLP 220.3621 245.4573 0.918385
MLP 143.5251 185.0442 0.704066
MLP 193.3929 247.4376 0.858957
MLP 133.1788 169.0064 0.751028
MLP 141.1103 169.8037 0.83664
MLP 215.6617 270.1576 0.505873
MLP 257.6033 296.671 0.715667
RBF 183.6138 233.221 0.742965
RBF 154.2808 193.7745 0.820005
RBF 143.5776 212.9047 0.830351
RBF 151.1534 180.7979 0.579802
RBF 142.3292 173.993 0.855718
RBF 143.5067 172.3785 0.900257
RBF 129.8851 170.7283 0.879767
RBF 158.327 188.5379 0.80311
RBF 137.4389 169.0885 0.818078
RBF 142.8674 173.7702 0.834642
RBF 155.428 208.4045 0.859551
RBF 127.8367 167.6985 0.868803
RBF 155.2717 187.98 0.72971
RBF 159.2594 196.2965 0.851342
RBF 145.7332 186.4238 0.823136
169
RBF 140.1722 181.7567 0.762889
RBF 136.9643 158.6364 0.935984
RBF 161.9318 197.444 0.771353
RBF 150.1286 189.948 0.789082
RBF 157.2294 215.6805 0.706224
RBF 163.7579 213.0716 0.881643
RBF 136.7781 173.1275 0.781717
RBF 129.7098 171.1569 0.869125
RBF 138.8365 175.307 0.875032
RBF 151.6126 178.6148 0.795283
RBF 158.8418 199.1415 0.843254
RBF 126.6348 157.6121 0.840598
RBF 142.4766 187.3825 0.710544
RBF 200.4364 229.3553 0.677522
RBF 103.9961 129.3797 0.931323
RBF 141.2857 169.9702 0.921905
RBF 140.7124 168.8831 0.782043
RBF 137.1646 168.7338 0.782994
RBF 133.134 164.1681 0.922924
RBF 178.4739 230.8923 0.768754
RBF 190.426 233.797 0.74503
RBF 133.7204 171.3819 0.808639
170
RBF 154.9224 196.7903 0.871257
RBF 150.331 172.045 0.775318
RBF 135.5637 166.4849 0.843777
RBF 142.8767 181.9715 0.795573
RBF 127.5752 161.77 0.895073
RBF 136.9867 169.1653 0.743812
RBF 168.7578 204.0289 0.854152
RBF 162.0003 197.4827 0.784401
RBF 176.6895 222.8623 0.717401
RBF 159.6183 206.0398 0.759635
RBF 151.3934 177.5788 0.911078
RBF 154.743 200.9931 0.818175
RBF 141.6372 172.916 0.801914
RBF 153.4421 198.3745 0.813678
RBF 162.2073 196.0599 0.836618
RBF 161.3724 199.8997 0.812045
RBF 170.6379 199.1354 0.762521
RBF 153.5326 187.0526 0.690971
RBF 144.5773 181.9399 0.892574
RBF 129.3503 166.7908 0.908665
RBF 150.9481 183.4863 0.77557
RBF 143.3755 182.2115 0.812386
171
RBF 101.1158 132.3554 0.909606
RBF 116.2528 159.9406 0.929833
RBF 150.3993 196.5136 0.812119
RBF 142.125 168.2098 0.870993
RBF 133.3209 165.6612 0.743889
RBF 148.696 170.7859 0.844676
RBF 177.7905 215.8261 0.755023
RBF 150.9376 181.0354 0.808973
RBF 181.3793 239.2106 0.597383
RBF 125.9981 158.3065 0.885233
RBF 159.148 207.9001 0.79668
RBF 168.702 209.1337 0.734244
RBF 150.6802 180.0562 0.913315
RBF 130.665 168.7718 0.834703
RBF 176.3399 211.3994 0.845549
RBF 130.1258 164.8486 0.855814
RBF 158.3807 194.6542 0.82766
RBF 149.738 187.2562 0.688145
RBF 155.8235 196.5279 0.806485
RBF 135.5634 172.8242 0.850478
RBF 118.5061 150.6716 0.892534
RBF 163.792 192.0888 0.85329
172
RBF 162.8707 206.0287 0.677389
RBF 153.2241 183.6246 0.827782
RBF 169.817 207.3453 0.768112
RBF 148.8906 178.6115 0.838454
RBF 151.5278 177.2087 0.8991
RBF 117.3948 152.5232 0.75611
RBF 143.7546 174.0319 0.882123
RBF 107.5289 157.072 0.81195
RBF 189.9371 224.5631 0.831525
RBF 150.3458 190.7095 0.783228
RBF 124.2933 153.913 0.949476
RBF 168.0047 193.4269 0.85032
RBF 123.3812 159.9852 0.923831
RBF 145.5034 182.8422 0.692957
RBF 153.3721 185.3318 0.851531
RBF 150.0027 190.8469 0.69351
RBF 134.2136 167.6947 0.835008
RBF 187.8876 233.8888 0.529072
RBF 136.4447 181.0933 0.734279
E-B 183.2795 226.4269 0.760125
E-B 158.7388 197.9632 0.811211
E-B 138.9816 195.5484 0.85706
173
E-B 157.427 184.4039 0.566074
E-B 153.1076 182.1446 0.842777
E-B 135.9 166.1241 0.909051
E-B 125.8501 166.0372 0.883882
E-B 153.9751 185.6232 0.806269
E-B 132.8204 162.7451 0.830713
E-B 152.0976 182.3546 0.809535
E-B 152.3521 197.8197 0.870155
E-B 132.6661 172.5216 0.861593
E-B 152.6118 179.918 0.734378
E-B 157.1289 191.2173 0.858173
E-B 151.6467 187.7465 0.821321
E-B 156.2297 189.5207 0.739623
E-B 134.4548 158.9776 0.93652
E-B 156.8858 189.8088 0.789772
E-B 151.1254 189.4081 0.793575
E-B 137.703 194.9087 0.727458
E-B 167.3076 221.0777 0.877
E-B 136.2339 170.8039 0.787356
E-B 128.6773 170.656 0.873827
E-B 138.0543 176.1757 0.874227
E-B 154.1638 183.4862 0.783918
174
E-B 154.8244 193.3933 0.844108
E-B 136.871 164.7964 0.832547
E-B 158.5593 200.0827 0.671116
E-B 192.5827 218.3656 0.707427
E-B 102.409 127.7135 0.934456
E-B 132.2051 170.6688 0.927864
E-B 140.9156 168.9344 0.777631
E-B 136.3336 165.9878 0.789646
E-B 133.3576 164.0796 0.922245
E-B 182.7726 231.2405 0.76968
E-B 195.9902 236.992 0.744964
E-B 137.6149 174.0405 0.810837
E-B 143.9593 189.4716 0.870914
E-B 148.7182 172.3549 0.771355
E-B 134.8865 168.0054 0.839248
E-B 150.7692 193.0609 0.771416
E-B 126.5058 159.9068 0.898655
E-B 129.4275 161.363 0.76353
E-B 168.6478 203.2126 0.855768
E-B 156.2701 198.9763 0.776954
E-B 173.3665 215.7703 0.732595
E-B 157.8172 200.3769 0.777714
175
E-B 152.7948 199.1369 0.900538
E-B 146.3063 194.6961 0.819759
E-B 135.8349 165.4279 0.814491
E-B 148.6061 193.467 0.811268
E-B 164.4626 209.4214 0.817501
E-B 166.5085 208.0148 0.800263
E-B 160.7459 193.2639 0.769775
E-B 152.2019 182.9039 0.707946
E-B 147.0838 185.9628 0.888372
E-B 125.6281 164.7244 0.914783
E-B 153.7599 186.1348 0.769928
E-B 146.1042 184.4388 0.808546
E-B 99.57982 134.1271 0.918512
E-B 123.7468 182.1538 0.911804
E-B 149.7901 193.9552 0.815593
E-B 140.4084 171.2181 0.860484
E-B 132.9433 163.8346 0.750176
E-B 144.8486 165.7402 0.857833
E-B 182.0624 223.981 0.737045
E-B 138.1152 166.9749 0.834598
E-B 182.5969 228.1908 0.618374
E-B 135.1193 164.8735 0.87525
176
E-B 155.9421 203.3862 0.800791
E-B 170.2542 205.8841 0.742724
E-B 145.8695 175.8282 0.913357
E-B 139.8621 179.9638 0.807157
E-B 170.7393 207.0985 0.850856
E-B 140.3855 177.5433 0.839968
E-B 146.8663 190.462 0.827966
E-B 152.8863 192.7258 0.685114
E-B 152.4733 194.3408 0.802668
E-B 131.9154 172.8414 0.858238
E-B 120.0153 152.5974 0.893512
E-B 168.0297 196.0895 0.847438
177
Table 3.2 Optimizing the number of nodes per plant (NP), number of non-
Reproductive nodes per plant (NRNP), number of reproductive nodes per plant (RNP),
number of pods per Plant (PP), and ratio of number of pods to number of nodes per
plant (P/N) according to the E-B-GA for maximizing soybean seed yield.
Input variables Predicted Yield
(t ha-1) NP NRNP RNP PP P/N
17.32 3.07 14.25 48.98 2.83 5.64
178
Figure 3.1 The schematic diagram of the genetic algorithm as the single objective
evolutionary optimization algorithm. The mutation and crossover rates as well as the
number of chromosomes were set to 0.1, 0.7, and 100, respectively. Also, the Roulette
wheel was used for selecting elite populations for crossover to obtain the appropriate
fitness. In order to achieve the generation number, the generational practice was
iteratively implemented. The lower and upper bounds of the dataset were considered as
the constraints in the optimization process
179
Figure 3.2 The experimental workflow of algorithm selection and validation for predicting
the soybean seed yield. The number of nodes per plant (NP), number of non-reproductive
nodes per plant (NRNP), number of reproductive nodes per plant (RNP), number of pods
per plant (PP), ratio of number of Pods to number of Nodes per plant (P/N), and genetic
algorithm (GA). Yield components were considered as input and yield was considered as
the output variable. Then, the dataset divided into training and testing based on 5-fold
cross validation strategy. All the tested machine learning algorithms were trained based
on training dataset and their efficiencies were tested using testing dataset. The
appropriate algorithm was selected based on the comparative analysis of different
machine learning algorithms in predicting soybean seed yield using hyperspectral
reflectance data.
180
Figure 3.3 Pearson correlation analysis of soybean yield component traits at α=0.05. The
number of nodes per plant (NP), number of non-reproductive nodes per plant (NRNP),
number of reproductive nodes per plant (RNP), number of pods per plant (PP), and ratio
of number of Pods to number of Nodes per plant (P.N). The colour coded heatmap scale
is provided.
181
Figure 3.4 Model training accuracy for Random Forest (RF), Multilayer Perceptron (MLP),
and Radial Basis Function (RBF) algorithms by adding variables based on the correlation
results. The number of nodes per plant (NP), number of non-reproductive nodes per plant
(NRNP), number of reproductive nodes per plant (RNP), number of pods per plant (PP),
and ratio of number of Pods to number of Nodes per plant (P/N).
182
Figure 3.5 (A) coefficient of determination (R2), (B) the Root Mean Square Error (RMSE)
and (C) the Mean Absolute Errors (MAE) performance of Random Forest (RF), Multilayer
Perceptron (MLP), and Radial Basis Function (RBF) algorithms, and the Ensemble-
Bagging (E-B) strategy for soybean yield prediction using yield component traits. The
mean performance is indicated with an + sign in each Figure.
184
4 Chapter 5: Identification of genomic regions underlying
soybean yield and its components using conventional and
machine learning mediated Genome-Wide Association
Studies
4.1 Abstract
Soybean [Glycine max (L.) Merr.] is the largest oilseed crop in the world, which has shown
relatively low historical genetic gain for seed yield over the past several decades. It is
conceivable that a better understanding of the genetic architecture of soybean yield
components may facilitate increasing the rate of genetic gain of this important trait.
Genome-wide association study (GWAS) is currently one of the few approaches for
discovering genomic regions associated with traits of interest. However, insufficient
statistical power is the limiting factor in using conventional GWAS for characterizing
quantitative traits, especially in plant species with narrow genetic bases. Machine learning
algorithms implemented in GWAS have shown great potential for detecting plausible
genetic and biological correlations for complex traits such as soybean yield and its
components. In this study, we evaluated the potential use of two most commonly used
machine learning algorithms of support vector machine (SVR) and random forest (RF) in
185
GWAS, along with two conventional methods of mixed linear models (MLM) and fixed
and random model circulating probability unification (FarmCPU), for identifying genomic
regions associated with soybean yield components. Important soybean yield component
traits, including the number of reproductive nodes per plant (RNP), number of non-
reproductive nodes per plant (NRNP), number of nodes per plant (NP), and number of
pods per plant (PP) along with yield and maturity were evaluated using data on a genomic
panel consisting of 227 soybean genotypes evaluated across four environments. Our
results indicated high correlations between soybean yield and its components. SVR-
mediated GWAS outperformed RF, MLM and FarmCPU in discovering the most relevant
quantitative trait loci (QTL) associated with maturity, yield and its components supported
by the functional annotation of candidate gene analyses. Most of the QTL and/or their
associated candidate genes identified by SVR-mediated GWAS were previously reported
as associated with the traits of interest in different genetic backgrounds. The results of
this study indicate that sophisticated mathematical approaches such as ML algorithms
may improve the efficiency of GWAS for identifying QTL suitable for genomic-based
breeding programs.
4.2 Introduction
Soybean (Glycine max [L.] Merr.) is known as one of the most important strategic crops
with substantial economic value (Rębilas et al., 2020). Soybean is widely used for food,
186
feed, fiber, biodiesel, and green manure (Temesgen and Assefa, 2020). Despite the
importance of genetic improvement in soybean yield, the soybean germplasm has in
general a narrow genetic basis, especially within North American germplasm, which has
resulted in limited enhancement of the genetic gain, historically (Xavier and Rainey,
2020). Therefore, there is a great need for analytical breeding to explore the optimum
genetic potential of soybean (Suhre et al., 2014; Mangena, 2020).
Analytical breeding strategy as an alternate breeding approach requires a better
understanding of the factors, or individual traits, responsible for the development, growth,
and yield (Richards, 1982b). This strategy considers highly correlated secondary traits
with the trait of interest as the selection criteria that can make empirical selection more
efficient for improving the genetic gain (Richards, 1982b; Reynolds, 2001; Xavier and
Rainey, 2020). The application of the analytical approaches in plant breeding programs
has been limited due mainly to lack of sufficient resources, as they are time and labor-
consuming (Richards, 1982b; Xavier et al., 2018). Therefore, breeders are restricted to
evaluating secondary traits in a small number of genotypes, which results in the limitation
of the knowledge in the genome-to-phenome analysis process (Robinson et al., 2009;
Kahlon et al., 2011; Nico et al., 2019).
Yield potential in soybean is mainly determined by the following yield component
traits: the total number of pods, nodes, reproductive nodes, non-reproductive nodes, and
pods per plant (Reynolds, 2001; Pedersen and Lauer, 2004; Xavier et al., 2018; Xavier
187
and Rainey, 2020; Yoosefzadeh-Najafabadi et al., 2021c). Among the above traits, the
total number of nodes and pods play more important roles in seed yield production than
others (Robinson et al., 2009; Yoosefzadeh-Najafabadi et al., 2021c). Several studies
reported a steady increase in the total number of nodes and the total number of pods in
soybean cultivars from 1920 to 2010 (Kahlon et al., 2011; Suhre et al., 2014; Xavier and
Rainey, 2020). These findings may highlight the importance and potential use of the
phenotypic and genotypic information on these traits, along with yield per se, as selection
criteria in cultivar development programs (Ma et al., 2001).
Genetic studies of soybean yield component traits can accelerate the breeding
process more accurately (Xavier and Rainey, 2020). Genome-Wide Association Studies
(GWAS), as one of the common genetic approaches, can be implemented on diverse
populations to detect the quantitative trait loci (QTL) associated with the soybean yield
component traits (Kaler et al., 2020). Associated QTL can be used for screening large
soybean populations in a short time with less elaborate efforts (Xavier et al., 2018).
Several GWAS algorithms have been developed for genetic studies, such as mixed linear
models (MLM), multiple loci linear mixed model (MLMM), and fixed and random model
circulating probability unification (FarmCPU) (Kaler et al., 2020). However, due to the
narrow genetic base of some plant species, including soybean, the conventional
approaches may not have enough statistical power to detect reliable QTL (Kaler et al.,
2020; Mohammadi et al., 2020; Xavier and Rainey, 2020). Therefore, the development of
188
more sophisticated statistical methods is required in order to establish effective GWAS
methods for plant species with a narrow genetic base.
Current GWAS methods are based on the conventional statistical methods that are
useful for studying less complex traits in plant species with broader genetic bases (Lipka
et al., 2015; Pasaniuc and Price, 2017). Machine learning (ML) algorithms as powerful
and reliable mathematical methods can be considered as an alternative to conventional
statistical methods for performing GWAS, which are efficient for studying more complex
traits in plants with narrow genetic base (Xavier and Rainey, 2020). Recently, the use of
ML algorithms has been reported in different areas such as plant science (Hesami et al.,
2020d; Yoosefzadeh-Najafabadi et al., 2021b), animal science (Tulpan, 2020), human
science (Chen and Verghese, 2020), engineering (Kim et al., 2020a), and computer
science (Jordan and Mitchell, 2015). The application of ML algorithms in GWAS was
previously investigated in humans by Szymczak et al. (2009). They explained a possible
use of different ML algorithms such as artificial neural networks (ANN), Bayesian network
analysis (BNA), and random forests (RF) in GWAS for human disease studies (Szymczak
et al., 2009). One of the most common used ML algorithms is RF developed by Breiman
(2001), which generates a series of trees from the independent samples for better
prediction performance (Meinshausen, 2006). The latter algorithm has been widely used
in plant genomics (Ogutu et al., 2011), phenomics (Yoosefzadeh-Najafabadi et al.,
2021b), proteomics (Jamil et al., 2020), and metabolomics (Sun et al., 2020b).
189
The first and only use of the RF-mediated GWAS in soybean, for detecting the
genomics association in soybean yield component traits, was reported by Xavier and
Rainey (2020). Support vector machine (SVM) is another common algorithm that can
detect behavior and patterns of nonlinear relationships (Auria and Moro, 2008; Su et al.,
2017; Hesami and Jones, 2020a). Theoretically, SVM should have high performance due
to the use of structural risk minimization instead of the empirical risk minimization
inductive principles (Belayneh et al., 2014; Yoosefzadeh-Najafabadi et al., 2021b). There
is a significant number of reports on the successful using of SVM in prediction problems
(Duan et al., 2005; Denton and Salleb-Aouissi, 2020; Hesami et al., 2020d; Tulpan, 2020;
Yoosefzadeh-Najafabadi et al., 2021b). Support vector regression (SVR) is known as the
regression version of SVM. There are also reports on the successful use of SVR for
addressing plant prediction problems (Awad and Khanna, 2015), which was the first
instance that SVR was used in genetic association studies.
In this study we aimed to: (1) gain a better understanding of the genetic relationships
between soybean yield and its component traits, and (2) investigate the potential use of
RF and SVM algorithms in GWAS for discovering QTL underlying soybean yield
components as compared to conventional GWAS methods of MLM and FarmCPU. The
results of this study will help soybean breeders to have a better perspective of exploiting
ML algorithms in GWAS studies, and may offer them new genomic tools for screening
190
high yielding genotypes with improved genetic gain based on genomic regions associated
with yield components.
4.3 Materials and Methods
4.3.1 Population and experimental design
An GWAS panel of 250 soybean genotypes was grown at the University of Guelph,
Ridgetown Campus in two locations, Palmyra (42°25'50.1"N 81°45'06.9"W, 195 m above
sea level) and Ridgetown (42°27'14.8"N 81°52'48.0"W, 200m above sea level) in Ontario,
Canada, in two consecutive years, 2018 and 2019. The panel used in this study consisted
of the main germplasm of the soybean breeding program at the University of Guelph,
Ridgetown Campus, that has been established over 35 years for cultivar development
and genetic studies. The randomized complete block design (RCBD) with two replications
was used for all four environments. In general, there were 500 and 1000 research plots
per environment and year, respectively. Each plot consisted of five 4.2 m long rows with
57 seeds per m2 seeding rate.
4.3.2 Phenotyping
In this experiment, soybean seed yield (t ha-1) for each plot was estimated by harvesting
three middle rows and adjusting it to 13% moisture level and maturity data. Soybean seed
yield components, including the total number of reproductive nodes per plant (RNP), the
191
total number of non-reproductive nodes per plant (NRNP), the total nodes per plant (NP),
and the total number of pods per plant (PP), were measured using 10 randomly selected
plants from each plot. The maturity was recorded as the number of days from planting to
physiological maturity (R7, (Fehr and Caviness, 1971) for each genotype.
4.3.3 Genotyping
Upon collecting young trifoliate leaf tissue for each soybean genotype from the first
replication of the trail from one plant at the Ridgetown in 2018, each sample was stored
in a 2 mL screw-cap tube. The samples were stored at -80°C after freeze-drying it for 72
hours, using the Savant ModulyoD Thermoquest (Savant Instruments, Holbrook, NY). By
using the DNA Extraction Kit (SIGMA®, Saint Louis, MO), DNA was extracted for soybean
genotypes, and the quantity of DNAs has checked via Qubit® 2.0 fluorometer (Invitrogen,
Carlsbad, CA). For genotyping-by-sequencing (GBS) of soybean genotypes, DNA
samples were sent to Plate-forme D’analyses Génomiques at Université Laval (Laval,
Quebec, Canada). The GWAS panel was genotyped via a GBS protocol based on the
enzymatic digestion with ApeKI (Kaur et al., 2017). Single-nucleotide polymorphisms
(SNPs) were called by the Fast GBS pipeline (Torkamaneh et al., 2017), along with the
Gmax_275_v2 reference genome was used during the GBS stage. Markov model was
used to impute the missing loci (34%), and a minor allele frequency less than 0.05 was
used to remove the markers below the threshold. In general, after checking the quality of
reading sequence and removing SNPs with more than 50% heterozygosity, 23 genotypes
192
were eliminated from the experiment and 17,958 high-quality SNPs from 227 soybean
genotypes used for genetic analysis.
4.3.4 Statistical analyses
The best linear unbiased prediction (BLUP) as one of the common linear mixed models
(Goldberger, 1962) was used to estimate the genetic values of each soybean genotype.
Also, R package lme4 (Bates et al., 2014) was used to analyze yield and yield
components with ‘environment’ as a fixed effect and ‘genotype’ as a random effect. To
control for the possible soil heterogeneity among the plots within a given replicate/block
and reduce the associated experimental errors, nearest-neighbor analyses (NNA) was
used as one of the common error control methods (Stroup and Mulitze, 1991; Bowley,
1999; Katsileros et al., 2015a). Outliers were determined in the raw dataset based on the
protocols proposed by Bowley (1999) and treated the same as missing data points in the
analysis. Overall, the following statistical model was used in this study:
𝑌𝑖𝑗 = 𝜇 + 𝑓(𝑠) + 𝐺𝑖 + 𝐸𝑗 + 𝐺𝐸𝑖𝑗 + 𝜀𝑖𝑗 , 𝑖 = 1, … , 𝑘; 𝑗 = 1, … , 𝑛 (Eq 5.1)
Where Yij stands for the trait of interest (soybean seed yield and yield component traits)
as a function of an intercept μ, f(s) stands for the spatial covariate, Gi is the random
genotype effect, Ej stands for the fixed environment effect, GEij is the genotype x
environment interaction effect, and εij stands for the residual effect.
193
The heritability was calculated for soybean seed yield and yield components using
lme4 open-source R package (Bates et al., 2007) based on the following equation:
𝐻2 =Ϭ𝐺
2
Ϭ𝐺2 + Ϭ𝐸
2 (Eq 5.2)
where Ϭ2G stands for the genotypic variance, and Ϭ2
E is the environmental variance.
4.3.5 Analysis of population structure
A total of 17,958 high-quality SNPs from 227 soybean genotypes were used to conduct
population structure analysis using fastSTRUCTURE (Raj et al., 2014). Five runs were
conducted for K set from 1 and 15 to estimate the most appropriate number of
subpopulations by using the K tool from the fastSTRUCTURE software.
4.3.6 Association studies
Since different GWAS methods may capture different genomic regions (Yang et al.,
2018), MLM and FarmCPU are the two most common GWAS methods compared with
the RF and SVM as the two most common machine learning algorithms that are used in
this study. MLM and FarmCPU were implemented by using GAPIT package (Lipka et al.,
2012), and RF, as well as SVM, were conducted through the Caret package (Kuhn et al.,
2020) in R software version 3.6.1. A brief description of each of the GWAS methods is
provided below:
194
Mixed Linear Model (MLM): This GWAS is based on the likelihood ratio between
the full model, consisting of the marker of interest, and the reduced model, which is known
as the model without the marker of interest (Wen et al., 2018).
Fixed and random model circulating probability unification (FarmCPU): This
GWAS takes the advantages of using MLM as the random model, and stepwise
regression as the fixed model iteratively (Liu et al., 2016). False discovery rate (FDR) is
used for setting the threshold both in the FarmCPU and MLM models (Benjamini and
Hochberg, 1995).
Random Forest (RF): This machine-learning algorithm was first implemented by
Xavier and Rainey (2020) in a soybean GWAS study. This method is known as the
powerful non-parametric regression approach that is derived from aggregating the
bootstrapping in various decision trees (Breiman, 2001). In this experiment, a 1000-set of
decision trees constructed the forest, and the GWAS analysis was done by measuring
the importance of each feature (Botta et al., 2014), which was an SNP in this study.
Support vector regression (SVR): This machine learning algorithm is known as
one of the common supervised learning methods in prediction problems (Cortes and
Vapnik, 1995). This algorithm is based on constructing a set of hyperplanes that can be
useful in regression problems (Fletcher, 2009). The association statistics in this algorithm
can be achieved by estimating the feature importance that was previously proposed by
195
Weston et al. (2001). In this experiment, SNP markers were selected as inputs, and the
traits were selected as target variables for estimating the feature importance.
4.3.7 Variable Importance measurement
As one of the common indices for tree-based algorithms, the impurity index was chosen
as the metric of the feature importance for the RF algorithm. Regarding the SVR
algorithm, the variable importance method for SVR Weston et al. (2001) was implemented
in this dataset. For both algorithms, the importance of each SNP was scaled based on 0
to 100 percent scale. Since there is no confirmed way of defining the significant threshold
in the tested algorithms, the global empirical threshold that provides the empirical
distribution of the null hypothesis (Churchill and Doerge, 1994; Doerge and Churchill,
1996) was used for establishing threshold in this study. The global empirical threshold
was estimated based on fitting the ML algorithm, storing the highest variable importance,
repeating 1000 times, and select the SNPs based on α=0.05.
4.3.8 Data-driven model processes
In order to estimate the feature importance in RF and SVR algorithms, a five-fold cross-
validation strategy (Siegmann and Jarmer, 2015) with ten repetitions was applied on the
dataset. All of the tested machine learning algorithms were optimized for their parameters
for this dataset accordingly.
196
4.3.9 Functional annotation of candidate SNPs
For each tested GWAS model, the flanking regions of each QTL was determined using
LD decay distance (Figure 5.1), and then potential candidate genes were retrieved using
the G. max cv. William 82 reference genome, gene models 2.0 in SoyBase
(https://www.soybase.org). After listing potential candidate genes in defined windows
around each significant SNP, at the peak of each QTL, Gene Ontology annotation, Go
term enrichment (https://www.soybase.org), and the report from previous studies were
used as the criteria to select and report the most relevant candidate genes associated
with the identified QTL. The Electronic Fluorescent Pictograph (eFP) browser for soybean
(www.bar.utoronto.ca) was also used to generate additional information such as tissue-
and developmental-stage dependent expression (based on transcriptomic data from
Severin et al. (2010)) about the identified candidate genes.
4.3.10 Visualization
All of the visualizations in this study were conducted using the ggplot2 package
(Wickham, 2011) in R version 3.6.1 software and Microsoft Excel software (2016).
197
4.4 Results
4.4.1 Phenotyping evaluations
The tested GWAS panel of 227 soybean genotypes showed significant variations among
the genotypes for seed yield, maturity, and yield component traits. The distribution of the
phenotypic measures for the traits across the four environments is presented in Figure
5.2. The highest heritability was observed for maturity with an estimate of 0.78 followed
by 0.34, 0.33, 0.31, and 0.30 for NP, RNP, NRNP, and PP, respectively (Figure 5.2). The
lowest heritability was estimated for yield with a value of 0.24 (Figure 5.2). Soybean seed
yield and PP showed the highest variability across the environments (Figure 5.2).
The linear correlations among all the measured traits were estimated using the
coefficients of correlation (r). Based on the results (Figure 5.3), all the traits were positively
correlated with each other, except the NRNP that was negatively associated with yield,
maturity, RNP, NP, and PP. NP showed the highest correlation with the RNP (r= 0.97)
and NRNP (r= -0.63). The trait with highest correlation with yield was found to be RNP
with the coefficient value of 0.86 (Figure 5.3).
4.4.2 Genotyping evaluations
For the tested GWAS panel, high-quality SNPs were obtained from 210M single-end Ion
Torrent reads were proceeded with Fast-GBS.v2. From a total of 40,712 SNPs, 17,958
198
SNPs were polymorphic and mapped to 20 soybean chromosomes. The minimum and
maximum number of SNPs were 403 and 1780 on chromosomes 11 and 18, respectively.
Overall, the average number of SNPs across all the 20 chromosomes was 898, with the
mean density of one SNP for every 0.12 cM across the genome.
4.4.3 Population structure and kinship
The structure profile for the tested population is presented in Figure 5.4. The result of
genotypic evaluations suggested that the tested GWAS panel was composed of four to
seven subpopulations. Therefore, we chose to conduct the structure analysis using K=7
as the appropriate K for the structure profile of the tested GWAS panel (Figure 5.4). In
order to reduce the confounding, the kinship was also estimated between genotypes of
the GWAS panel.
4.4.4 GWAS analysis
The average value for soybean maturity in the tested GWAS panel was 106 days with a
standard deviation of 5 days (Figure 5.3). Association analysis by the MLM method
identified nine associated SNP markers located on chromosomes 2 and 19 (Figure 5.5-
A). Using FarmCPU, a total of nine associated SNP markers were located on
chromosomes 2, 19, and 20 (Figure 5.5-A). By using the RF method, the total of three
SNP markers on chromosomes 3, 16, and 17 were associated with the soybean maturity,
199
whereas SVR-mediated GWAS detected 11 associated SNP markers located on
chromosomes 2, 6, 10, 16, 19, and 20 (Figure 5.5-A).
SVR-mediated GWAS detected five QTL directly related to the reproductive period
and R8 full maturity (Table 5.1). The average soybean seed yield in the GWAS panel was
3.5 t ha-1 with a standard deviation of 0.45 (Figure 5.3). Using MLM, FarmCPU, RF, and
SVR approach, we identified two, three, five, and 18 SNP markers associated with the
yield, respectively (Figure 5.5-B). The SNP markers identified by MLM and FarmCPU
were located on chromosomes 6 and 8. Using the RF-mediated GWAS method,
associated SNP markers were located on chromosomes 4, 7, 12, and 17. By using the
SVR-mediated GWAS method, the SNP markers were located on chromosomes 3, 4, 6,
7, 15, 19, and 20 (Figure 5.5-B). This method identified also eight QTL directly related to
seed yield, seed weight, and seed set (Table 5.2). However, other tested GWAS methods
could not detect any QTL directly associated with seed yield (Table 5.2).
The average NP in the tested GWAS panel was 15.21 nodes with a standard
deviation of 0.77 nodes (Figure 5.3). By using the MLM and FarmCPU methods, one and
two associated SNP markers were detected, respectively (Figure 5.6-A). Four and ten
associated SNP markers were detected by NP using RF and SVR methods, respectively.
SVR-mediated GWAS was the only method that detected three QTL directly related to
the NP (Table 5.3). The average NRNP was 3.33 nodes with a standard deviation of 0.28
nodes (Figure 5.3). A total of two, three, five, and ten associated SNP markers were
200
detected using the MLM, FarmCPU, RF, and SVR methods, respectively (Figure 5.6-B).
The detected SNP markers using the SVR method were located on chromosomes 4, 7,
18, 19, and 20, whereas SNP markers identified through RF were located on
chromosomes 1, 4, 7, 18, and 19 (Figure 5.6-B). Chromosomes number 4, 8, and 15 were
identified as carrying SNP markers with NRNP using FarmCPU. The MLM method
identified SNP markers located on chromosomes 8 and 15, which most of the detected
QTL being related to seed weight, seed protein, water use efficiency, first flower, and
soybean cyst nematode (Table 5.4).
The average RNP was 11.89 nodes with a standard deviation of 0.98 nodes (Figure
5.3). Based on the results of MLM and FarmCPU methods, four associated SNP markers
with RNP were located on chromosomes 8 and 19. Using the RF method, four associated
SNP markers were identified on chromosomes 8, 9, 15, and 20. Using the SVR method,
11 SNP markers were associated with RNP located on chromosomes 4, 7, 8, 15, 18, 19,
and 20 (Figure 5.7-A). Regardless of the type of GWAS methods used in this study, we
found SNP markers associated with the trait on chromosome 8. The position of the
associated SNP marker on chromosome 8 was identical both in SVR and RF (462.3 Kbp)
and MLM and FarmCPU (481.6 Kbp). The list of detected QTL for RNP is presented in
Table 5.5. The average value for PP in the tested GWAS panel was 45.02 pods with a
standard deviation of 8.54 pods. We did not detect any SNP marker associated with PP
using the MLM and FarmCPU methods. However, by using the RF method, four SNP
201
markers were found to be associated with PP and located on chromosomes 7, 10, 19,
and 20 (Figure 5.7-B). Twelve associated SNP markers were found by SVR that were
located on chromosomes 6, 9, 10, 11, 15, 18, and 19 (Figure 5.7-B). The GWAS of
chromosome 10 with PP were found both in RF and SVR with 4.6 cM distance far from
each other. In PP, MLM and FarmCPU did not detect any related QTL for this trait, while
SVR-mediated GWAS was identified seven QTL directly related to the pod number (Table
5.6).
In order to test the possibility of detecting same SNP markers in plant height,
maturity, and yield, plant height was analyzed using MLM, FarmCPU, RF, and SVR
methods. Using MLM and FarmCPU methods, four associated SNP markers were
detected for plant height located on chromosomes 10 and 12, while RF detected three
SNP markers located on chromosomes 2 and 16. SVR-mediated GWAS was able to
detect three SNP markers associated with plant height, which located on chromosomes
1 and 11 (Figure 5-8). None of the detect SNP markers for plant height were detect in
yield, maturity, and the tested yield component traits.
4.4.5 Identification of candidate genes within QTL
In the GWAS panel, the 150-kbp flanking regions of each SNP’s peak were considered
for identifying candidate genes within QTL (Figure 5.1). For maturity, three SNP peaks
(Chr2_695362, Chr2_720134, and Chr19_47513536) had the highest allelic effect than
202
other detected SNP peaks (Figure 5.9-A). On the basis of the gene annotation and
expression within QTL, Glyma.02g006500 (GO:0015996) and Glyma.19g224200
(GO:0010201) were identified as the strong candidate genes for maturity, which encode
chlorophyll catabolic process and phytochrome A (PHYA) related genes, respectively.
Glyma.02g006500 (GO:0015996) was exactly detected in the SNP peak position of
Chr2_695362, whereas Glyma.19g224200 (GO:0010201) was 119 Kbp far from the
detected SNP peak of Chr19_47513536. In yield, the SNP peak with the position of
Chr7_1032587 had the highest allelic effect in comparison with other detected SNP peaks
(Figure 5.9-B). Within a 77 Kbp above from the detected SNP peak (Chr7_1032587),
Glyma.07G014100 (GO:0010817) was identified, which encodes the regulation of
hormone levels, as the strongest candidate genes in yield.
For NP, two SNP peaks (Chr7_1032587 and Chr7_1092403) had the highest allelic
effect among all detected SNP peaks (Figure 5.9-C). SNP peak position of Chr7_1032587
was detected in common for yield, NP, and NRNP. Glyma.07G205500 (GO:0009693) and
Glyma.08G065300 (GO:0042546) were detected as the strongest candidate genes both
in NP and NRNP, which encode UBP1-associated protein 2C and cell wall biogenesis,
respectively. Both detected gene candidates were exactly at the associated SNP peaks
of Chr7_1032587 and S08_5005929.
Regarding SNP peaks associated with RNP, the highest allelic effects were found
in SNP peaks of Chr9_40285014 and Chr15_34958361 (Figure 5.9-E).
203
Glyma.15G214600 (GO:0009920) and Glyma.15G214700 (GO:0009910), which encode
cell plate formation involved in plant-type cell wall biogenesis and acetyl-CoA biosynthetic
process, as strong candidate genes in NRNP. Glyma.15G214600 (GO:0009920) and
Glyma.15G214700 (GO:0009910) were 127 and 90 Kbp far from the detected SNP peak
of Chr15_3495836, respectively. In PP, the highest allelic effects were found in SNP
peaks of Chr7_15331676, Chr11_5245870, and Chr18_55469601 (Figure 5.9-F).
Glyma.07G128100 (GO:0009909) was the strongest candidate genes for PP, which
encodes regulation of flower development. Glyma.07G128100 (GO:0009909) was
detected exactly in the SNP peak position of Chr7_15331676.
4.5 Discussion
One of the objectives of this study was to have a better understanding of the roles of
soybean yield component traits in the production of total seed yield and how these traits
can be used for facilitating the development of high-yielding soybeans. The genetic
dissection of soybean yield component traits in order to develop genetic and genomics
toolkits can be useful for designing breeding population and selection criteria aiming at
improving yield genetic gains in new cultivars (Cooper et al., 2009; Hu et al., 2020; Xavier
and Rainey, 2020). For this aim, a wide range of analyses, including Pearson correlation,
normality and distribution plots, GWAS both in combined and separate environments, and
functional annotation of candidate genes and QTL, were performed in this study. The
204
collective evaluation of the mentioned analysis contributes to building the wide
perspectives of the genetic architecture of the soybean yield component traits. One of the
important factors for genetic studies is to evaluate the phenotypic variation within
genotypes and environments. High phenotypic variation was observed for yield and PP,
while maturity and NP had the lowest phenotypic variation across the tested
environments. These findings are in line with the results of previous research on yield
component traits (Kahlon and Board, 2012; Xavier and Rainey, 2020). The heritability and
correlation analyses showed that NP had the highest heritability and significant linear
correlations with RNP and PP. Also, PP had the highest correlation with yield among all
the tested soybean yield components. The number of nodes and pods in soybean are
known as the two of the key soybean yield components that play an important role in
determining the final soybean seed yield (Herbert and Litchfield, 1982; Kahlon and Board,
2012; Xavier and Rainey, 2020). However, studies showed the low heritability rates for
soybean yield components, especially NP and PP (Xavier et al., 2016a; KUSWANTORO,
2017; Sulistyo and Sari, 2018; Xavier and Rainey, 2020). The nature of these traits can
explain low heritability rates as they are mostly affected by environmental factors (Price
and Schluter, 1991). Although heritability indicates the strength of the relationship
between phenotype and genetic variation of the particular trait, it does not indicate the
value of the trait for genetic study (Cassell, 2009). Different low heritable traits are highly
correlated with significant economic traits (Cassell, 2009). In soybean, yield can be
205
considered as the most important economic trait that is highly determined by yield
components. Therefore, any genetic and environmental studies around yield components
can open the possibility of overall yield improvement in major crops such as soybean.
GWAS is known as one of the most important genetic toolkits for detecting QTL
associated with quantitate traits (Kaler et al., 2020). There are several statistical methods
implemented in GWAS for improving the detection of associated SNP markers with the
trait of interest. While conventional GWAS are appropriate approaches for detecting SNP
markers with large main effects on complex traits, they are, however, underpowered for
the simultaneous consideration of a wide range of interconnected biological processes
and mechanisms that shape the phenotype of complex traits (Lee et al., 2020). Therefore,
using variable importance values in ML algorithms for identifying SNP-trait associations
may improve the power of ML-mediated GWAS for discovering variant-trait association
with higher resolution (Szymczak et al., 2009). The variable importance methods based
on linear and logistic regressions, support vector machines, and random forests are well
established in the literature (Grömping, 2009; Wu and Liu, 2009a; Williamson et al., 2020;
Yoosefzadeh-Najafabadi et al., 2021b). Among all the tested GWAS methods in this
study, SVR-mediated GWAS was the best method to detect SNP markers with high allelic
effects associated with the tested traits. The advantage of SVR-mediated GWAS over
conventional GWAS models can be explained by the presence of a nonlinear relationship
between input and output variables, which is used to build an algorithm with accurate
206
prediction ability (Kaneko, 2020). Therefore, genomic regions could be better detected by
SVR-mediated GWAS because of its ability to consider the interaction effects between
SNPs rather than p-values for individual SNP-trait GWAS tests.
None of the detected QTL by MLM, FarmCPU, and RF were reported to be
associated directly with soybean maturity. However, using SVR-mediated GWAS, five
QTL were detected on chromosomes 16 and 19 specifically related to the soybean
maturity. Those QTL were previously reported by Sonah et al. (2015) and Copley et al.
(2018) in separate studies. Also, the SNP peak position of Chr19_47513536 detected by
SVR-mediated GWAS had the highest allelic effect among all the detected SNP’s peak
in soybean maturity, which is in line with Sonah et al. (2015). For soybean seed yield, five
QTL detected by SVR-mediated GWAS were reported previously (Hu et al., 2014; Copley
et al., 2018), while none of the detected QTL from other tested GWAS methods was
previously reported for this trait. There was no previous study on the genetic structure of
NRNP and RNP, therefore, all the detected QTL in this study are presented for the first
time. For PP, conventional GWAS methods were not able to detect any associated QTL.
However, using SVR-mediated GWAS, a total of seven QTL were detected to be related
to pod numbers based on previous studies (Zhang et al., 2015a). It would be necessary
to emphasize that the average allelic effects of all detected QTL presented in Figure 5.8
was not directly estimated by the tested GWAS methods. The RF and SVR-mediated
GWAS methods do not specifically provide an allele effect therefore, the aim of this study
207
was mostly focused on detecting the associated genes and QTL underlying the soybean
yield, maturity, and yield components.
The results of candidate gene identifications within identified QTL by SVR-mediated
GWAS analyses reveled important information. For example, from all the detected genes
using SVR-mediated GWAS for maturity, candidate gene Glyma.02g006500
(GO:0015996) is a protein ABC transporter 1, that is annotated as a chlorophyll catabolic
process and located exactly in the SNP peak position Chr02_695362. ATP-binding
cassette (ABC) transporter genes play conspicuous roles in different plant growth and
developmental stages by transporting different phytochemicals across endoplasmic
reticulum (ER) membranes (Hwang et al., 2016). Because of the central roles of ABC
transporters in transporting biomolecules such as phytohormones, metabolites, and
lipids, they play important roles in plant growth and development as well as maturity
(Block and Jouhet, 2015; Hwang et al., 2016). Moreover, recent studies revealed that ER
uses fatty acid building blocks made in the chloroplast to synthesize Triacylglycerol
(TAG). Therefore, ABC transporter genes are important for the normal accumulation of
Triacylglycerol (TAG) during the seed-filling stage and maturity (Kim et al., 2013; Block
and Jouhet, 2015). Additionally, Glyma.19g224200 (GO:0010201) in E3 locus, which was
previously discovered by Buzzell (1971) and molecularly characterized as a phytochrome
A (PHYA) gene (Watanabe et al., 2009), was detected through the SVR-mediated GWAS.
Phytochromes, through PHYTOCHROME INTERACTING FACTOR (PIF), regulate the
208
expression of some specific genes encoding rate-limiting catalytic enzymes of different
plant growth regulators (e.g., abscisic acid, gibberellins, auxin) and, therefore, play crucial
roles in plant maturity (Legris et al., 2019). In addition, PHYB is inactivated after imbibition
shade signals, which repress PHYA-dependent signaling in the embryo that results in the
maturity of seeds by preventing germination (Casal, 2013; De Wit et al., 2016). This is
obtained by regulating the balance between abscisic acid and gibberellin. Subsequently,
abscisic acid transports from the endosperm to the embryo by ABC transporter (De Wit
et al., 2016).
Regarding NRNP, candidate gene Glyma.07G205500 (GO:0009693- UBP1-
associated protein 2C) that annotated as ethylene biosynthetic process was located
exactly in the SNP peak position of Chr7_37469678, was detected by SVR-mediated
GWAS. An interaction screen with the heterogeneous nuclear ribonucleoprotein (hnRNP)
results in the production of oligouridylatebinding protein 1 (UBP1)-associated protein
(Lambermon et al., 2002). It has been well documented that this protein plays important
roles in several physiological processes such as responses to abiotic stresses (Li et al.,
2002), leaf senescence(Kim et al., 2008), floral development (Streitner et al., 2008), and
chromatin modification (Liu et al., 2007). In addition, previous studies showed that the
production of productive or non-reproductive nodes is completely accompanied by the
upregulation or downregulation of this protein (Bäurle and Dean, 2008; Na et al., 2015).
In addition, Glyma.08G065300 (GO:0042546- MADS-box transcription factor) that is
209
associated with cell wall biogenesis, was located in the SNP position of Chr8_5005929.
The genes of the MADS-box family can be considered as the main regulators for cell
differentiation and organ determination (Lee et al., 2013). The floral organ recognition
MADS-box family has been categorized into A, B, C, D, and E classes. Among these
classes, class E was shown to be associated with reproductive organ development
(Hussin et al., 2021). Indeed, activation or repression of this transcription factor leads to
the development of nodes to productive or non-productive nodes (Ditta et al., 2004; Gao
et al., 2010; Liu et al., 2013).
Gene expression data provided by Severin et al. (2010) noted that 20 candidate
genes for PP that were detected using the SVR-mediated GWAS were expressed in
flowers, 1 cm pod (7 DAF), pod shell (10-13 DAF), pod shell (14-17 DAF) and seeds. In
PP, most of the genes detected by SVR-mediated GWAS are associated with auxin influx
carrier or auxin response factors (ARFs), gibberellin synthesis, and response to
brassinosteroid (Yin et al., 2018; Lin et al., 2020). Song et al. (2020) and Li et al. (2018a)
also reported that some genes related to PP were associated with embryo development,
stamen development, ovule development, cytokinin biosynthesis, and response
gibberellin that we also identified in our study. Soybean seed yield significantly depends
on seed number and seed size (Rotundo et al., 2009; Liu et al., 2010). These two factors
are determined from fertilization to seed maturity. Therefore, soybean seed development
can be divided into three stages or phases: pre-embryo or seed set, embryo growth or
210
seed growth, and desiccation stages or seed maturation phases (Weber et al., 2005;
Ruan et al., 2012). In Arabidopsis, a complex signaling pathway and regulatory networks,
including sugar and hormonal signaling, transcription factors, and metabolic pathway,
have been reported to be involved in seed development (Le et al., 2010; Orozco-Arroyo
et al., 2015). Several key genes and transcription factors (e.g., LEAFY COTYLEDON 1
(LEC1), LEC2, FUSCA3 (FUS3), AGAMOUS-LIKE15 (AGL15), ABSCISIC ACID
INSENSITIVE 3 (ABI3), YUCCA10 (YUC10), ARFs) have been determined to control
several downstream plant growth regulators pathways to the seed development (Sun et
al., 2010; Pelletier et al., 2017; Lepiniec et al., 2018). Indeed, a high ratio of abscisic acid
to gibberellic acid can regulate seed development (Wang et al., 2016; Figueiredo and
Köhler, 2018). The downregulation of FUS3 obtains this through repressing GA3ox1 and
GA3ox2 and activating ABA biosynthesis (Weber et al., 2005). In soybean, RNA seq
analysis for the seed set, embryo growth, and early maturation stages of developing
seeds in two soybeans with contrasting seed size showed cell division and growth genes,
hormone regulation, transcription factors, and metabolic pathway are involved in seed
size and numbers (Du et al., 2017).
211
Table 4.1 The list of detected QTL for soybean maturity using different GWAS methods in the tested soybean population.
GWAS
Method Chromosome Peak SNP position Detected QTL
Environment a Reference
MLM 2
2212910 Sclero 3-g31 NA (Moellers et
al., 2017)
8233782 Seed Weight 6-g1 NA (Sonah et
al., 2015)
FarmCPU
2
2212910 Sclero 3-g31 NA (Moellers et
al., 2017)
8233766 Seed Weight 6-g1 NA (Sonah et
al., 2015)
20 37765851 WUE 2-g53 NA (Kaler et
al., 2017)
RF
3 2978272
Leaflet area 1-g2.1 NA (Fang et al.,
2017)
Leaflet width 1-g4.1 NA (Fang et al.,
2017)
Leaflet area 1-g2.2 NA (Fang et al.,
2017)
Leaflet width 1-g4.2 NA (Fang et al.,
2017)
Salt tolerance 1-g12 NA (Kan et al.,
2015)
16 5730281
Plant height 6-g17 NA (Zhang et
al., 2015b)
Plant height 1-g17 NA (Zhang et
al., 2015b)
First flower 4-g63 NA (Mao et al.,
2017)
212
17 34757372 SDS root retention 1-g6 NA (Bao et al.,
2015)
SVR
2
695362
Seed linolenic 2-g1 NA (Leamy et
al., 2017)
Seed linolenic 2-g2 NA (Leamy et
al., 2017)
720134
SDS 1-g12.1 2 (Wen et al.,
2014)
SDS 1-g12.2 2 (Wen et al.,
2014)
Ureide content 1-g2 2 (Ray et al.,
2015)
827374 SDS 1-g12.3 NA (Wen et al.,
2014)
10
1595239 Shoot Cu 1-g8 NA (Dhanapal
et al., 2018)
1689395 Seed oil 5-g3 NA (Sonah et
al., 2015)
16
2438652
Reproductive period 4-
g16 NA (Zhang et
al., 2015b)
R8 full maturity 9-g2 NA (Zhang et
al., 2015b)
2460921
Reproductive period 2-
g16 NA (Zhang et
al., 2015b)
R8 full maturity 2-g2 NA (Zhang et
al., 2015b)
19
47513536 R8 full maturity 4-g1 NA (Sonah et
al., 2015)
47513572 First flower 4-g81 NA (Mao et al.,
2017) a Detected in separate environments (1: 2018Ridgetown, 2:2019Ridgetwon, 3:2018Palmyra,
4:2019Palmyra, NA: Not found in any environment).
213
MLM: Mixed Linear Model, FarmCPU: Fixed and random model circulating probability
unification, RF: Random Forest, SVR: Support Vector Regression
214
Table 4.2 The list of detected QTL for soybean yield using different GWAS methods in the tested soybean population.
GWAS Method Chromosome Peak SNP position Detected QTL Environment a
Reference
MLM 5 34391386
Ureide content 1-g16.1 NA (Ray et al.,
2015)
Ureide content 1-g16.2 NA (Ray et al.,
2015)
FarmCPU 5 34391386
Ureide content 1-g16.1 NA (Ray et al.,
2015)
Ureide content 1-g16.2 NA (Ray et al.,
2015)
RF 7 1032587 WUE 2-g18 NA (Kaler et
al., 2017)
SVR 3
36309302
First flower 4-g10 NA (Mao et
al., 2017)
First flower 3-g2 NA (Hu et al.,
2014)
Seed weight 4-g3 NA (Hu et al.,
2014)
Seed yield 4-g2 NA (Hu et al.,
2014)
R8 full maturity 3-g3 NA (Hu et al.,
2014)
37617293
Plant height 3-g17 NA (Contreras-
Soto et al.,
2017)
Leaflet shape 1-g1.1 NA (Fang et
al., 2017)
Leaflet shape 1-g1.2 NA (Fang et
al., 2017)
215
Leaflet shape 1-g1.3 NA (Fang et
al., 2017)
Seed set 1-g32.1 NA (Fang et
al., 2017)
Seed set 1-g32.2 NA (Fang et
al., 2017)
7
44488152 Seed yield 4-g4 NA (Hu et al.,
2014)
1032587 WUE 2-g18 NA (Kaler et
al., 2017)
15 34958361 SCN 5-g35 NA (Li et al.,
2016)
19 41385139
Seed weight 5-g20 NA (Zhang et
al., 2016)
Seed weight 4-g18 NA (Hu et al.,
2014)
Seed yield 4-g5 NA (Hu et al.,
2014)
Shoot Zn 1-g28.1 NA (Dhanapal
et al.,
2018)
Shoot Zn 1-g28.2 NA (Dhanapal
et al.,
2018)
Shoot Zn 1-g29.1 NA (Dhanapal
et al.,
2018)
Shoot Zn 1-g29.2 NA (Dhanapal
et al.,
2018)
Shoot Zn 1-g29.3 NA (Dhanapal
et al.,
2018)
216
a Detected in separate environments (1: 2018Ridgetown, 2:2019Ridgetwon, 3:2018Palmyra,
4:2019Palmyra).
MLM: Mixed Linear Model, FarmCPU: Fixed and random model circulating probability
unification, RF: Random Forest, SVR: Support Vector Regression
NA
217
Table 4.3 The list of detected QTL for soybean total number of nodes per plant (NP) using different GWAS methods in the tested soybean population.
GWAS
Method
Chromosom
e
Peak SNP
position
Detected QTL Environmen
t a
Referenc
e
FarmCPU 19 40131952
Pubescence density 1-
g17
NA (Chang
and
Hartman,
2017)
Seed weight 9-g5.1 NA (Copley et
al., 2018)
RF
4 1205787 Shoot Ca 1-g10 NA (Dhanapal
et al.,
2018)
6
50570624
Seed set 1-g51.1 NA (Fang et
al., 2017)
Seed set 1-g43.1 NA (Fang et
al., 2017)
Seed set 1-g25.1 NA (Fang et
al., 2017)
Seed set 1-g43.2 NA (Fang et
al., 2017)
Seed set 1-g25.2 NA (Fang et
al., 2017)
Seed set 1-g51.2 NA (Fang et
al., 2017)
50570473
Seed set 1-g43.3 NA (Fang et
al., 2017)
Seed set 1-g51.3 NA (Fang et
al., 2017)
Seed set 1-g25.3 NA (Fang et
al., 2017)
218
Pod number 1-g3 NA (Fang et
al., 2017)
Seed palmitic 2-g2 NA (Fang et
al., 2017)
Seed long-chain faty acid
1-g22 NA (Fang et
al., 2017)
SVR
6
50570624
Seed set 1-g51.1 NA (Fang et
al., 2017)
Seed set 1-g43.1 NA (Fang et
al., 2017)
Seed set 1-g25.1 NA (Fang et
al., 2017)
Seed set 1-g43.2 NA (Fang et
al., 2017)
Seed set 1-g25.2 NA (Fang et
al., 2017)
Seed set 1-g51.2 NA (Fang et
al., 2017)
50570473
Seed set 1-g43.3 NA (Fang et
al., 2017)
Seed set 1-g51.3 NA (Fang et
al., 2017)
Seed set 1-g25.3 NA (Fang et
al., 2017)
Pod number 1-g3 NA (Fang et
al., 2017)
Seed palmitic 2-g2 NA (Fang et
al., 2017)
Seed long-chain faty acid
1-g22 NA (Fang et
al., 2017)
7 1032587 WUE 2-g18 NA (Kaler et
al., 2017)
219
1092403
WUE 2-g18 NA (Kaler et
al., 2017)
First flower 3-g4 NA (Fang et
al., 2017)
18 55645699
Leaflet shape 1-g4.1 NA (Fang et
al., 2017)
Leaflet shape 1-g4.2 NA (Fang et
al., 2017)
Leaflet shape 1-g4.3 NA (Fang et
al., 2017)
Seed stearic 4-g5 NA (Li et al.,
2015)
Node number 1-g6.1 NA (Fang et
al., 2017)
Node number 1-g6.2 NA (Fang et
al., 2017)
Pod number 1-g1.1 NA (Fang et
al., 2017)
Pod number 1-g1.2 NA (Fang et
al., 2017)
Pode number 1-g1.3 NA (Fang et
al., 2017)
WUE 3-g31 NA (Kaler et
al., 2017)
Seed weight, SoyNAM
14-g28 NA (Xavier et
al., 2016b)
Lodging, SoyNAM 4-
g15 NA (Cook et
al., 2014)
Branching 1-g1.1 NA (Fang et
al., 2017)
Plant height 5-g4.2 NA (Fang et
al., 2017)
220
Plant height 5-g4.3 NA (Fang et
al., 2017)
Shoot p 1-g30
NA (Dhanapal
et al.,
2018)
19 47350110 Node number 1-g2.3 NA (Fang et
al., 2017) a Detected in separate environments (1: 2018Ridgetown, 2:2019Ridgetwon, 3:2018Palmyra,
4:2019Palmyra, NA: Not found in any environment).
MLM: Mixed Linear Model, FarmCPU: Fixed and random model circulating probability
unification, RF: Random Forest, SVR: Support Vector Regression
221
Table 4.4 The list of detected QTL for soybean total number of non-reproductive nodes per plant (NRNP) using different GWAS methods in the tested soybean population.
GWAS Method Chromosome Peak SNP position Detected QTL Environment a
Reference
MLM 15 10193796
Seed protein 6-g2 NA (Zhang et
al., 2018b)
Seed Arg 1-g4 NA (Zhang et
al., 2018b)
Seed coat luster 1-g1.3 NA (Fang et
al., 2017)
FarmCPU 15 10193796
Seed protein 6-g2 NA (Zhang et
al., 2018b)
Seed Arg 1-g4 NA (Zhang et
al., 2018b)
Seed coat luster 1-g1.3 NA (Fang et
al., 2017)
RF
1 54647498 First flower 4-g2 NA (Mao et al.,
2017)
7 329800
Phytoph 2-g32 NA (Qin et al.,
2017)
Phytoph 2-g7 NA (Qin et al.,
2017)
18 12945778 SCN 4-g14 NA (Vuong et
al., 2015)
19 40218800 Seed weight 9-g5.1 NA (Copley et
al., 2018)
SVR
7 1032587 2 WUE 2-g18 2 (Kaler et
al., 2017)
19 40218800 Seed weight 9-g5.1 NA (Copley et
al., 2018) a Detected in separate environments (1: 2018Ridgetown, 2:2019Ridgetwon, 3:2018Palmyra,
4:2019Palmyra).
222
MLM: Mixed Linear Model, FarmCPU: Fixed and random model circulating probability
unification, RF: Random Forest, SVR: Support Vector Regression
223
Table 4.5 The list of detected QTL for soybean total number of reproductive nodes per plant (RNP) using different GWAS methods in the tested soybean population.
GWAS Method Chromosome Peak SNP position Detected QTL Environment a
Reference
RF
9 40285014
Shoot Fe 1-g8.1 NA (Dhanapal
et al., 2018)
Shoot Fe 1-g8.2 NA (Dhanapal
et al., 2018)
Shoot Fe 1-g8.3 NA (Dhanapal
et al., 2018)
Shoot Fe 1-g9 NA (Dhanapal
et al., 2018)
Shoot Fe 1-g10 NA (Dhanapal
et al., 2018)
Shoot Fe 1-g11 NA (Dhanapal
et al., 2018)
Soybean mosaic virus 2-g5 NA (Che et al.,
2017)
15 34958361 SCN 5-g35 NA (Li et al.,
2016)
SVR
7 1032587 WUE 2-g18 NA (Kaler et
al., 2017)
15 34958361 1 SCN 5-g35 1 (Li et al.,
2016) a Detected in separate environments (1: 2018Ridgetown, 2:2019Ridgetwon, 3:2018Palmyra, 4:2019Palmyra).
MLM: Mixed Linear Model, FarmCPU: Fixed and random model circulating probability unification, RF: Random
Forest, SVR: Support Vector Regression
224
Table 4.6 The list of detected QTL for soybean total number of pods per plant (PP) using different GWAS methods in the tested soybean population.
GWAS
Method
Chromoso
me
Peak SNP
position Detected QTL
Environme
nt a
Referenc
e
RF
7 15331676 Seed weight, SoyNAM 14-g11 NA (Xavier
et al.,
2016b)
19 42300695
First flower 4-g77 NA (Mao et
al., 2017)
Lodging, SoyNAM 4-g17 NA (Cook et
al., 2014)
SVR
9 39366957
Pod number 1-g4.1 NA (Fang et
al., 2017)
Pod number 1-g4.2 NA (Fang et
al., 2017)
Pod number 1-g4.3 NA (Fang et
al., 2017)
Seed thickness 2-g4 NA (Fang et
al., 2017)
9 39372117
Seed Thr 2-g1 NA (Li et al.,
2018b)
Seed Ser 2-g1 NA (Li et al.,
2018b)
Seed Tyr 2-g2 NA (Li et al.,
2018b)
Seed Lys 2-g2 NA (Li et al.,
2018b)
Seed leu 2-g2 NA (Li et al.,
2018b)
Seed ile 2-g2 NA (Li et al.,
2018b)
225
Seed Ala 2-g2 NA (Li et al.,
2018b)
Seed Gly 2-g2 NA (Li et al.,
2018b)
11 5245870
Ureide content 1-g29 NA (Ray et
al., 2015)
Pod number 1-g6 NA (Fang et
al., 2017)
18
55645699 Leaflet shape 1-g4.1 NA (Fang et
al., 2017)
55469601
Leaflet shape 1-g4.2 NA (Fang et
al., 2017)
Leaflet shape 1-g4.3 NA (Fang et
al., 2017)
Seed stearic 4-g5 NA (Li et al.,
2015)
Node number 1-g6.1 NA (Fang et
al., 2017)
Node number 1-g6.2 NA (Fang et
al., 2017)
Pode number 1-g1.1 NA (Fang et
al., 2017)
Pode number 1-g1.2 NA (Fang et
al., 2017)
Pode number 1-g1.3 NA (Fang et
al., 2017)
WUE 3-g31 NA (Dhanapa
l et al.,
2015a)
Seed weight, SoyNAM 14-g28 NA (Xavier
et al.,
2016b)
226
Lodging, SoyNAM 4-g15 NA (Cook et
al., 2014)
Branching 1-g1.1 NA (Fang et
al., 2017)
Plant height 5-g4.2 NA (Fang et
al., 2017)
Plant height 5-g4.3 NA (Fang et
al., 2017)
Shoot p 1-g30 NA (Dhanapa
l et al.,
2018)
Seed yield, SoyNAM 7-g19 NA (Cook et
al., 2014)
R8 full maturity, SoyNAM 13-
g19 NA (Cook et
al., 2014)
Plant height 5-g4.3 NA (Fang et
al., 2017)
19
43077182
Seed weight 9-g5.2 NA (Copley
et al.,
2018)
Seed weight 5-g21 NA (Copley
et al.,
2018)
First flower 5-g3 NA (Fang et
al., 2017)
First flower 5-g17 NA (Fang et
al., 2017)
47235604
First flower 4-g77 NA (Mao et
al., 2017)
Seed palmitic 1-g19 NA (Priolli et
al., 2015)
227
47350110
Leaf carotenoid content 1-g14 NA (Dhanapa
l et al.,
2015b)
Ureide content 1-g50.3 NA (Ray et
al., 2015)
Ureide content 1-g50.4 NA (Ray et
al., 2015)
47224293 Node number 1-g2.3 NA (Fang et
al., 2017)
a detected in separate environments (1: 2018Ridgetown, 2:2019Ridgetwon, 3:2018Palmyra, 4:2019Palmyra).
MLM: Mixed Linear Model, FarmCPU: Fixed and random model circulating probability unification, RF: Random
Forest, SVR: Support Vector Regression
229
Figure 4.2 The distribution of seed yield (A), maturity (B), NP (C), NRNP (D), RNP (E),
and PP (F) in 227 soybean genotypes across four environments. The estimated
heritability is provided for each of the six traits. RNP: Total number of reproductive nodes
per plant, NRNP: The total number of non-reproductive nodes per plant, NP: The total
nodes per plant, PP: The total number of pods per plant.
230
Figure 4.3 The distributions and Pearson correlations among the soybean seed yield,
maturity, and yield component traits. RNP: Total number of reproductive nodes per plant,
NRNP: The total number of non-reproductive nodes per plant, NP: The total nodes per
plant, PP: The total number of pods per plant. The heat map scale for values is provided
by colour for the panel.
231
Figure 4.4 Structure and kinship plots for the 227 soybean genotypes. The x-axis is the
number of genotypes used in this GWAS panel, and the y axis is the membership of each
subgroup. G1-G7 stands for the subpopulation.
232
Figure 4.5 Genome-wide Manhattan plots for GWAS studies of A) maturity and B) seed
yield in soybean using MLM, FarmCPU, RF, and SVR methods, from top to bottom,
respectively.
233
Figure 4.6 Genome-wide Manhattan plots for GWAS studies of A) the total number of
nodes (NP) and B) the total number of non-reproductive nodes (NRNP) in soybean using
MLM, FarmCPU, RF, and SVR methods, from top to bottom, respectively.
234
Figure 4.7 Genome-wide Manhattan plots for GWAS studies of A) The total number of
reproductive nodes (RNP) and B) the total number of pods (PP) in soybean using MLM,
FarmCPU, RF, and SVR methods, from top to bottom, respectively.
235
Figure 4.8 Genome-wide Manhattan plots for GWAS studies of plant height in soybean
using MLM, FarmCPU, RF, and SVR methods, from top to bottom, respectively.
236
Figure 4.9 The average effects of reference allele and alternative allele from the detected
SNP’s peak for seed yield (A), maturity (B), NP (C), NRNP (D), RNP (E), and PP (F) in
227 soybean genotypes across four environments. RNP: Total number of reproductive
nodes per plant, NRNP: The total number of non-reproductive nodes per plant, NP: The
total nodes per plant, PP: The total number of pods per plant
237
Chapter 6: General Discussion and Future Directions
4.6 General Discussion
As the global demand for soybean is increasing significantly, there is a dire need to
accelerate and develop different breeding strategies for this strategic crop to meet the
increasing demand. One of the breeding strategies is the analytical breeding strategy,
which focuses on improving yield components such as maturity, NP, NRNP, RNP, and
PP rather than the yield per se. However, measuring yield-related traits in a large plant
breeding population is still a challenge for plant breeders. In addition, breeding for multiple
traits simultaneously requires previous knowledge about the physiological and genetic
structure of each target trait. Due to recent advances in high throughout phenotyping,
genotyping, and big data analysis methods, breeders can develop different pipelines for
selecting superior genotypes at early growth stages in a less costly manner.
One of the objectives of this thesis was to demonstrate the best soybean growth
stage for measuring hyperspectral reflectance and evaluating the three most commonly
used ML algorithms (SVM, MLP, and RF) along with introducing the E-S method in
predicting the soybean yield using reflectance variables. The soybean R5 growth stage
was identified as a better stage than R4 for measuring hyperspectral reflectance. In
addition to using 62 reflectance bands as the full variables, the RFE method was used to
reduce the dimensionality of the data. Therefore, 21 most important bands were selected
238
as reflectance variables. Using both full and selected reflectance variables, RF
outperformed all individual algorithms. Therefore, RF was selected as the metaClassifier
for E–S. E–S had the highest prediction accuracy as one of the ensemble approaches
compared to an individual ML algorithm. Therefore, E–S was recommended as a reliable
and appropriate ML algorithm for predicting the soybean yield using reflectance variables.
This study provides an applicable pipeline for using hyperspectral reflectance data and
suitable ML algorithms for the development of high-yielding soybeans, which can be used
in large soybean breeding programs for selecting high-yielding soybeans at pre-
harvesting stages. The developed methodology in this thesis can open a new window in
using spectral reflectance for selecting high-yielding genotypes in different crops.
Having reliable estimations of the optimum values for highly heritable secondary traits
such as HVI that are strongly associated with yield and FBIO can guide breeders to
construct efficient breeding programs focused on designing and selecting “ideotype”
genotypes for their target traits. In this thesis, we demonstrated the significant correlations
of several HVI, including the red region, especially the 670 nm and 800 nm wavelengths
in the NIR region, with soybean seed yield and FBIO production. We also developed EB
and DNN algorithms as reliable tools for predicting the soybean seed yield and FBIO from
HVI data collected at early growth stages. Furthermore, for the first time, we linked DNN
to the SPEA2 to estimate the optimum values of HVI in soybean genotypes with
maximized yield and FBIO. These optimum values of HVI are suggested to be validated
239
in new and independent breeding populations before being used by soybean breeders in
their cultivar development programs for selecting high-yielding genotypes with high FBIO
potential.
Efficient breeding approaches for improving the genetic potential of complex traits
such as yield in soybean can be developed based on secondary traits that govern the
final performance of the complex trait. Thus, in order to increase the genetic yield potential
in soybean, breeders can select soybean genotypes with improved yield component
traits. However, measuring yield components in a large soybean breeding program is
time-consuming and labor-intensive, which limits the implication of this approach in
cultivar development programs. Another objective of this study was to evaluate the
potential use of yield component traits for estimating final seed yield in soybean using
different ML and E-B algorithms. The results showed that RBF is a reliable solo ML
algorithm for predicting the soybean seed yield from its component traits. However, an E-
B algorithm that was built by combining the three ML and using RBF as its metaClassifier
outperformed all individual ML algorithms, therefore, it is recommended for predicting the
soybean seed yield exploiting yield component traits. In this study, for the first time, we
coupled E-B algorithm with GA in order to estimate optimum values of yield component
traits in a theoretical genotype in which the yield is maximized using the real field data.
The results are promising but it is recommended that they be evaluated in new and
independent breeding populations before their routine use in cultivar development
240
programs for selecting high-yielding potential genotypes. This information can also be
used, in combination with molecular marker technology, for developing genomic-based
toolkits that can be used for selecting genotypes with improved genetic yield potential at
early generations.
A better understanding of the genetic architecture of the yield component traits in
soybean may enable breeders to establish more efficient selection strategies for
developing high-yielding cultivars with improved genetic gains. Major yield components
such as maturity, NP, NRNP, RNP, and PP play important roles in determining the overall
yield production in soybean. This study verified the importance of those traits, using
correlation and distribution analyses, in determination of the total soybean seed yield.
Furthermore, by testing different conventional and ML-mediated GWAS methods, SVR-
mediated GWAS outperformed all the other methods.
Therefore, SVR-mediated GWAS is recommended as an alternative to conventional
GWAS methods because it exhibited the greatest power for detecting genomic regions
associated with complex traits such as yield and its components in soybean, and possibly
other crop species. To the best of our knowledge, this study is the first attempt in which
SVR was used for GWAS analyses in plants. In order to verify the causal relationship
between identified QTL and the target phenotypic traits, we identified candidate genes
within each QTL using gene annotation procedures and information. The results
demonstrated the efficiency of SVR-mediated GWAS in detecting the strong QTL that can
241
be used in marker-assisted selection. Further investigation is recommended to confirm
the efficiency of SVR-mediated GWAS in detecting associated genomic regions in other
plant species.
4.7 Future Directions
Breeding for complex traits requires different environments to validate and test the
stability of the complex trait. In this study, soybean yield, hyperspectral reflectance, and
yield components were measured in four environments consisting of two locations grown
over two years. It would be valuable to test the developed pipelines in more environments
to test the stability of the results across environments.
Each ML algorithm has its own function and parameters to be optimized based on
the given problem. However, some algorithms may respond well to specific problems. For
instance, we demonstrated the efficiency of ensemble bagging and stacking in predicting
soybean seed yield using hyperspectral reflectance and yield components, respectively.
However, the use of all the tested ML algorithms is new in the plant breeding area, and it
would be valuable to have a deep understanding of the use of other ML algorithms to
predict complex traits using secondary related traits.
Breeding for multiple traits concurrently would be a challenge in a breeding program
since each trait has its limitation and interaction with other traits. Therefore, understanding
242
the optimum value of each trait in a given population would be valuable to set up a
breeding strategy efficiently. Optimization algorithms would be helpful to estimate the
optimum value of each trait for maximizing the complex trait. However, each optimization
algorithm has its merits and demerits. This thesis evaluates the two most well-known
optimization algorithms, each from a separate optimization group: GA, as single-objective
optimization algorithms, and SPEA2, as a multi-objective evolutionary optimization
algorithm. Although both tested optimization algorithms had an acceptable performance
in finding solutions for a given soybean population, it would be valuable to conduct an
experiment and test different optimization algorithms and provide comprehensive
optimization pipelines for plant breeders.
One of the aims of this study was to estimate the optimum value of each yield
component and HVI in soybeans with maximized seed yield and FBIO production. This
result can be used for establishing a recurrent selection program in which the yield
component traits are gradually accumulated and optimized for sustainable improvement
of seed yield and its genetic gain.
In order to detect genomic regions associated with traits of interest, four GWAS
methods were used in this study. Also, in order to verify the associated genomic regions,
candidate genes and QTL were estimated using different databases. Although associated
genomic regions were detected using the tested GWAS methods, it would be more
243
valuable to conduct gene expression experiments and validate findings not only based
on DNA variants but also through gene expression analysis methods.
Markers associated with some of the detected QTL might be useful for MAS
program. We recommend evaluating and validating the detected QTL in this study in
larger and different breeding populations.
244
References
Aanonsen, S.I., Nævdal, G., Oliver, D.S., Reynolds, A.C., and Vallès, B. (2009). The ensemble Kalman filter in reservoir engineering--a review. Spe Journal 14(03), 393-412.
Acevedo, E., and Aleppo, S. (1991). Improvement of winter cereal crops in Mediterranean environments: Use of yield, morphological and physiological traits. Breeding for drought resistance in wheat 12, 188.
Acharjee, A., Prentice, P., Acerini, C., Smith, J., Hughes, I.A., Ong, K., et al. (2017). The translation of lipid profiles to nutritional biomarkers in the study of infant metabolism. Metabolomics 13(3), 25.
Adee, E. (2015). Effects of seed treatment on sudden death syndrome symptoms and soybean yield. Kansas Agricultural Experiment Station Research Reports 1(2), 3.
Aghighi, H., Azadbakht, M., Ashourloo, D., Shahrabi, H.S., and Radiom, S. (2018). Machine learning regression techniques for the silage maize yield prediction using time-series images of landsat 8 OLI. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11(12), 4563-4577.
Al-Abassi, A., Karimipour, H., HaddadPajouh, H., Dehghantanha, A., and Parizi, R.M. (2020). "Industrial big data analytics: challenges and opportunities," in Handbook of Big Data Privacy. Springer), 37-61.
Alam, M., and Mukhopadhyay, D. (Year). "How secure are deep learning algorithms from side-channel based reverse engineering?", in: Proceedings of the 56th Annual Design Automation Conference 2019), 1-2.
Albetis, J., Duthoit, S., Guttler, F., Jacquin, A., Goulard, M., Poilvé, H., et al. (2017). Detection of Flavescence dorée grapevine disease using unmanned aerial vehicle (UAV) multispectral imagery. Remote Sensing 9(4), 308.
Alekcevetch, J.C., de Lima Passianotto, A.L., Ferreira, E.G.C., Dos Santos, A.B., da Silva, D.C.G., Dias, W.P., et al. (2021). Genome-wide association study for resistance to the Meloidogyne javanica causing root-knot nematode in soybean. Theoretical and Applied Genetics, 1-16.
Alemu, A., Feyissa, T., Maccaferri, M., Sciara, G., Tuberosa, R., Ammar, K., et al. (2021). Genome-wide association analysis unveils novel QTLs for seminal root system architecture traits in Ethiopian durum wheat. BMC genomics 22(1), 1-16.
Alexandratos, N., and Bruinsma, J. (2012). World agriculture towards 2030/2050: the 2012 revision.
Ali, I., Cawkwell, F., Green, S., and Dwyer, N. (Year). "Application of statistical and machine learning models for grassland yield estimation based on a hypertemporal satellite remote sensing time series", in: 2014 IEEE Geoscience and Remote Sensing Symposium: IEEE), 5060-5063.
245
Alirezanejad, M., Enayatifar, R., Motameni, H., and Nematzadeh, H. (2020). Heuristic filter feature selection methods for medical datasets. Genomics 112(2), 1173-1181.
Alpaydin, E. (2020). Introduction to machine learning. MIT press. Alsahaf, A., Azzopardi, G., Ducro, B., Hanenberg, E., Veerkamp, R.F., and Petkov, N.
(2018). Prediction of slaughter age in pigs and assessment of the predictive value of phenotypic and genetic information using random forest. Journal of animal science 96(12), 4935-4943.
Alves, M.L., Bento-Silva, A., Gaspar, D., Paulo, M., Brites, C., Mendes-Moreira, P., et al. (2020). Volatilome–Genome-Wide Association Study on Wholemeal Maize Flour. Journal of Agricultural and Food Chemistry 68(29), 7809-7818.
Aparicio, N., Villegas, D., Araus, J., Casadesus, J., and Royo, C. (2002). Relationship between growth traits and spectral vegetation indices in durum wheat. Crop science 42(5), 1547-1555.
Aparicio, N., Villegas, D., Casadesus, J., Araus, J.L., and Royo, C. (2000). Spectral vegetation indices as nondestructive tools for determining durum wheat yield. Agronomy Journal 92(1), 83-91.
Arab, M.M., Yadollahi, A., Eftekhari, M., Ahmadi, H., Akbari, M., and Khorami, S.S. (2018). Modeling and Optimizing a New Culture Medium for In Vitro Rooting of G× N15 Prunus Rootstock using Artificial Neural Network-Genetic Algorithm. Scientific reports 8(1), 9977.
Araghinejad, S. (2013). Data-driven modeling: using MATLAB® in water resources and environmental engineering. Springer Science & Business Media.
Araus, J., Casadesus, J., and Bort, J. (2001). "Recent tools for the screening of physiological traits determining yield. Application of Physiology in Wheat Breeding".).
Araus, J., Slafer, G., Reynolds, M., and Royo, C. (2002). Plant breeding and drought in C3 cereals: what should we breed for? Annals of botany 89(7), 925-940.
Araus, J.L., and Cairns, J.E. (2014). Field high-throughput phenotyping: the new crop breeding frontier. Trends in plant science 19(1), 52-61.
Araus, J.L., Kefauver, S.C., Zaman-Allah, M., Olsen, M.S., and Cairns, J.E. (2018). Translating high-throughput phenotyping into genetic gain. Trends in plant science 23(5), 451-466.
Araya, D.B., Grolinger, K., ElYamany, H.F., Capretz, M.A., and Bitsuamlak, G. (2017). An ensemble learning framework for anomaly detection in building energy consumption. Energy and Buildings 144, 191-206.
Arshadi, N., Chang, B., and Kustra, R. (Year). "Predictive modeling in case-control single-nucleotide polymorphism studies in the presence of population stratification: a case study using Genetic Analysis Workshop 16 Problem 1 dataset", in: BMC proceedings: Springer), 1-5.
246
Asif, H., Alliey-Rodriguez, N., Keedy, S., Tamminga, C.A., Sweeney, J.A., Pearlson, G., et al. (2020). GWAS significance thresholds for deep phenotyping studies can depend upon minor allele frequencies and sample size. Molecular psychiatry, 1-8.
Auria, L., and Moro, R.A. (2008). Support vector machines (SVM) as a technique for solvency analysis.
Awad, M., and Khanna, R. (2015). "Support vector regression," in Efficient learning machines. Springer), 67-80.
Babar, M., Reynolds, M., Van Ginkel, M., Klatt, A., Raun, W., and Stone, M. (2006). Spectral reflectance to estimate genetic variation for in‐season biomass, leaf chlorophyll, and canopy temperature in wheat. Crop science 46(3), 1046-1057.
Bajwa, S.G., Rupe, J.C., and Mason, J. (2017). Soybean disease monitoring with leaf reflectance. Remote Sensing 9(2), 127.
Bala, S., and Islam, A. (2009). Correlation between potato yield and MODIS‐derived vegetation indices. International Journal of Remote Sensing 30(10), 2491-2507.
Banerjee, S., Zeng, L., Schunkert, H., and Söding, J. (2018). Bayesian multiple logistic regression for case-control GWAS. PLoS genetics 14(12), e1007856.
Bao, Y., Kurle, J.E., Anderson, G., and Young, N.D. (2015). Association mapping and genomic prediction for resistance to sudden death syndrome in early maturing soybean germplasm. Molecular Breeding 35(6), 1-14.
Bao, Y., Vuong, T., Meinhardt, C., Tiffin, P., Denny, R., Chen, S., et al. (2014). Potential of association mapping and genomic selection to explore PI 88788 derived soybean cyst nematode resistance. The Plant Genome 7(3).
Bassett, A., Kamfwa, K., Ambachew, D., and Cichy, K. (2021). Genetic variability and genome-wide association analysis of flavor and texture in cooked beans (Phaseolus vulgaris L.). Theoretical and Applied Genetics, 1-20.
Bastidas, A., Setiyono, T., Dobermann, A., Cassman, K.G., Elmore, R.W., Graef, G.L., et al. (2008). Soybean sowing date: The vegetative, reproductive, and agronomic impacts. Crop Science 48(2), 727-740.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.
Bates, D., Sarkar, D., Bates, M.D., and Matrix, L. (2007). The lme4 package. R package version 2(1), 74.
Bäurle, I., and Dean, C. (2008). Differential interactions of the autonomous pathway RRM proteins and chromatin regulators in the silencing of Arabidopsis targets. PLoS One 3(7), e2733.
Bawazeer, S.A., Baakeem, S.S., and Mohamad, A.A. (2021). New Approach for Radial Basis Function Based on Partition of Unity of Taylor Series Expansion with Respect to Shape Parameter. Algorithms 14(1), 1.
Belayneh, A., Adamowski, J., Khalil, B., and Ozga-Zielinski, B. (2014). Long-term SPI drought forecasting in the Awash River Basin in Ethiopia using wavelet neural
247
network and wavelet support vector regression models. Journal of Hydrology 508, 418-429.
Bendig, J., Bolten, A., Bennertz, S., Broscheit, J., Eichfuss, S., and Bareth, G. (2014). Estimating biomass of barley using crop surface models (CSMs) derived from UAV-based RGB imaging. Remote Sensing 6(11), 10395-10412.
Benedetti, P., Ienco, D., Gaetano, R., Ose, K., Pensa, R.G., and Dupuy, S. (2018). $ M^ 3\text {Fusion} $: A Deep Learning Architecture for Multiscale Multimodal Multitemporal Satellite Data Fusion. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11(12), 4939-4949.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57(1), 289-300.
Bhojani, S.H., and Bhatt, N. (2020). Wheat crop yield prediction using new activation functions in neural network. Neural Computing and Applications 32, 13941–13951.
Bisong, E. (2019). "The Multilayer Perceptron (MLP)," in Building Machine Learning and Deep Learning Models on Google Cloud Platform. (Berkeley, CA, USA: Springer), 401-405.
Björne, J., Kaewphan, S., and Salakoski, T. (Year). "UTurku: drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge", in: Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)), 651-659.
Blackburn, G.A. (2007). Hyperspectral remote sensing of plant pigments. Journal of experimental botany 58(4), 855-867.
Blaikie, P., and Brookfield, H. (2015). Land degradation and society. Routledge. Block, M.A., and Jouhet, J. (2015). Lipid trafficking at endoplasmic reticulum–chloroplast
membrane contact sites. Current opinion in cell biology 35, 21-29. Board, J., Kang, M., and Bodrero, M. (2003). Yield components as indirect selection
criteria for late-planted soybean cultivars. Agronomy journal 95(2), 420-429. Board, J., Kumudini, S., Omielan, J., Prior, E., and Kahlon, C. (2010). Yield response of
soybean to partial and total defoliation during the seed-filling period. Crop science 50(2), 703-712.
Boerma, H. (1979). Comparison of Past and Recently Developed Soybean Cultivars in Maturity Groups VI, VII, and VIII1. Crop Science 19(5), 611-613.
Botta, V., Louppe, G., Geurts, P., and Wehenkel, L. (2014). Exploiting SNP correlations within random forest for genome-wide association studies. PloS one 9(4), e93379.
Bowley, S. (1999). A hitchhiker's guide to statistics in plant biology. Guelph, Ont.: Any Old Subject Books.
Bozorg-Haddad, O., Azarnivand, A., Hosseini-Moghari, S.-M., and Loáiciga, H.A. (2016). Development of a comparative multiple criteria framework for ranking pareto
248
optimal solutions of a multiobjective reservoir operation problem. Journal of Irrigation Drainage Engineering 142(7), 04016019.
Breiman, L. (1996). Bagging predictors. Machine learning 24(2), 123-140. Breiman, L. (2001). Random forests. Machine learning 45(1), 5-32. Brown-Guedira, G., Thompson, J., Nelson, R., and Warburton, M. (2000). Evaluation of
Genetic Diversity of Soybean Introductions and North American Ancestors Using RAPD and SSR Markers Joint contribution of the USDA-ARS, and Illinois Agric. Exp. Stn. Crop Science 40(3), 815-823.
Bush, W.S., and Moore, J.H. (2012). Genome-wide association studies. PLoS Comput Biol 8(12), e1002822.
Butler, K.T., Davies, D.W., Cartwright, H., Isayev, O., and Walsh, A. (2018). Machine learning for molecular and materials science. Nature 559(7715), 547-555.
Buzzell, R. (1971). Inheritance of a soybean flowering response to fluorescent-daylength conditions. Canadian Journal of Genetics and Cytology 13(4), 703-707.
CABRERA‐BOSQUET, L., Albrizio, R., Nogues, S., and Araus, J.L. (2011). Dual Δ13C/δ18O response to water and nitrogen availability and its relationship with yield in field‐grown durum wheat. Plant, Cell & Environment 34(3), 418-433.
Cabrera‐Bosquet, L., Crossa, J., von Zitzewitz, J., Serret, M.D., and Luis Araus, J. (2012).
High‐throughput phenotyping and genomic selection: The frontiers of crop breeding converge. Journal of integrative plant biology 54(5), 312-320.
Cacuci, D.G., Ionescu-Bujor, M., and Navon, I.M. (2005). Sensitivity and uncertainty analysis, volume II: applications to large-scale systems. CRC press.
Cai, G., Yang, Q., Chen, H., Yang, Q., Zhang, C., Fan, C., et al. (2016). Genetic dissection of plant architecture and yield-related traits in Brassica napus. Scientific reports 6, 21625.
Cao, S., Xu, D., Hanif, M., Xia, X., and He, Z. (2020). Genetic architecture underpinning yield component traits in wheat. Theoretical and Applied Genetics, 1-13.
Carvalho, S., Macel, M., Schlerf, M., Moghaddam, F.E., Mulder, P.P., Skidmore, A.K., et al. (2013). Changes in plant defense chemistry (pyrrolizidine alkaloids) revealed through high-resolution spectroscopy. ISPRS journal of photogrammetry remote sensing 80, 51-60.
Casal, J.J. (2013). Photoreceptor signaling networks in plant responses to shade. Annual review of plant biology 64, 403-427.
Cassell, B.G. (2009). Using heritability for genetic improvement. Chandra, A.L., Desai, S.V., Guo, W., and Balasubramanian, V.N. (2020). Computer vision
with deep learning for plant phenotyping in agriculture: A survey. arXiv preprint arXiv:2006.11391.
Chang, H.-X., and Hartman, G.L. (2017). Characterization of insect resistance loci in the USDA soybean germplasm collection using genome-wide association studies. Frontiers in plant science 8, 670.
249
Chang, W., and Wickham, H. (2016). ggvis: Interactive Grammar of Graphics. R package version0 4.
Chappelle, E.W., Kim, M.S., and McMurtrey III, J.E. (1992). Ratio analysis of reflectance spectra (RARS): an algorithm for the remote estimation of the concentrations of chlorophyll a, chlorophyll b, and carotenoids in soybean leaves. Remote sensing of environment 39(3), 239-247.
Che, Z., Liu, H., Yi, F., Cheng, H., Yang, Y., Wang, L., et al. (2017). Genome-Wide Association Study Reveals Novel Loci for SC7 Resistance in a Soybean Mutant Panel. Frontiers in Plant Science 8(1771). doi: 10.3389/fpls.2017.01771.
Chen, J.-C., and Wang, Y.-M. (2020). Comparing Activation Functions in Modeling Shoreline Variation Using Multilayer Perceptron Neural Network. Water 12(5), 1281.
Chen, J.H., and Verghese, A. (2020). Planning for the Known Unknown: Machine Learning for Human Healthcare Systems. The American Journal of Bioethics 20(11), 1-3.
Chen, L., Huang, J., Wang, F., and Tang, Y. (2007). Comparison between back propagation neural network and regression models for the estimation of pigment content in rice leaves and panicles using hyperspectral data. International Journal of Remote Sensing 28(16), 3457-3478.
Chen, P., Haboudane, D., Tremblay, N., Wang, J., Vigneault, P., and Li, B. (2010). New spectral indicator assessing the efficiency of crop nitrogen treatment in corn and wheat. Remote Sensing of Environment 114(9), 1987-1997.
Chen, X.-w., and Jeong, J.C. (Year). "Enhanced recursive feature elimination", in: Sixth International Conference on Machine Learning and Applications (ICMLA 2007): IEEE), 429-435.
Chenou, J., Hsieh, G., and Fields, T. (Year). "Radial Basis Function Network: Its Robustness and Ability to Mitigate Adversarial Examples", in: 2019 International Conference on Computational Science and Computational Intelligence (CSCI): IEEE), 102-106.
Chlingaryan, A., Sukkarieh, S., and Whelan, B. (2018). Machine learning approaches for crop yield prediction and nitrogen status estimation in precision agriculture: A review. Computers and electronics in agriculture 151, 61-69.
Chouhan, S.S., Kaul, A., Singh, U.P., and Jain, S. (2018). Bacterial foraging optimization based radial basis function neural network (BRBFNN) for identification and classification of plant leaf diseases: An automatic approach towards plant pathology. IEEE Access 6, 8852-8863.
Christenson, B.S., Schapaugh, W.T., An, N., Price, K.P., Prasad, V., and Fritz, A.K. (2016). Predicting Soybean Relative Maturity and Seed Yield Using Canopy Reflectance. Crop Science 56(2), 625-643.
250
Chun, H., and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(1), 3-25.
Churchill, G.A., and Doerge, R.W. (1994). Empirical threshold values for quantitative trait mapping. Genetics 138(3), 963-971.
Ciampitti, I.A., and Vyn, T.J. (2012). Physiological perspectives of changes over time in maize yield dependency on nitrogen uptake and associated nitrogen efficiencies: A review. Field Crops Research 133, 48-67.
Colletti, A., Attrovio, A., Boffa, L., Mantegna, S., and Cravotto, G. (2020). Valorisation of By-Products from Soybean (Glycine max (L.) Merr.) Processing. Molecules 25(9), 2129.
Collins, N.C., Tardieu, F., and Tuberosa, R. (2008). Quantitative trait loci and crop performance under abiotic stress: where do we stand? Plant physiology 147(2), 469-486.
Connor, D.J., Loomis, R.S., and Cassman, K.G. (2011). Crop ecology: productivity and management in agricultural systems. Cambridge University Press.
Contreras-Soto, R.I., Mora, F., de Oliveira, M.A.R., Higashi, W., Scapim, C.A., and Schuster, I. (2017). A genome-wide association study for agronomic traits in soybean using SNP markers and SNP-based haplotype analysis. PloS one 12(2), e0171105.
Cook, D.E., Bayless, A.M., Wang, K., Guo, X., Song, Q., Jiang, J., et al. (2014). Distinct copy number, coding sequence, and locus methylation patterns underlie Rhg1-mediated soybean resistance to soybean cyst nematode. Plant physiology 165(2), 630-647.
Cooper, M., van Eeuwijk, F.A., Hammer, G.L., Podlich, D.W., and Messina, C. (2009). Modeling QTL for complex traits: detection and context for plant breeding. Current opinion in plant biology 12(2), 231-240.
Copley, T.R., Duceppe, M.-O., and O’Donoughue, L.S. (2018). Identification of novel loci associated with maturity and yield traits in early maturity soybean plant introduction lines. BMC genomics 19(1), 1-12.
Correia, D.L., Bouachir, W., Gervais, D., Pureswaran, D., Kneeshaw, D.D., and De Grandpré, L. (2020). Leveraging artificial intelligence for large-scale plant phenology studies from noisy time-lapse images. IEEE Access 8, 13151-13160.
Cortes, C., and Vapnik, V. (1995). Support vector machine. Machine learning 20(3), 273-297.
Crane-Droesch, A. (2018). Machine learning methods for crop yield prediction and climate change impact assessment in agriculture. Environmental Research Letters 13(11), 114003.
Crozier, A., Jaganath, I.B., and Clifford, M.N. (2009). Dietary phenolics: chemistry, bioavailability and effects on health. Natural product reports 26(8), 1001-1043.
251
Cui, S., and Yu, D. (2005). Estimates of relative contribution of biomass, harvest index and yield components to soybean yield improvements in China. Plant Breeding 124(5), 473-476.
Dababat, A., Arif, M.A.R., Toktay, H., Atiya, O., Shokat, S., Gul, E., et al. (2021). A GWAS to identify the cereal cyst nematode (Heterodera filipjevi) resistance loci in diverse wheat prebreeding lines. Journal of Applied Genetics, 1-6.
Dao, D.-N., and Guo, L.-X. (2020a). New hybrid between SPEA/R with deep neural network: application to predicting the multi-objective optimization of the stiffness parameter for powertrain mount systems. Journal of Low Frequency Noise, Vibration and Active Control 39(4), 53–68.
Dao, D.-N., and Guo, L.-X. (2020b). New hybrid SPEA/R-deep learning to predict optimization parameters of cascade FOPID controller according engine speed in powertrain mount system control of half-car dynamic model. Journal of Intelligent & Fuzzy Systems 39(Preprint), 53–68.
Dasgupta, K., Mandal, B., Dutta, P., Mandal, J.K., and Dam, S. (2013). A genetic algorithm (ga) based load balancing strategy for cloud computing. Procedia Technology 10, 340-347.
Datt, B. (1999). A new reflectance index for remote sensing of chlorophyll content in higher plants: tests using Eucalyptus leaves. Journal of Plant Physiology 154(1), 30-36.
De'ath, G., and Fabricius, K.E. (2000). Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81(11), 3178-3192.
De Bruin, J.L., and Pedersen, P. (2009). Growth, Yield, and Yield Component Changes among Old and New Soybean Cultivars All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Agronomy journal 101(1), 124-130.
De Castro, A.-I., Jurado-Expósito, M., Gómez-Casero, M.-T., and López-Granados, F. (2012). Applying neural networks to hyperspectral and multispectral field data for discrimination of cruciferous weeds in winter crops. The Scientific World Journal 2012.
De Castro, A.I., Torres-Sánchez, J., Peña, J.M., Jiménez-Brenes, F.M., Csillik, O., and López-Granados, F. (2018). An automatic random forest-OBIA algorithm for early weed mapping between and within crop rows using UAV imagery. Remote Sensing 10(2), 285.
De Pretto, C., Giordano, R.d.L.C., Tardioli, P.W., and Costa, C.B.B. (2018). Possibilities for producing energy, fuels, and chemicals from soybean: a biorefinery concept. Waste and Biomass Valorization 9(10), 1703-1730.
252
De Wit, M., Galvão, V.C., and Fankhauser, C. (2016). Light-mediated hormonal regulation of plant growth and development. Annual review of plant biology 67, 513-537.
Deering, D.W. (1978). Rangeland reflectance characteristics measured by aircraft and spacecraft sensors. Ph. D Dissertation. Texas A&M Universtiy.
Dennis, T. (2016). Using R and ggvis to Create Interactive Graphics for Exploratory Data Analysis. Data Visualization: A Guide to Visual Storytelling for Libraries, 149.
Denton, S.M., and Salleb-Aouissi, A. (2020). A Weighted Solution to SVM Actionability and Interpretability. arXiv preprint arXiv:2012.03372.
Deore, B., Kyatham, A., and Narkhede, S. (Year). "A novel approach to ensemble MLP and random forest for network security", in: ITM Web of Conferences: EDP Sciences), 03003.
Devlin, B., and Roeder, K. (1999). Genomic control for association studies. Biometrics 55(4), 997-1004.
Dhanapal, A.P., Ray, J.D., Singh, S.K., Hoyos-Villegas, V., Smith, J.R., Purcell, L.C., et al. (2015a). Genome-wide association study (GWAS) of carbon isotope ratio (δ 13 C) in diverse soybean [Glycine max (L.) Merr.] genotypes. Theoretical and Applied Genetics 128(1), 73-91.
Dhanapal, A.P., Ray, J.D., Singh, S.K., Hoyos-Villegas, V., Smith, J.R., Purcell, L.C., et al. (2015b). Association mapping of total carotenoids in diverse soybean genotypes based on leaf extracts and high-throughput canopy spectral reflectance measurements. PloS one 10(9), e0137213.
Dhanapal, A.P., Ray, J.D., Smith, J.R., Purcell, L.C., and Fritschi, F.B. (2018). Identification of Novel Genomic Loci Associated with Soybean Shoot Tissue Macro and Micronutrient Concentrations. The plant genome 11(2), 170066.
Di Leo, G., and Sardanelli, F. (2020). Statistical significance: p value, 0.05 threshold, and applications to radiomics—reasons for a conservative approach. European radiology experimental 4(1), 1-8.
Dias, R., and Torkamani, A. (2019). Artificial intelligence in clinical and genomic diagnostics. Genome medicine 11(1), 1-12.
Dietterich, T.G. (Year). "Ensemble methods in machine learning", in: International workshop on multiple classifier systems: Springer), 1-15.
Ding, K., and Kullo, I.J. (2007). Methods for the selection of tagging SNPs: a comparison of tagging efficiency and performance. European Journal of Human Genetics 15(2), 228-236.
Ditta, G., Pinyopich, A., Robles, P., Pelaz, S., and Yanofsky, M.F. (2004). The SEP4 gene of Arabidopsis thaliana functions in floral organ and meristem identity. Current biology 14(21), 1935-1940.
Divina, F., Gilson, A., Goméz-Vela, F., García Torres, M., and Torres, J.F. (2018). Stacking ensemble learning for short-term electricity consumption forecasting. Energies 11(4), 949.
253
Doerge, R.W., and Churchill, G.A. (1996). Permutation tests for multiple loci affecting a quantitative character. Genetics 142(1), 285-294.
Drucker, H., and Cortes, C. (Year). "Boosting decision trees", in: Advances in neural information processing systems), 479-485.
Du, J., Wang, S., He, C., Zhou, B., Ruan, Y.-L., and Shou, H. (2017). Identification of regulatory networks and hub genes controlling soybean seed set and size using RNA sequencing analysis. Journal of Experimental Botany 68(8), 1955-1972.
Du, J., and Xu, Y. (2017). Hierarchical deep neural network for multivariate regression. Pattern Recognition 63, 149-157.
Du, Y., Zhao, Q., Li, S., Yao, X., Xie, F., and Zhao, M. (2019). Shoot/root interactions affect soybean photosynthetic traits and yield formation: a case study of grafting with record-yield cultivars. Frontiers in plant science 10, 445.
Duan, K.-B., Rajapakse, J.C., Wang, H., and Azuaje, F. (2005). Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE transactions on nanobioscience 4(3), 228-234.
Duan, T., Zheng, B., Guo, W., Ninomiya, S., Guo, Y., and Chapman, S.C. (2017). Comparison of ground cover estimates from experiment plots in cotton, sorghum and sugarcane based on images and ortho-mosaics captured by UAV. Functional Plant Biology 44(1), 169-183.
Dubey, A., Kumar, A., Abd_Allah, E.F., Hashem, A., and Khan, M.L. (2019). Growing more with less: Breeding and developing drought resilient soybean to improve food security. Ecological Indicators 105, 425-437. doi: https://doi.org/10.1016/j.ecolind.2018.03.003.
Dumschott, K., Schmidt, M.H.-W., Chawla, H.S., Snowdon, R., and Usadel, B. (2020). Oxford Nanopore Sequencing: New opportunities for plant genomics? Journal of Experimental Botany.
Dutamo, D., Alamerew, S., Eticha, F., and Assefa, E. (2015). Genetic variability in bread wheat (Triticum aestivum L.) germplasm for yield and yield component traits. Journal of Biology, Agriculture and Healthcare 5(17), 140-147.
Eck, H.V. (1987). Plant water stress at various growth stages and growth and yield of soybeans. Field crops research v. 17(no. 1), pp. 1-16-1989 v.1917 no.1981.
Edelmann, A., Wolff, T., Montagne, D., and Bail, C.A. (2020). Computational Social Science and Sociology. Annual Review of Sociology 46.
Egli, D. (2005). Flowering, pod set and reproductive success in soya bean. Journal of Agronomy and crop science 191(4), 283-291.
Egli, D. (2013). The relationship between the number of nodes and pods in soybean communities. Crop Science 53(4), 1668-1676.
Eltaher, S., Baenziger, P.S., Belamkar, V., Emara, H.A., Nower, A.A., Salem, K.F., et al. (2021). GWAS revealed effect of genotype× environment interactions for grain yield of Nebraska winter wheat. BMC genomics 22(1), 1-14.
254
Emily, M. (2018). Power comparison of Cochran-Armitage trend test against allelic and genotypic tests in large-scale case-control genetic association studies. Statistical methods in medical research 27(9), 2657-2673.
Eraslan, G., Avsec, Ž., Gagneur, J., and Theis, F.J. (2019). Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics 20(7), 389-403.
Esposito, S., Carputo, D., Cardi, T., and Tripodi, P. (2020). Applications and Trends of Machine Learning in Genomics and Phenomics for Next-Generation Breeding. Plants 9(1), 34.
Etminan, A., Pour-Aboughadareh, A., Mohammadi, R., Shooshtari, L., Yousefiazarkhanian, M., and Moradkhani, H. (2019). Determining the best drought tolerance indices using artificial neural network (ANN): Insight into application of intelligent agriculture in agronomy and plant breeding. Cereal Research Communications 47(1), 170-181.
Fabiyi, E. (2006). Soyabean processing, utilization and health benefits. Pakistan Journal of Nutrition 5(5), 453-457.
Falk, K.G., Jubery, T.Z., Mirnezami, S.V., Parmley, K.A., Sarkar, S., Singh, A., et al. (2020). Computer vision and machine learning enabled soybean root phenotyping pipeline. Plant Methods 16(1), 5. doi: 10.1186/s13007-019-0550-5.
Fang, C., Ma, Y., Wu, S., Liu, Z., Wang, Z., Yang, R., et al. (2017). Genome-wide association studies dissect the genetic networks underlying agronomical traits in soybean. Genome biology 18(1), 1-14.
Farifteh, J., Van der Meer, F., Atzberger, C., and Carranza, E.J.M. (2007). Quantitative analysis of salt-affected soil reflectance spectra: A comparison of two adaptive methods (PLSR and ANN). Remote Sensing of Environment 110(1), 59-78. doi: https://doi.org/10.1016/j.rse.2007.02.005.
Fehr, W., and Caviness, C. (1971). Burmood DT, Penington J S. Development description of soybean, Glycine max (L) Mer. Crop Science 11, 929-931.
Fehr, W., Caviness, C., Burmood, D., and Pennington, J. (1971). Stage of development descriptions for soybeans, Glycine Max (L.) Merrill 1. Crop science 11(6), 929-931.
Fehr, W., Lawrence, B., and Thompson, T. (1981). Critical Stages of Development for Defoliation of Soybean 1. Crop Science 21(2), 259-262.
Feilhauer, H., Somers, B., and van der Linden, S. (2017). Optical trait indicators for remote sensing of plant species composition: Predictive power and seasonal variability. Ecological Indicators 73, 825-833.
Feng, L., Li, Y., Wang, Y., and Du, Q. (2020a). Estimating hourly and continuous ground-level PM2. 5 concentrations using an ensemble learning algorithm: The ST-stacking model. Atmospheric Environment 223, 117242.
255
Feng, L., Zhang, Z., Ma, Y., Du, Q., Williams, P., Drewry, J., et al. (2020b). Alfalfa Yield Prediction Using UAV-Based Hyperspectral Imagery and Ensemble Learning. Remote Sensing 12(12), 2028.
Feng, P., Ma, J., Sun, C., Xu, X., and Ma, Y. (2018). A novel dynamic Android malware detection system with ensemble learning. IEEE Access 6, 30996-31011.
Ferguson, R., and Rundquist, D.J.P.a.b. (2018). Remote sensing for site-specific crop management. (precisionagbasics), 103-118.
Fernandes, J.L., Ebecken, N.F.F., and Esquerdo, J.C.D.M. (2017). Sugarcane yield prediction in Brazil using NDVI time series and neural networks ensemble. International Journal of Remote Sensing 38(16), 4631-4644.
Fernández, C.I., Leblon, B., Haddadi, A., Wang, K., and Wang, J. (2020). Potato late blight detection at the leaf and canopy levels based in the red and red-edge spectral regions. Remote Sensing 12(8), 1292.
Ferris, R., Wheeler, T., Hadley, P., and Ellis, R. (1998). Recovery of photosynthesis after environmental stress in soybean grown under elevated CO2. Crop Science 38(4), 948-955.
Figueiredo, D.D., and Köhler, C. (2018). Auxin: a molecular trigger of seed development. Genes & development 32(7-8), 479-490.
Filippa, G., Cremonese, E., Migliavacca, M., Galvagno, M., Sonnentag, O., Humphreys, E., et al. (2018a). NDVI derived from near-infrared-enabled digital cameras: Applicability across different plant functional types. Agricultural and Forest Meteorology 249, 275-285.
Filippa, G., Cremonese, E., Migliavacca, M., Galvagno, M., Sonnentag, O., Humphreys, E., et al. (2018b). NDVI derived from near-infrared-enabled digital cameras: Applicability across different plant functional types. Agricultural and Forest Meteorology 249, 275-285. doi: https://doi.org/10.1016/j.agrformet.2017.11.003.
Filippi, A.M., and Jensen, J.R. (2006). Fuzzy learning vector quantization for hyperspectral coastal vegetation classification. Remote Sensing of Environment 100(4), 512-530.
Fischer, C., Pardos, Z.A., Baker, R.S., Williams, J.J., Smyth, P., Yu, R., et al. (2020). Mining big data in education: Affordances and challenges. Review of Research in Education 44(1), 130-160.
Fitzgerald, G., Rodriguez, D., and O’Leary, G. (2010). Measuring and predicting canopy nitrogen nutrition in wheat using a spectral index—The canopy chlorophyll content index (CCCI). Field Crops Research 116(3), 318-324.
Fletcher, R.S., Smith, J.R., Mengistu, A., and Ray, J.D. (2014). Relationships between microsclerotia content and hyperspectral reflectance data in soybean tissue infected by Macrophomina phaseolina. American Journal of Plant Sciences 5(25), 3737-3744.
Fletcher, T. (2009). Support vector machines explained. Tutorial paper., Mar, 28.
256
Flint-Garcia, S.A., Thornsberry, J.M., and Buckler IV, E.S. (2003). Structure of linkage disequilibrium in plants. Annual review of plant biology 54(1), 357-374.
Foley, J.A., Ramankutty, N., Brauman, K.A., Cassidy, E.S., Gerber, J.S., Johnston, M., et al. (2011). Solutions for a cultivated planet. Nature 478(7369), 337.
Ford, D.M., Shibles, R., and Green, D. (1983). Growth and Yield of Soybean Lines Selected for Divergent Leaf Photosynthetic Ability1. Crop Science 23(3), 517-520.
Fukushima, D. (1981). Soy proteins for foods centering around soy sauce and tofu. Journal of the American Oil Chemists’ Society 58(3), 346-354.
Furbank, R.T., and Tester, M. (2011). Phenomics–technologies to relieve the phenotyping bottleneck. Trends in plant science 16(12), 635-644.
G. Jones, T., Coops, N.C., and Sharma, T. (2010). Employing ground-based spectroscopy for tree-species differentiation in the Gulf Islands National Park Reserve. International Journal of Remote Sensing 31(4), 1121-1127.
Gal, T., and Greenberg, H.J. (2012). Advances in sensitivity analysis and parametric programming. New York, NY, USA: Springer Science & Business Media.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. (2011). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(4), 463-484.
Galli, G., Horne, D.W., Collins, S.D., Jung, J., Chang, A., Fritsche‐Neto, R., et al. (2020). Optimization of UAS‐based high‐throughput phenotyping to estimate plant health and grain yield in sorghum. The Plant Phenome Journal 3(1), e20010.
Gandhi, N., Armstrong, L.J., Petkar, O., and Tripathy, A.K. (Year). "Rice crop yield prediction in India using support vector machines", in: 2016 13th International Joint Conference on Computer Science and Software Engineering (JCSSE): IEEE), 1-5.
Gao, X., Liang, W., Yin, C., Ji, S., Wang, H., Su, X., et al. (2010). The SEPALLATA-like gene OsMADS34 is required for rice inflorescence and spikelet development. Plant Physiology 153(2), 728-740.
Gardner, M.W., and Dorling, S. (1998). Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment 32(14-15), 2627-2636.
Gaut, B.S., and Long, A.D. (2003). The lowdown on linkage disequilibrium. The Plant Cell 15(7), 1502-1506.
Ge, Y., Wang, Q., Wang, L., Wu, H., Peng, C., Wang, J., et al. (2019). Predicting post-stroke pneumonia using deep neural network approaches. International journal of medical informatics 132, 103986.
Geetha, M. (2020). Forecasting the Crop Yield Production in Trichy District Using Fuzzy C-Means Algorithm and Multilayer Perceptron (MLP). International Journal of Knowledge and Systems Science (IJKSS) 11(3), 83-98.
257
Gholizadeh, A., Mišurec, J., Kopačková, V., Mielke, C., and Rogass, C. (2016). Assessment of red-edge position extraction techniques: A case study for norway spruce forests using hymap and simulated sentinel-2 data. Forests 7(10), 226.
Ghorbani, M.A., Zadeh, H.A., Isazadeh, M., and Terzi, O. (2016). A comparative study of artificial neural network (MLP, RBF) and support vector machine models for river flow prediction. Environmental Earth Sciences 75(6), 476.
Gitelson, A., and Merzlyak, M.N. (1994). Quantitative estimation of chlorophyll-ausing reflectance spectra: experiments with autumn chestnut and maple leaves. Journal of Photochemistry and Photobiology B: Biology 22(3), 247-252.
Gitelson, A.A., Kaufman, Y.J., and Merzlyak, M.N. (1996). Use of a green channel in remote sensing of global vegetation from EOS-MODIS. Remote sensing of Environment 58(3), 289-298.
Godfray, H.C.J., Beddington, J.R., Crute, I.R., Haddad, L., Lawrence, D., Muir, J.F., et al. (2010). Food security: the challenge of feeding 9 billion people. science 327(5967), 812-818.
Goldberger, A.S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association 57(298), 369-375.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning. MIT press Cambridge.
Granitto, P.M., Furlanello, C., Biasioli, F., and Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems 83(2), 83-90.
Grömping, U. (2009). Variable importance assessment in regression: linear regression versus random forest. The American Statistician 63(4), 308-319.
Guerini Filho, M., Kuplich, T.M., and Quadros, F.L.D. (2020). Estimating natural grassland biomass by vegetation indices using Sentinel 2 remote sensing data. International Journal of Remote Sensing 41(8), 2861-2876.
Gupta, P.K., Rustgi, S., and Kulwal, P.L. (2005). Linkage disequilibrium and association studies in higher plants: present status and future prospects. Plant molecular biology 57(4), 461-485.
Gupta, R., Tanwar, S., Tyagi, S., and Kumar, N. (2020). Machine learning models for secure data analytics: A taxonomy and threat model. Computer Communications 153, 406-440.
Gupta, R., Vijayan, D., and Prasad, T. (2003). Comparative analysis of red-edge hyperspectral indices. Advances in Space Research 32(11), 2217-2222.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine learning 46(1-3), 389-422.
258
Halim, A.H., and Ismail, I. (Year). "Nonlinear plant modeling using neuro-fuzzy system with Tree Physiology Optimization", in: 2013 IEEE Student Conference on Research and Developement: IEEE), 295-300.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I.H. (2009). The WEKA data mining software: an update. ACM SIGKDD explorations newsletter 11(1), 10-18.
Hamet, P., and Tremblay, J. (2017). Artificial intelligence in medicine. Metabolism 69, S36-S40.
Hao, D., Chao, M., Yin, Z., and Yu, D. (2012). Genome-wide association analysis detecting significant single nucleotide polymorphisms for chlorophyll and chlorophyll fluorescence parameters in soybean (Glycine max) landraces. Euphytica 186(3), 919-931.
Harrell Jr, F.E. (2015). Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer.
Hartman, G.L., West, E.D., and Herman, T.K. (2011). Crops that feed the World 2. Soybean—worldwide production, use, and constraints caused by pathogens and pests. Food Security 3(1), 5-17.
Hashiguchi, A., and Komatsu, S. (2017). "Chapter 6 - Proteomics of Soybean Plants," in Proteomics in Food Science, ed. M.L. Colgrave. Academic Press), 89-105.
Hassan, M.A., Yang, M., Rasheed, A., Jin, X., Xia, X., Xiao, Y., et al. (2018). Time-Series Multispectral Indices from Unmanned Aerial Vehicle Imagery Reveal Senescence Rate in Bread Wheat. Remote Sensing 10(6), 809.
Hayes, B., and Goddard, M. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4), 1819-1829.
He, J., Jin, Y., Du, Y.-L., Wang, T., Turner, N.C., Yang, R.-P., et al. (2017). Genotypic Variation in Yield, Yield Components, Root Morphology and Architecture, in Soybean in Relation to Water and Phosphorus Supply. Frontiers in plant science 8, 1499.
Heddam, S., Bermad, A., and Dechemi, N. (2011). Applications of radial-basis function and generalized regression neural networks for modeling of coagulant dosage in a drinking water-treatment plant: comparative study. Journal of Environmental Engineering 137(12), 1209-1214.
Hennessy, A., Clarke, K., and Lewis, M. (2020). Hyperspectral Classification of Plants: A Review of Waveband Selection Generalisability. Remote Sensing 12(1), 113.
Herbert, S., and Litchfield, G. (1982). Partitioning Soybean Seed Yield Components 1. Crop science 22(5), 1074-1079.
Hesami, M., Alizadeh, M., Naderi, R., and Tohidfar, M. (2020a). Forecasting and optimizing Agrobacterium-mediated genetic transformation via ensemble model-fruit fly optimization algorithm: A data mining approach using chrysanthemum databases. Plos One 15(9), e0239901.
259
Hesami, M., Condori-Apfata, J.A., Valderrama Valencia, M., and Mohammadi, M. (2020b). Application of Artificial Neural Network for Modeling and Studying In Vitro Genotype-Independent Shoot Regeneration in Wheat. Applied Sciences 10(15), 5370. doi: 10.3390/app10155370.
Hesami, M., and Jones, A.M.P. (2020a). Application of artificial intelligence models and optimization algorithms in plant cell and tissue culture. Applied Microbiology and Biotechnology 104, 1-37.
Hesami, M., and Jones, A.M.P. (2020b). Application of artificial intelligence models and optimization algorithms in plant cell and tissue culture. Applied Microbiology and Biotechnology 104. doi: 10.1007/s00253-020-10888-2.
Hesami, M., and Jones, A.M.P. (2020c). Application of artificial intelligence models and optimization algorithms in plant cell and tissue culture. Applied Microbiology and Biotechnology. doi: 10.1007/s00253-020-10888-2.
Hesami, M., Naderi, R., and Tohidfar, M. (2019a). Modeling and Optimizing Medium Composition for Shoot Regeneration of Chrysanthemum via Radial Basis Function-Non-dominated Sorting Genetic Algorithm-II (RBF-NSGAII). Scientific Reports 9(1), 1-11.
Hesami, M., Naderi, R., and Tohidfar, M. (2020c). Introducing a hybrid artificial intelligence method for high-throughput modeling and optimizing plant tissue culture processes: the establishment of a new embryogenesis medium for chrysanthemum, as a case study. Applied Microbiology and Biotechnology 104(23), 10249-10263.
Hesami, M., Naderi, R., Tohidfar, M., and Yoosefzadeh-Najafabadi, M. (2019b). Application of adaptive neuro-fuzzy inference system-non-dominated sorting genetic Algorithm-II (ANFIS-NSGAII) for modeling and optimizing somatic embryogenesis of Chrysanthemum. Frontiers in plant science 10, 869.
Hesami, M., Naderi, R., Tohidfar, M., and Yoosefzadeh-Najafabadi, M. (2020d). Development of support vector machine-based model and comparative analysis with artificial neural network for modeling the plant tissue culture procedures: effect of plant growth regulators on somatic embryogenesis of chrysanthemum, as a case study. Plant Methods 16(1), 1-15.
Hill, W., and Robertson, A. (1968). Linkage disequilibrium in finite populations. Theoretical and Applied Genetics 38(6), 226-231.
Holland, J.H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press.
Homack, S.R. (2001). "Understanding What ANOVA Post Hoc Tests Are, Really", in: Annual Meeting of the Southwest Educational Research Association. (New Orleans, LA, USA).
260
Hu, D., Zhang, H., Du, Q., Hu, Z., Yang, Z., Li, X., et al. (2020). Genetic dissection of yield-related traits via genome-wide association analysis across multiple environments in wild soybean (Glycine soja Sieb. and Zucc.). Planta 251(2), 39.
Hu, P., Zheng, Q., Luo, Q., Teng, W., Li, H., Li, B., et al. (2021). Genome-wide association study of yield and related traits in common wheat under salt-stress conditions. BMC Plant Biology 21(1), 1-20.
Hu, Z., Zhang, D., Zhang, G., Kan, G., Hong, D., and Yu, D. (2014). Association mapping of yield-related traits and SSR markers in wild soybean (Glycine soja Sieb. and Zucc.). Breeding science 63(5), 441-449.
Huang, H., Zeng, Q., Pei, X., Wong, S., and Xu, P. (2016). Predicting crash frequency using an optimised radial basis function neural network model. Transportmetrica A: transport science 12(4), 330-345.
Huang, X., Wei, X., Sang, T., Zhao, Q., Feng, Q., Zhao, Y., et al. (2010). Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet 42. doi: 10.1038/ng.695.
Hussin, S.H., Wang, H., Tang, S., Zhi, H., Tang, C., Zhang, W., et al. (2021). SiMADS34, an E-class MADS-box transcription factor, regulates inflorescence architecture and grain yield in Setaria italica. Plant Molecular Biology 105(4), 419-434.
Hwang, E.-Y., Song, Q., Jia, G., Specht, J.E., Hyten, D.L., Costa, J., et al. (2014). A genome-wide association study of seed protein and oil content in soybean. BMC genomics 15(1), 1.
Hwang, J.-U., Song, W.-Y., Hong, D., Ko, D., Yamaoka, Y., Jang, S., et al. (2016). Plant ABC transporters enable many unique aspects of a terrestrial plant's lifestyle. Molecular plant 9(3), 338-355.
Hyten, D.L., Choi, I.-Y., Song, Q., Shoemaker, R.C., Nelson, R.L., Costa, J.M., et al. (2007). Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics 175(4), 1937-1944.
Hyten, D.L., Song, Q., Zhu, Y., Choi, I.-Y., Nelson, R.L., Costa, J.M., et al. (2006). Impacts of genetic bottlenecks on soybean genome diversity. Proceedings of the National Academy of Sciences 103(45), 16666-16671.
Jafari, M., and Shahsavar, A. (2020). The application of artificial neural networks in modeling and predicting the effects of melatonin on morphological responses of citrus to drought stress. Plos one 15(10), e0240427.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning. Springer.
Jamil, I.N., Remali, J., Azizan, K.A., Muhammad, N.A.N., Arita, M., Goh, H.-H., et al. (2020). Systematic Multi-Omics Integration (MOI) Approach in Plant Systems Biology. Frontiers in Plant Science 11.
261
Ji, J., Scott, M., and Bhattacharyya, M. (2006). Light is Essential for Degradation of Ribulose‐1, 5‐Bisphosphate Carboxylase‐Oxygenase Large Subunit During Sudden Death Syndrome Development in Soybean. Plant Biology 8(5), 597-605.
Jia, T., Wang, L., Li, J., Ma, J., Cao, Y., Lübberstedt, T., et al. (2020). Integrating a genome-wide association study with transcriptomic analysis to detect genes controlling grain drying rate in maize (Zea may, L.). Theoretical and Applied Genetics 133(2), 623-634.
Jiang, S., Zhang, H., Ni, P., Yu, S., Dong, H., Zhang, A., et al. (2020a). Genome-Wide Association Study Dissects the Genetic Architecture of Maize Husk Tightness. Frontiers in plant science 11, 861.
Jiang, Y., Lindsay, D.L., Davis, A.R., Wang, Z., MacLean, D.E., Warkentin, T.D., et al. (2020b). Impact of heat stress on pod‐based yield components in field pea (Pisum sativum L.). Journal of Agronomy and Crop Science 206(1), 76-89.
Jiang, Y., Wei, G., Sun, X., and Zhang, Y. (Year). "Predicting Noisy Data with an Improvement RBF Neural Network for Surrogate Models", in: 2016 4th International Conference on Machinery, Materials and Computing Technology: Atlantis Press).
Jin, J., Liu, X., Wang, G., Mi, L., Shen, Z., Chen, X., et al. (2010). Agronomic and physiological contributions to the yield improvement of soybean cultivars released from 1950 to 2006 in Northeast China. Field Crops Research 115(1), 116-123.
Jin, Z., Zhuang, Q., Tan, Z., Dukes, J.S., Zheng, B., and Melillo, J.M. (2016). Do maize models capture the impacts of heat and drought stresses on yield? Using algorithm ensembles to identify successful approaches. Global change biology 22(9), 3112-3126.
Jolly, W.M., Nemani, R., and Running, S.W. (2005). A generalized, bioclimatic index to predict foliar phenology in response to climate. Global Change Biology 11(4), 619-632.
Jordan, M.I., and Mitchell, T.M. (2015). Machine learning: Trends, perspectives, and prospects. Science 349(6245), 255-260.
Ju, C., Bibaut, A., and van der Laan, M. (2018). The relative performance of ensemble methods with deep convolutional neural networks for image classification. Journal of Applied Statistics 45(15), 2800-2818.
Kahlon, C.S., and Board, J.E. (2012). Growth dynamic factors explaining yield improvement in new versus old soybean cultivars. Journal of crop improvement 26(2), 282-299.
Kahlon, C.S., Board, J.E., and Kang, M.S. (2011). An analysis of yield component changes for new vs. old soybean cultivars. Agronomy Journal 103(1), 13-22.
Kaler, A.S., Dhanapal, A.P., Ray, J.D., King, C.A., Fritschi, F.B., and Purcell, L.C. (2017). Genome‐wide association mapping of carbon isotope and oxygen isotope ratios in diverse soybean genotypes. Crop Science 57(6), 3085-3100.
262
Kaler, A.S., Gillman, J.D., Beissinger, T., and Purcell, L.C. (2020). Comparing different statistical models and multiple testing corrections for association mapping in soybean and maize. Frontiers in plant science 10, 1794.
Kan, G., Zhang, W., Yang, W., Ma, D., Zhang, D., Hao, D., et al. (2015). Association mapping of soybean seed germination under salt stress. Molecular Genetics and Genomics 290(6), 2147-2162.
Kandel, Y.R., Wise, K.A., Bradley, C.A., Tenuta, A.U., and Mueller, D.S. (2016). Effect of planting date, seed treatment, and cultivar on plant population, sudden death syndrome, and yield of soybean. Plant Disease 100(8), 1735-1743.
Kaneko, H. (2020). Support vector regression that takes into consideration the importance of explanatory variables. Journal of Chemometrics, e3327.
Karikari, B., Wang, Z., Zhou, Y., Yan, W., Feng, J., and Zhao, T. (2020). Identification of quantitative trait nucleotides and candidate genes for soybean seed weight by multiple models of genome-wide association study. BMC plant biology 20(1), 1-14.
Karmakar, P., and Bhatnagar, P. (1996). Genetic improvement of soybean varieties released in India from 1969 to 1993. Euphytica 90(1), 95-103.
Katsileros, A., Drosou, K., and Koukouvinos, C. (2015a). Evaluation of nearest neighbor methods in wheat genotype experiments. Communications in Biometry and Crop Science 10, 115-123.
Katsileros, A., Drosou, K., and Koukouvinos, C. (2015b). Evaluation of nearest neighbor methods in wheat genotype experiments. Commun. Biom. Crop Sci 10, 115-123.
Kaur, P., Bayer, P.E., Milec, Z., Vrána, J., Yuan, Y., Appels, R., et al. (2017). An advanced reference genome of Trifolium subterraneum L. reveals genes related to agronomic performance. Plant Biotechnology Journal 15(8), 1034-1046.
Kayad, A.G., Al-Gaadi, K.A., Tola, E., Madugundu, R., Zeyada, A.M., and Kalaitzidis, C. (2016). Assessing the spatial variability of alfalfa yield using satellite imagery and ground-based data. PLoS One 11(6), e0157166.
Kenga, R., Tenkouano, A., Gupta, S., and Alabi, S. (2006). Genetic and phenotypic association between yield components in hybrid sorghum (Sorghum bicolor (L.) Moench) populations. Euphytica 150(3), 319-326.
Khan, R., Hanbury, A., and Stoettinger, J. (Year). "Skin detection: A random forest approach", in: 2010 IEEE International Conference on Image Processing: IEEE), 4613-4616.
Kim, C.Y., Bove, J., and Assmann, S.M. (2008). Overexpression of wound‐responsive RNA‐binding proteins induces leaf senescence and hypersensitive‐like cell death. New Phytologist 180(1), 57-70.
Kim, E.-H., Ro, H.-M., Kim, S.-L., Kim, H.-S., and Chung, I.-M. (2012). Analysis of isoflavone, phenolic, soyasapogenol, and tocopherol compounds in soybean [Glycine max (L.) Merrill] germplasms of different seed weights and origins. Journal of agricultural and food chemistry 60(23), 6045-6055.
263
Kim, E., Kim, S., Chung, J., Chi, H., Kim, J., and Chung, I. (2006). Analysis of phenolic compounds and isoflavones in soybean seeds (Glycine max (L.) Merill) and sprouts grown under different conditions. European Food Research and Technology 222(1-2), 201.
Kim, G.B., Kim, W.J., Kim, H.U., and Lee, S.Y. (2020a). Machine learning applications in systems metabolic engineering. Current opinion in biotechnology 64, 1-9.
Kim, K.D., Kang, Y., and Kim, C. (2020b). Application of Genomic Big Data in Plant Breeding: Past, Present, and Future. Plants 9(11), 1454.
Kim, M.Y., Lee, S., Van, K., Kim, T.-H., Jeong, S.-C., Choi, I.-Y., et al. (2010). Whole-genome sequencing and intensive analysis of the undomesticated soybean (Glycine soja Sieb. and Zucc.) genome. Proceedings of the National Academy of Sciences 107(51), 22032-22037.
Kim, S., Yamaoka, Y., Ono, H., Kim, H., Shim, D., Maeshima, M., et al. (2013). AtABCA9 transporter supplies fatty acids for lipid synthesis to the endoplasmic reticulum. Proceedings of the national academy of sciences 110(2), 773-778.
King, R.A., Deb, K., and Rughooputh, H. (2010). Comparison of NSGA-II and SPEA2 on the multiobjective environmental/economic dispatch problem. University of Mauritius Research Journal 16(1), 485-511.
Knipling, E.B. (1970). Physical and physiological basis for the reflectance of visible and near-infrared radiation from vegetation. Remote sensing of environment 1(3), 155-159.
Kooistra, L., Suomalainen, J., Iqbal, S., Franke, J., Wenting, P., Bartholomeus, H., et al. (Year). "Crop monitoring using a light-weight hyperspectral mapping system for unmanned aerial vehicles: first results for the 2013 season", in: Proceedings of 2013 Workshop on UAV-based Remote Sensing Methods for Monitoring Vegetation, Cologne, Germany), 5158.
Kosmowski, F., and Worku, T. (2018). Evaluation of a miniaturized NIR spectrometer for cultivar identification: The case of barley, chickpea and sorghum in Ethiopia. PloS one 13(3), e0193620.
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software 28(5), 1-26.
Kuhn, M., and Johnson, K. (2013). Applied predictive modeling. Springer. Kuhn, M., Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., et al. (2020).
Package ‘caret’. The R Journal. Kukana, P. (Year). "Hybrid Machine Learning Algorithm-Based Paddy Leave Disease
Detection System", in: 2020 International Conference on Smart Electronics and Communication (ICOSEC): IEEE), 512-519.
Kumar, R., and Silva, L. (1973). Light ray tracing through a leaf cross section. Applied Optics 12(12), 2950-2954.
264
Kumudini, S., Hume, D.J., and Chu, G. (2001). Genetic improvement in short season soybeans. Crop science 41(2), 391-398.
KUSWANTORO, H. (2017). Genetic variability and heritability of acid-adaptive soybean promising lines. Biodiversitas Journal of Biological Diversity 18(1).
Kycko, M., Zagajewski, B., Lavender, S., Romanowska, E., and Zwijacz-Kozica, M. (2018). The Impact of Tourist Traffic on the Condition and Cell Structures of Alpine Swards. Remote Sensing 10(2), 220.
Labrinidis, A., and Jagadish, H.V. (2012). Challenges and opportunities with big data. Proceedings of the VLDB Endowment 5(12), 2032-2033.
Lambermon, M.H., Fu, Y., Kirk, D.A.W., Dupasquier, M., Filipowicz, W., and Lorković, Z.J. (2002). UBA1 and UBA2, two proteins that interact with UBP1, a multifunctional effector of pre-mRNA maturation in plants. Molecular and Cellular Biology 22(12), 4346-4357.
Lane, H.M., Murray, S.C., Montesinos‐López, O.A., Montesinos‐López, A., Crossa, J., Rooney, D.K., et al. (2020). Phenomic selection and prediction of maize grain yield from near‐infrared reflectance spectroscopy of kernels. The Plant Phenome Journal 3(1), e20002.
Langenberg, P., Balda, E., Behboodi, A., and Mathar, R. (Year). "On the Robustness of Support Vector Machines against Adversarial Examples", in: 2019 13th International Conference on Signal Processing and Communication Systems (ICSPCS): IEEE), 1-6.
Lawrence, S., and Giles, C.L. (Year). "Overfitting and neural networks: conjugate gradient and backpropagation", in: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium: IEEE), 114-119.
Le, B.H., Cheng, C., Bui, A.Q., Wagmaister, J.A., Henry, K.F., Pelletier, J., et al. (2010). Global analysis of gene activity during Arabidopsis seed development and identification of seed-specific transcription factors. Proceedings of the National Academy of Sciences 107(18), 8063-8070.
Leamy, L.J., Zhang, H., Li, C., Chen, C.Y., and Song, B.-H. (2017). A genome-wide association study of seed composition traits in wild soybean (Glycine soja). BMC genomics 18(1), 1-15.
Lee, J.H., Ryu, H.-S., Chung, K.S., Posé, D., Kim, S., Schmid, M., et al. (2013). Regulation of temperature-responsive flowering by MADS-box transcription factor repressors. Science 342(6158), 628-632.
Lee, S., Liang, X., Woods, M., Reiner, A.S., Concannon, P., Bernstein, L., et al. (2020). Machine learning on genome-wide association studies to predict the risk of radiation-associated contralateral breast cancer in the WECARE Study. PloS one 15(2), e0226157.
265
Legris, M., Ince, Y.Ç., and Fankhauser, C. (2019). Molecular mechanisms underlying phytochrome-controlled morphogenesis in plants. Nature communications 10(1), 1-15.
Lepetit, V., Lagger, P., and Fua, P. (Year). "Randomized trees for real-time keypoint recognition", in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05): IEEE), 775-781.
Lepiniec, L., Devic, M., Roscoe, T., Bouyer, D., Zhou, D.-X., Boulard, C., et al. (2018). Molecular and epigenetic regulations and functions of the LAFL transcriptional regulators that control seed development. Plant reproduction 31(3), 291-307.
Li, C., Zou, J., Jiang, H., Yu, J., Huang, S., Wang, X., et al. (2018a). Identification and validation of number of pod‐and seed‐related traits QTL s in soybean. Plant Breeding 137(5), 730-745.
Li, J., Kinoshita, T., Pandey, S., Ng, C.K.-Y., Gygi, S.P., Shimazaki, K.-i., et al. (2002). Modulation of an RNA-binding protein by abscisic-acid-activated protein kinase. Nature 418(6899), 793-797.
Li, S., Zhang, C., Yang, D., Lu, M., Qian, Y., Jin, F., et al. (2021). Detection of QTNs for kernel moisture concentration and kernel dehydration rate before physiological maturity in maize using multi-locus GWAS. Scientific reports 11(1), 1-10.
Li, X., Tian, R., Kamala, S., Du, H., Li, W., Kong, Y., et al. (2018b). Identification and verification of pleiotropic QTL controlling multiple amino acid contents in soybean seed. Euphytica 214(6), 1-14.
Li, Y.-h., Reif, J.C., Ma, Y.-s., Hong, H.-l., Liu, Z.-x., Chang, R.-z., et al. (2015). Targeted association mapping demonstrating the complex molecular genetics of fatty acid formation in soybean. BMC genomics 16(1), 1-13.
Li, Y.h., Shi, X.h., Li, H.h., Reif, J.C., Wang, J.j., Liu, Z.x., et al. (2016). Dissecting the genetic basis of resistance to soybean cyst nematode combining linkage and association mapping. The plant genome 9(2), plantgenome2015.2004.0020.
Liakos, K.G., Busato, P., Moshou, D., Pearson, S., and Bochtis, D. (2018). Machine learning in agriculture: A review. Sensors 18(8), 2674.
Liaw, A., and Wiener, M. (2002). Classification and regression by randomForest. R news 2(3), 18-22.
Lin, F., Wani, S.H., Collins, P.J., Wen, Z., Li, W., Zhang, N., et al. (2020). QTL mapping and GWAS for identification of loci conferring partial resistance to Pythium sylvaticum in soybean (Glycine max (L.) Merr). Molecular Breeding 40, 1-11.
Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., et al. (Year). "Large-scale image classification: Fast feature extraction and SVM training", in: CVPR 2011: IEEE), 1689-1696.
Lipka, A.E., Kandianis, C.B., Hudson, M.E., Yu, J., Drnevich, J., Bradbury, P.J., et al. (2015). From association to prediction: statistical methods for the dissection and selection of complex traits in plants. Current Opinion in Plant Biology 24, 110-118.
266
Lipka, A.E., Tian, F., Wang, Q., Peiffer, J., Li, M., Bradbury, P.J., et al. (2012). GAPIT: genome association and prediction integrated tool. Bioinformatics 28(18), 2397-2399.
Liu, B., Liu, X., Wang, C., Li, Y., Jin, J., and Herbert, S. (2010). Soybean yield and yield component distribution across the main axis in response to light enrichment and shading under different densities. Plant, Soil and Environment 56(8), 384-392.
Liu, C., Teo, Z.W.N., Bi, Y., Song, S., Xi, W., Yang, X., et al. (2013). A conserved genetic pathway determines inflorescence architecture in Arabidopsis and rice. Developmental cell 24(6), 612-622.
Liu, F., Quesada, V., Crevillén, P., Bäurle, I., Swiezewski, S., and Dean, C. (2007). The Arabidopsis RNA-binding protein FCA requires a lysine-specific demethylase 1 homolog to downregulate FLC. Molecular cell 28(3), 398-407.
Liu, X., Huang, M., Fan, B., Buckler, E.S., and Zhang, Z. (2016). Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS genetics 12(2), e1005767.
Liu, X., Jin, J., Herbert, S., Zhang, Q., and Wang, G. (2005). Yield components, dry matter, LAI and LAD of soybeans in Northeast China. Field Crops Research 93(1), 85-93.
Liu, X., Jin, J., Wang, G., and Herbert, S.J. (2008). Soybean yield physiology and development of high-yielding practices in Northeast China. Field Crops Research 105(3), 157-171. doi: https://doi.org/10.1016/j.fcr.2007.09.003.
Liu, X., Qin, D., Piersanti, A., Zhang, Q., Miceli, C., and Wang, P. (2020). Genome-wide association study identifies candidate genes related to oleic acid content in soybean seeds. BMC plant biology 20(1), 1-14.
Lobell, D.B., Di Tommaso, S., You, C., Yacoubou Djima, I., Burke, M., and Kilic, T. (2020). Sight for sorghums: Comparisons of Satellite-and ground-based sorghum yield estimates in mali. Remote Sensing 12(1), 100.
Lopez-Cruz, M., Olson, E., Rovere, G., Crossa, J., Dreisigacker, S., Mondal, S., et al. (2020). Regularized selection indices for breeding value prediction using hyper-spectral image data. Scientific Reports 10(1), 1-12.
Loughrey, J., and Cunningham, P. (Year). "Overfitting in wrapper-based feature subset selection: The harder you try the worse it gets", in: International Conference on Innovative Techniques and Applications of Artificial Intelligence: Springer), 33-43.
Lu, X., Wang, J., Wang, Y., Wen, W., Zhang, Y., Du, J., et al. (2021). Genome-Wide Association Study of Maize Aboveground Dry Matter Accumulation at Seedling Stage. Frontiers in Genetics 11, 1679.
Lykhovyd, P. (2020). Sweet corn yield simulation using normalized difference vegetation index and leaf area index. Journal of Ecological Engineering 21(3).
267
Ma, B., Dwyer, L.M., Costa, C., Cober, E.R., and Morrison, M.J. (2001). Early prediction of soybean yield from canopy reflectance measurements. Agronomy Journal 93(6), 1227-1234.
Ma, B., Morrison, M.J., and Dwyer, L.M. (1996). Canopy light reflectance and field greenness to assess nitrogen fertilization and yield of maize. Agronomy Journal 88(6), 915-920.
MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., et al. (2017). The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic acids research 45(D1), D896-D901.
Mackay, T.F. (2001). The genetic architecture of quantitative traits. Annual review of genetics 35(1), 303-339.
Maganathan, T., Senthilkumar, S., and Balakrishnan, V. (Year). "Machine Learning and Data Analytics for Environmental Science: A Review, Prospects and Challenges", in: IOP Conference Series: Materials Science and Engineering: IOP Publishing), 012107.
Mahalingam, T., and Subramoniam, M. (2020). ACO–MKFCM: An Optimized Object Detection and Tracking Using DNN and Gravitational Search Algorithm. Wireless Personal Communications 110(3), 1567-1604.
Maheta, H.H., and Dabhi, V.K. (Year). "An improved SPEA2 Multi objective algorithm with non dominated elitism and Generational Crossover", in: 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT): IEEE), 75-82.
Mahood, E.H., Kruse, L.H., and Moghe, G.D. (2020). Machine learning: A powerful tool for gene function prediction in plants. Applications in Plant Sciences 8(7), e11376.
Maimaitijiang, M., Sagan, V., Sidike, P., Hartling, S., Esposito, F., and Fritschi, F.B. (2020). Soybean yield prediction from UAV using multimodal data fusion and deep learning. Remote Sensing of Environment 237, 111599.
Majhi, P.K., Mogali, S.C., and Abhisheka, L. (2020). Genetic variability, heritability, genetic advance and correlation studies for seed yield and yield components in early segregating lines (F3) of greengram [Vigna radiata (L.) Wilczek]. International Journal of Chemical Studies 8(4), 1283-1288.
Makantasis, K., Karantzalos, K., Doulamis, A., and Doulamis, N. (Year). "Deep supervised learning for hyperspectral data classification through convolutional neural networks", in: 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS): IEEE), 4959-4962.
Maldonado-Mota, C., Moghaddam, S., Schröder, S., Hurtado-Gonzales, O., McClean, P., Pasche, J., et al. (2020). Genomic regions associated with resistance to anthracnose in the Guatemalan climbing bean (Phaseolus vulgaris L.) germplasm collection. Genetic Resources and Crop Evolution, 1-11.
268
Malle, S., Eskandari, M., Morrison, M., and Belzile, F. (2020). Genome-wide association identifies several QTLs controlling cysteine and methionine content in soybean seed including some promising candidate genes. Scientific reports 10(1), 1-14.
Mamidi, S., Chikara, S., Goos, R.J., Hyten, D.L., Annam, D., Moghaddam, S.M., et al. (2011). Genome-wide association analysis identifies candidate genes associated with iron deficiency chlorosis in soybean. The Plant Genome 4(3), 154-164.
Mangena, P. (2020). Phytocystatins and their Potential Application in the Development of Drought Tolerance Plants in Soybeans (Glycine max L.). Protein and Peptide Letters 27(2), 135-144.
Mao, T., Li, J., Wen, Z., Wu, T., Wu, C., Sun, S., et al. (2017). Association mapping of loci controlling genetic and environmental interaction of soybean flowering time under various photo-thermal conditions. BMC genomics 18(1), 1-17.
Mariotto, I., Thenkabail, P.S., Huete, A., Slonecker, E.T., and Platonov, A. (2013). Hyperspectral versus multispectral crop-productivity modeling and type discrimination for the HyspIRI mission. Remote Sensing of Environment 139, 291-305.
Marti, J., Bort, J., Slafer, G., and Araus, J. (2007). Can wheat yield be assessed by early measurements of normalized difference vegetation index? Annals of Applied biology 150(2), 253-257.
McMurtrey Iii, J., Chappelle, E., Kim, M., Meisinger, J., and Corp, L. (1994). Distinguishing nitrogen fertilization levels in field corn (Zea mays L.) with actively induced fluorescence and passive reflectance measurements. Remote Sensing of Environment 47(1), 36-44.
McQueen, R.J., Garner, S.R., Nevill-Manning, C.G., and Witten, I.H. (1995). Applying machine learning to agricultural data. Computers and electronics in agriculture 12(4), 275-293.
Mehdaoui, R., and Anane, M. (2020). Exploitation of the red-edge bands of Sentinel 2 to improve the estimation of durum wheat yield in Grombalia region (Northeastern Tunisia). International Journal of Remote Sensing 41(23), 8986-9008.
Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research 7(Jun), 983-999.
Melesse, A.M., Khosravi, K., Tiefenbacher, J.P., Heddam, S., Kim, S., Mosavi, A., et al. (2020). River Water Salinity Prediction Using Hybrid Machine Learning Models. Water 12(10), 2951.
Merlier, E., Hmimina, G., Dufrêne, E., Soudani, K.J.J.o.P., and Biology, P.B. (2015). Explaining the variability of the photochemical reflectance index (PRI) at the canopy-scale: Disentangling the effects of phenological and physiological changes. 151, 161-171.
269
Merzlyak, M.N., Solovchenko, A.E., and Gitelson, A.A. (2003). Reflectance spectral features and non-destructive estimation of chlorophyll, carotenoid and anthocyanin content in apple fruit. Postharvest biology and technology 27(2), 197-211.
Miransari, M. (2016). "11 - Soybean production and N fertilization," in Abiotic and Biotic Stresses in Soybean Production, ed. M. Miransari. (San Diego: Academic Press), 241-260.
Mladenova, I.E., Bolten, J.D., Crow, W.T., Anderson, M.C., Hain, C.R., Johnson, D.M., et al. (2017). Intercomparison of soil moisture, evaporative stress, and vegetation indices for estimating corn and soybean yields over the US. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10(4), 1328-1343.
Moellers, T.C., Singh, A., Zhang, J., Brungardt, J., Kabbage, M., Mueller, D.S., et al. (2017). Main and epistatic loci studies in soybean for Sclerotinia sclerotiorum resistance reveal multiple modes of resistance in multi-environments. Scientific reports 7(1), 1-13.
Mohammadi, M., Xavier, A., Beckett, T., Beyer, S., Chen, L., Chikssa, H., et al. (2020). Identification, Deployment, and Transferability of Quantitative Trait Loci from Genome-Wide Association Studies in Plants. Current Plant Biology, 100145.
Mokry, F.B., Higa, R.H., de Alvarenga Mudadu, M., de Lima, A.O., Meirelles, S.L.C., da Silva, M.V.G.B., et al. (2013). Genome-wide association study for backfat thickness in Canchim beef cattle using Random Forest approach. BMC genetics 14(1), 47.
Mondal, S., Rutkoski, J.E., Velu, G., Singh, P.K., Crespo-Herrera, L.A., Guzmán, C., et al. (2016). Harnessing Diversity in Wheat to Enhance Grain Yield, Climate Resilience, Disease and Insect Pest Resistance and Nutrition Through Conventional and Modern Breeding Approaches. 7(991). doi: 10.3389/fpls.2016.00991.
Montesinos-López, O.A., Montesinos-López, A., Crossa, J., de los Campos, G., Alvarado, G., Suchismita, M., et al. (2017a). Predicting grain yield using canopy hyperspectral reflectance in wheat breeding data. Plant methods 13(1), 4.
Montesinos-López, O.A., Montesinos-López, A., Crossa, J., los Campos, G., Alvarado, G., Suchismita, M., et al. (2017b). Predicting grain yield using canopy hyperspectral reflectance in wheat breeding data. Plant methods 13(1), 4.
Morrison, M.J., Voldeng, H.D., and Cober, E.R. (2000). Agronomic changes from 58 years of genetic improvement of short-season soybean cultivars in Canada. Agronomy Journal 92(4), 780-784.
Muhammed, H.H. (2005). Hyperspectral crop reflectance data for characterising and estimating fungal disease severity in wheat. Biosystems Engineering 91(1), 9-20.
Murakoshi, K. (2005). Avoiding overfitting in multilayer perceptrons with feeling-of-knowing using self-organizing maps. BioSystems 80(1), 37-40.
270
Mutanga, O., and Skidmore, A.K. (2004). Narrow band vegetation indices overcome the saturation problem in biomass estimation. International journal of remote sensing 25(19), 3999-4014.
Na, J.-K., Kim, J.-K., Kim, D.-Y., and Assmann, S.M. (2015). Expression of potato RNA-binding proteins StUBA2a/b and StUBA2c induces hypersensitive-like cell death and early leaf senescence in Arabidopsis. Journal of experimental botany 66(13), 4023-4033.
Nakaoka, H., and Inoue, I. (2009). Meta-analysis of genetic association studies: methodologies, between-study heterogeneity and winner's curse. Journal of human genetics 54(11), 615-623.
Nasser, T., and Tariq, R. (2015). Big data challenges. J Comput Eng Inf Technol 4: 3. doi: http://dx. doi. org/10.4172/2324 9307(2).
Nezami-Alanagh, E., Garoosi, G.-A., Landín, M., and Gallego, P.P. (2018). Combining DOE with neurofuzzy logic for healthy mineral nutrition of pistachio rootstocks in vitro culture. Frontiers in plant science 9, 1474.
Nguyen, M.H., and De la Torre, F. (2010). Optimal feature selection for support vector machines. Pattern recognition 43(3), 584-591.
Niazian, M., and Niedbała, G. (2020). Machine Learning for Plant Breeding and Biotechnology. Agriculture 10(10), 436.
Nico, M., Miralles, D.J., and Kantolic, A.G. (2019). Natural post-flowering photoperiod and photoperiod sensitivity: Roles in yield-determining processes in soybean. Field Crops Research 231, 141-152.
Nigam, A., Tiwari, A.K., and Pandey, A. (2020). Paddy leaf diseases recognition and classification using PCA and BFO-DNN algorithm by image processing. Materials Today: Proceedings 33, 4856–4862.
Nordborg, M., and Tavaré, S. (2002). Linkage disequilibrium: what history has to tell us. TRENDS in Genetics 18(2), 83-90.
Nourani, V., Elkiran, G., and Abba, S. (2018). Wastewater treatment plant performance analysis using artificial intelligence–an ensemble approach. Water Science and Technology 78(10), 2064-2076.
Nutter Jr, F., Tylka, G., Guan, J., Moreira, A., Marett, C., Rosburg, T., et al. (2002). Use of remote sensing to detect soybean cyst nematode-induced plant stress. Journal of Nematology 34(3), 222.
O’Connor, K., Hayes, B., and Topp, B. (2018). Prospects for increasing yield in macadamia using component traits and genomics. Tree genetics & genomes 14(1), 7.
Ogutu, J.O., Piepho, H.-P., and Schulz-Streeck, T. (Year). "A comparison of random forests, boosting and support vector machines for genomic selection", in: BMC proceedings: Springer), S11.
271
Okamoto, A., Hanagata, H., Kawamura, Y., and Yanagida, F. (1995). Anti-hypertensive substances in fermented soybean, natto. Plant foods for human nutrition 47(1), 39-47.
Ollinger, S.V. (2011). Sources of variability in canopy reflectance and the convergent properties of plants. New Phytologist 189(2), 375-394.
Orozco-Arroyo, G., Paolo, D., Ezquer, I., and Colombo, L. (2015). Networks controlling seed size in Arabidopsis. Plant reproduction 28(1), 17-32.
Orr, M.J. (1996). "Introduction to radial basis function networks". (Edinburgh, UK: Technical Report, center for cognitive science, University of Edinburgh).
Ozaki, Y., McClure, W.F., and Christy, A.A. (2006). Near-infrared spectroscopy in food science and technology. Hoboken, NJ, USA: John Wiley & Sons.
Pal, M. (2005). Random forest classifier for remote sensing classification. International Journal of Remote Sensing 26(1), 217-222.
Pal, M. (2007). Ensemble learning with decision tree for remote sensing classification. World Academy of Science, Engineering and Technology 36, 258-260.
Pal, S.K., and Mitra, S. (1992a). Multilayer perceptron, fuzzy sets, classifiaction. Pal, S.K., and Mitra, S. (1992b). Multilayer perceptron, fuzzy sets, classifiaction. IEEE
Transactions on neural networks 3, 683-697. Pantazi, X.E., Moshou, D., Alexandridis, T., Whetton, R.L., and Mouazen, A.M. (2016).
Wheat yield prediction using machine learning and advanced sensing techniques. Computers and Electronics in Agriculture 121, 57-65.
Parry, M.A., Reynolds, M., Salvucci, M.E., Raines, C., Andralojc, P.J., Zhu, X.-G., et al. (2010). Raising yield potential of wheat. II. Increasing photosynthetic capacity and efficiency. Journal of experimental botany 62(2), 453-467.
Pasaniuc, B., and Price, A.L. (2017). Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics 18(2), 117-127.
Pedersen, P., Kumudini, S., Board, J., and Conley, S. (2004). Soybean growth and development. Iowa State University, University Extension Ames, IA.
Pedersen, P., and Lauer, J.G. (2004). Response of soybean yield components to management system and planting date. Agronomy Journal 96(5), 1372-1381.
Pelletier, J.M., Kwong, R.W., Park, S., Le, B.H., Baden, R., Cagliari, A., et al. (2017). LEC1 sequentially regulates the transcription of genes involved in diverse developmental processes during seed development. Proceedings of the National Academy of Sciences 114(32), E6710-E6719.
Phan, P., Chen, N., Xu, L., and Chen, Z. (2020). Using Multi-Temporal MODIS NDVI Data to Monitor Tea Status and Forecast Yield: A Case Study at Tanuyen, Laichau, Vietnam. Remote Sensing 12(11), 1814.
Prey, L., and Schmidhalter, U. (2019). Simulation of satellite reflectance data using high-frequency ground based hyperspectral canopy measurements for in-season
272
estimation of grain yield and grain nitrogen status in winter wheat. ISPRS Journal of Photogrammetry and Remote Sensing 149, 176-187.
Price, A.H. (2006). Believe it or not, QTLs are accurate! Trends in plant science 11(5), 213-216.
Price, J.C., and Bausch, W.C. (1995). Leaf area index estimation from visible and near-infrared reflectance data. Remote Sensing of Environment 52(1), 55-65.
Price, T., and Schluter, D. (1991). On the low heritability of life‐history traits. Evolution 45(4), 853-861.
Priolli, R.H.G., Campos, J., Stabellini, N., Pinheiro, J., and Vello, N. (2015). Association mapping of oil content and fatty acid components in soybean. Euphytica 203(1), 83-96.
Pritchard, J.K., and Donnelly, P. (2001). Case–control studies of association in structured or admixed populations. Theoretical population biology 60(3), 227-237.
Qi, Y. (2012). "Random forest for bioinformatics," in Ensemble machine learning. Springer), 307-323.
Qin, J., Song, Q., Shi, A., Li, S., Zhang, M., and Zhang, B. (2017). Genome-wide association mapping of resistance to Phytophthora sojae in a soybean [Glycine max (L.) Merr.] germplasm panel from maturity groups IV and V. PLOS ONE 12(9), e0184613. doi: 10.1371/journal.pone.0184613.
Qiu, R., Wei, S., Zhang, M., Li, H., Sun, H., Liu, G., et al. (2018). Sensors for measuring plant phenotyping: A review. International Journal of Agricultural and Biological Engineering 11(2), 1-17.
Raj, A., Stephens, M., and Pritchard, J.K. (2014). fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197(2), 573-589.
Ramasubramanian, V., and Beavis, W.D. (2020). Factors affecting Response to Recurrent Genomic Selection in Soybeans. BioRxiv.
Rashid, Z., Sofi, M., Harlapur, S.I., Kachapur, R.M., Dar, Z.A., Singh, P.K., et al. (2020). Genome-wide association studies in tropical maize germplasm reveal novel and known genomic regions for resistance to Northern corn leaf blight. Scientific reports 10(1), 1-16.
Raun, W.R., Solie, J.B., Johnson, G.V., Stone, M.L., Lukina, E.V., Thomason, W.E., et al. (2001). In-season prediction of potential grain yield in winter wheat using canopy reflectance. Agronomy Journal 93(1), 131-138.
Ravi, D., Wong, C., Lo, B., and Yang, G.-Z. (Year). "Deep learning for human activity recognition: A resource efficient implementation on low-power devices", in: 2016 IEEE 13th international conference on wearable and implantable body sensor networks (BSN): IEEE), 71-76.
Ray, D.K., Mueller, N.D., West, P.C., and Foley, J.A. (2013). Yield trends are insufficient to double global crop production by 2050. PloS one 8(6), e66428.
273
Ray, J.D., Dhanapal, A.P., Singh, S.K., Hoyos-Villegas, V., Smith, J.R., Purcell, L.C., et al. (2015). Genome-wide association study of ureide concentration in diverse maturity group IV soybean [Glycine max (L.) Merr.] accessions. G3: Genes, Genomes, Genetics 5(11), 2391-2403.
Rebetzke, G., Jimenez-Berni, J., Fischer, R., Deery, D., and Smith, D. (2019). High-throughput phenotyping to enhance the use of crop genetic resources. Plant Science 282, 40-48.
Rębilas, K., Klimek-Kopyra, A., Bacior, M., and Zając, T. (2020). A model for the yield losses estimation in an early soybean (Glycine max (L.) Merr.) cultivar depending on the cutting height at harvest. Field Crops Research 254, 107846.
Remington, D.L., Thornsberry, J.M., Matsuoka, Y., Wilson, L.M., Whitt, S.R., Doebley, J., et al. (2001). Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proceedings of the National Academy of Sciences 98(20), 11479-11484.
Reynolds, M. (2001). Application of physiology in wheat breeding. Cimmyt. Ribeiro, M.H.D.M., and dos Santos Coelho, L. (2020). Ensemble approach based on
bagging, boosting and stacking for short-term prediction in agribusiness time series. Applied Soft Computing 86, 105837.
Richards, R. (1982a). Breeding and selecting for drought resistance in wheat. Drought resistance in crops with emphasis on rice, 303-316.
Richards, R. (1982b). Breeding and selecting for drought resistant wheat. p. 303–316. Drought resistance in crops with emphasis on rice. IRRI, Manila, Philippines. Breeding and selecting for drought resistant wheat. p. 303–316. In Drought resistance in crops with emphasis on rice. IRRI, Manila, Philippines., -.
Richards, R. (2000). Selectable traits to increase crop photosynthesis and yield of grain crops. Journal of experimental botany 51(suppl_1), 447-458.
Richter, R., Reu, B., Wirth, C., Doktor, D., and Vohland, M. (2016). The use of airborne hyperspectral data for tree species classification in a species-rich Central European forest area. International journal of applied earth observation and geoinformation 52, 464-474.
Ripley, B., Venables, B., Bates, D.M., Hornik, K., Gebhardt, A., Firth, D., et al. (2013). Package ‘mass’. Cran R 538.
Robbins, M.D., and Staub, J.E. (2009). Comparative analysis of marker-assisted and phenotypic selection for yield components in cucumber. Theoretical and applied genetics 119(4), 621-634.
Robinson, A.P., Conley, S.P., Volenec, J.J., and Santini, J.B. (2009). Analysis of high yielding, early‐planted soybean in Indiana. Agronomy Journal 101(1), 131-139.
Robinson, G.K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical science 6(1), 15-32.
274
Rocha, A., Groen, T., Skidmore, A., Darvishzadeh, R., and Willemen, L. (2018). Machine learning using hyperspectral data inaccurately predicts plant traits under spatial dependency. Remote sensing 10(8), 1263.
Rodriguez-Galiano, V., Sanchez-Castillo, M., Chica-Olmo, M., and Chica-Rivas, M. (2015). Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees and support vector machines. Ore Geology Reviews 71, 804-818.
Rodriguez, D., Fitzgerald, G., Belford, R., and Christensen, L. (2006). Detection of nitrogen deficiency in wheat from spectral reflectance indices and basic crop eco-physiological concepts. Australian Journal of Agricultural Research 57(7), 781-789.
Rossel, R.A.V. (2008). ParLeS: Software for chemometric analysis of spectroscopic data. Chemometrics intelligent laboratory systems 90(1), 72-83.
Rossi, E.S., Kuki, M.C., Pinto, R.J., Scapim, C.A., Faria, M.V., and De Leon, N. (2020). Genomic-wide association study for white spot resistance in a tropical maize germplasm. Euphytica 216(1), 1-15.
Rotundo, J.L., Borrás, L., Westgate, M.E., and Orf, J.H. (2009). Relationship between assimilate supply per seed during seed filling and soybean seed composition. Field crops research 112(1), 90-96.
Roujean, J.-L., and Breon, F.-M. (1995). Estimating PAR absorbed by vegetation from bidirectional reflectance measurements. Remote sensing of Environment 51(3), 375-384.
Roy, K., Kar, S., and Das, R.N. (2015). "Statistical methods in QSAR/QSPR," in A primer on QSAR/QSPR modeling. Springer), 37-59.
Royo, C., Alvaro, F., Martos, V., Ramdani, A., Isidro, J., Villegas, D., et al. (2007). Genetic changes in durum wheat yield components and associated traits in Italian and Spanish varieties during the 20th century. Euphytica 155(1-2), 259-270.
Royo, C., Aparicio, N., Villegas, D., Casadesus, J., Monneveux, P., and Araus, J. (2003). Usefulness of spectral reflectance indices as durum wheat yield predictors under contrasting Mediterranean conditions. International Journal of Remote Sensing 24(22), 4403-4419.
Ruan, Y.-L., Patrick, J.W., Bouzayen, M., Osorio, S., and Fernie, A.R. (2012). Molecular regulation of seed and fruit set. Trends in plant science 17(11), 656-665.
Rutherford, A. (2001). Introducing ANOVA and ANCOVA: a GLM approach. London, UK: Sage.
Rutkoski, J., Poland, J., Mondal, S., Autrique, E., Párez, L.G., Crossa, J., et al. (2016a). Canopy temperature and vegetation indices from high-throughput phenotyping improve accuracy of pedigree and genomic selection for grain yield in wheat. G3: Genes, Genomes, Genetics, g3. 116.032888.
275
Rutkoski, J., Poland, J., Mondal, S., Autrique, E., Pérez, L.G., Crossa, J., et al. (2016b). Canopy temperature and vegetation indices from high-throughput phenotyping improve accuracy of pedigree and genomic selection for grain yield in wheat. G3: Genes, Genomes, Genetics 6(9), 2799-2808.
Sah, R., Chakraborty, M., Prasad, K., Pandit, M., Tudu, V., Chakravarty, M., et al. (2020). Impact of water deficit stress in maize: Phenology and yield components. Scientific reports 10(1), 1-15.
Sakamoto, T., Gitelson, A.A., and Arkebauer, T.J. (2013). MODIS-based corn grain yield estimation model incorporating crop phenology information. Remote Sensing of Environment 131, 215-231.
Savitzky, A., and Golay, M.J. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry 36(8), 1627-1639.
Schapire, R.E. (Year). "A brief introduction to boosting", in: Ijcai), 1401-1406. Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., et al. (2010).
Genome sequence of the palaeopolyploid soybean. nature 463(7278), 178. Schut, A.G., Traore, P.C.S., Blaes, X., and Rolf, A. (2018). Assessing yield and fertilizer
response in heterogeneous smallholder fields with UAVs and satellites. Field Crops Research 221, 98-107.
Schweiger, A.K., Cavender-Bares, J., Townsend, P.A., Hobbie, S.E., Madritch, M.D., Wang, R., et al. (2018). Plant spectral diversity integrates functional and phylogenetic components of biodiversity and predicts ecosystem function. Nature ecology evolution 2, 976–982.
Segura, V., Vilhjálmsson, B.J., Platt, A., Korte, A., Seren, Ü., Long, Q., et al. (2012). An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nature genetics 44(7), 825.
Seni, G., and Elder, J.F. (2010). Ensemble methods in data mining: improving accuracy through combining predictions. Synthesis lectures on data mining and knowledge discovery 2(1), 1-126.
Serneels, S., and Verdonck, T. (2008). Principal component analysis for data containing outliers and missing elements. Computational Statistics & Data Analysis 52(3), 1712-1727.
Serrano, L., Filella, I., and Penuelas, J. (2000). Remote sensing of biomass and yield of winter wheat under different nitrogen supplies. Crop science 40(3), 723-731.
Šestak, I., Zgorelec, Ž., Perčin, A., Mesić, M., and Galić, M. (2019). "Prediction of soybean leaf nitrogen content using proximal field spectroscopy", in: 54th Croatian & 14th International Symposium on Agriculture (Vodice, Croatia).
Severin, A.J., Woody, J.L., Bolon, Y.-T., Joseph, B., Diers, B.W., Farmer, A.D., et al. (2010). RNA-Seq Atlas of Glycine max: a guide to the soybean transcriptome. BMC plant biology 10(1), 1-16.
276
Sha, W., Guo, Y., Yuan, Q., Tang, S., Zhang, X., Lu, S., et al. (2020). Artificial Intelligence to Power the Future of Materials Science and Engineering. Advanced Intelligent Systems 2(4), 1900143.
Shao, Y., Zhao, C., Bao, Y., and He, Y. (2012). Quantification of nitrogen status in rice by least squares support vector machines and reflectance spectroscopy. Food Bioprocess Technology 5(1), 100-107.
Shen, D., Wu, G., and Suk, H.-I. (2017). Deep learning in medical image analysis. Annual review of biomedical engineering 19, 221-248.
Shigang, F., and Qian, A. (2007). Multi-objective reactive power optimization using SPEA2. High Voltage Engineering 33(9), 115-119.
Siebers, M.H., Yendrek, C.R., Drag, D., Locke, A.M., Rios Acosta, L., Leakey, A.D., et al. (2015). Heat waves imposed during early pod development in soybean (G lycine max) cause significant yield loss despite a rapid recovery from oxidative stress. Global change biology 21(8), 3114-3125.
Siegmann, B., and Jarmer, T. (2015). Comparison of different regression models and validation techniques for the assessment of wheat leaf area index from hyperspectral data. International journal of remote sensing 36(18), 4519-4534.
Silva, J.C.F., Teixeira, R.M., Silva, F.F., Brommonschenkel, S.H., and Fontes, E.P. (2019). Machine learning approaches and their current application in plant molecular biology: A systematic review. Plant Science 284, 37-47.
Simopoulos, C.M., Weretilnyk, E.A., and Golding, G.B. (2018). Prediction of plant lncRNA by ensemble machine learning classifiers. BMC genomics 19(1), 316.
Sims, D.A., and Gamon, J.A. (2002). Relationships between leaf pigment content and spectral reflectance across a wide range of species, leaf structures and developmental stages. Remote sensing of environment 81(2-3), 337-354.
Singh, S., Sehgal, D., Kumar, S., Arif, M., Vikram, P., Sansaloni, C., et al. (2020). GWAS revealed a novel resistance locus on chromosome 4D for the quarantine disease Karnal bunt in diverse wheat pre-breeding germplasm. Scientific reports 10(1), 1-11.
Slafer, G. (2003). Genetic basis of yield as viewed from a crop physiologist's perspective. Annals of Applied Biology 142(2), 117-128.
Slinkard, A., Solh, M., and Vandenberg, A. (2000). "Breeding for yield: the direct approach," in Linking Research and Marketing Opportunities for Pulses in the 21st Century. (Dordrecht, The Netherlands: Springer), 183-190.
Somegowda, V.K., Rayaprolu, L., Rathore, A., Deshpande, S.P., and Gupta, R. (2021). Genome-Wide Association Studies (GWAS) for Traits Related to Fodder Quality and Biofuel in Sorghum: Progress and Prospects. Protein and Peptide Letters.
Sonah, H., O'Donoughue, L., Cober, E., Rajcan, I., and Belzile, F. (2015). Identification of loci governing eight agronomic traits using a GBS‐GWAS approach and
277
validation by QTL mapping in soya bean. Plant biotechnology journal 13(2), 211-221.
Song, J., Sun, X., Zhang, K., Liu, S., Wang, J., Yang, C., et al. (2020). Identification of QTL and genes for pod number in soybean by linkage analysis and genome-wide association studies. Molecular Breeding 40(6), 1-14.
Specht, J., Hume, D., and Kumudini, S. (1999). Soybean yield potential—a genetic and physiological perspective. Crop science 39(6), 1560-1570.
Stephens, M., and Balding, D.J. (2009). Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 10(10), 681-690.
Streitner, C., Danisman, S., Wehrle, F., Schöning, J.C., Alfano, J.R., and Staiger, D. (2008). The small glycine‐rich RNA binding protein AtGRP7 promotes floral transition in Arabidopsis thaliana. The Plant Journal 56(2), 239-250.
Stroup, W., and Mulitze, D. (1991). Nearest neighbor adjusted best linear unbiased prediction. The American Statistician 45(3), 194-200.
Su, C.-T., and Yang, C.-H. (2008). Feature selection for the SVM: An application to hypertension diagnosis. Expert Systems with Applications 34(1), 754-763.
Su, Q., Lu, W., Du, D., Chen, F., Niu, B., and Chou, K.-C. (2017). Prediction of the aquatic toxicity of aromatic compounds to tetrahymena pyriformis through support vector regression. Oncotarget 8(30), 49359.
Suhre, J.J., Weidenbenner, N.H., Rowntree, S.C., Wilson, E.W., Naeve, S.L., Conley, S.P., et al. (2014). Soybean yield partitioning changes revealed by genetic gain and seeding rate interactions. Agronomy Journal 106(5), 1631-1642.
Sui, M., Wang, Y., Bao, Y., Wang, X., Li, R., Lv, Y., et al. (2020). Genome‐wide association analysis of sucrose concentration in soybean (Glycine max L.) seed based on high‐throughput sequencing. The Plant Genome 13(3), e20059.
Sulistyo, A., and Sari, K. (Year). "Correlation, path analysis and heritability estimation for agronomic traits contribute to yield on soybean", in: IOP Conference Series: Earth and Environmental Science), 012034.
Sun, S., Dong, B., and Zou, Q. (2020a). Revisiting genome-wide association studies from statistical modelling to machine learning. Briefings in Bioinformatics.
Sun, S., Wang, C., Ding, H., and Zou, Q. (2020b). Machine learning and its applications in plant molecular studies. Briefings in Functional Genomics 19(1), 40-48.
Sun, X., Shantharaj, D., Kang, X., and Ni, M. (2010). Transcriptional and hormonal signaling control of Arabidopsis seed development. Current opinion in plant biology 13(5), 611-620.
Suykens, J.A., and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural processing letters 9(3), 293-300.
Sweeney, D.W., Long, J.H., and Kirkham, M. (2003). A single irrigation to improve early maturing soybean yield and quality. Soil Science Society of America Journal 67(1), 235-240.
278
Szymczak, S., Biernacka, J.M., Cordell, H.J., González‐Recio, O., König, I.R., Zhang, H., et al. (2009). Machine learning in genome‐wide association studies. Genetic epidemiology 33(S1), S51-S57.
Tagarakis, A.C., Ketterings, Q.M., Lyons, S., and Godwin, G. (2017). Proximal sensing to estimate yield of brown midrib forage sorghum. Agronomy Journal 109(1), 107-114.
Taillardat, M., Mestre, O., Zamo, M., and Naveau, P. (2016). Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Monthly Weather Review 144(6), 2375-2393.
Tamiru, A., Paliwal, R., Manthi, S.J., Odeny, D.A., Midega, C.A., Khan, Z.R., et al. (2020). Genome wide association analysis of a stemborer egg induced “call-for-help” defence trait in maize. Scientific reports 10(1), 1-12.
Tamouridou, A.A., Alexandridis, T.K., Pantazi, X.E., Lagopodi, A.L., Kashefi, J., Kasampalis, D., et al. (2017). Application of multilayer perceptron with automatic relevance determination on weed mapping using UAV multispectral imagery. Sensors 17(10), 2307.
Tan, M., Wang, L., and Tsang, I.W. (Year). "Learning sparse svm for feature selection on very high dimensional datasets", in: ICML).
Tang, R., Sinnwell, J.P., Li, J., Rider, D.N., de Andrade, M., and Biernacka, J.M. (Year). "Identification of genes and haplotypes that predict rheumatoid arthritis using random forests", in: BMC proceedings: BioMed Central), 1-5.
Tardieu, F., Cabrera-Bosquet, L., Pridmore, T., and Bennett, M. (2017). Plant phenomics, from sensors to knowledge. Current Biology 27(15), R770-R783.
Taylor, J. (1997). Introduction to error analysis, the study of uncertainties in physical measurements.
Temesgen, D., and Assefa, F. (2020). Inoculation of native symbiotic effective Sinorhizobium spp. enhanced soybean [Glycine max (L.) Merr.] grain yield in Ethiopia. Environmental Systems Research 9(1), 1-19.
Tester, M., and Langridge, P. (2010). Breeding technologies to increase crop production in a changing world. Science 327(5967), 818-822.
Thenkabail, P.S., Mariotto, I., Gumma, M.K., Middleton, E.M., Landis, D.R., and Huemmrich, K.F. (2013). Selection of hyperspectral narrowbands (HNBs) and composition of hyperspectral twoband vegetation indices (HVIs) for biophysical characterization and discrimination of crop types using field reflectance and Hyperion/EO-1 data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 6(2), 427-439.
Thenkabail, P.S., Smith, R.B., and De Pauw, E. (2000). Hyperspectral vegetation indices and their relationships with agricultural crop characteristics. Remote sensing of Environment 71(2), 158-182.
279
Thomson, G. (1995). Mapping disease genes: family-based association studies. American journal of human genetics 57(2), 487.
Tian, Y., Yao, X., Yang, J., Cao, W., Hannaway, D., and Zhu, Y. (2011a). Assessing newly developed and published vegetation indices for estimating rice leaf nitrogen concentration with ground-and space-based hyperspectral reflectance. Field Crops Research 120(2), 299-310.
Tian, Y., Zhao, C., Lu, S., and Guo, X. (2011b). Multiple classifier combination for recognition of wheat leaf diseases. Intelligent Automation & Soft Computing 17(5), 519-529.
Tibbs Cortes, L., Zhang, Z., and Yu, J. Status and prospects of genome‐wide association studies in plants. The Plant Genome, e20077.
Tilman, D., Socolow, R., Foley, J.A., Hill, J., Larson, E., Lynd, L., et al. (2009). Beneficial biofuels—the food, energy, and environment trilemma. Science 325(5938), 270-271.
Tiwari, P., and Shukla, P. (2020). "Artificial Neural Network-Based Crop Yield Prediction Using NDVI, SPI, VCI Feature Vectors," in Information and Communication Technology for Sustainable Development. Springer), 585-594.
Topol, E.J. (2020). Welcoming new guidelines for AI clinical research. Nature medicine 26(9), 1318-1320.
Torkamaneh, D., Laroche, J., Bastien, M., Abed, A., and Belzile, F. (2017). Fast-GBS: a new pipeline for the efficient and highly accurate calling of SNPs from genotyping-by-sequencing data. BMC bioinformatics 18(1), 5.
Tsai, H.-Y., Janss, L.L., Andersen, J.R., Orabi, J., Jensen, J.D., Jahoor, A., et al. (2020). Genomic prediction and GWAS of yield, quality and disease-related traits in spring barley and winter wheat. Scientific reports 10(1), 1-15.
Tucker, C.J. (1979). Red and photographic infrared linear combinations for monitoring vegetation. Remote sensing of Environment 8(2), 127-150.
Tudcer, C.J., Elgin Jr, J.H., and McMurtrey, J.E. (1978). Monitoring Corn and Soybean Crop Development by Remote Sensing Techniques.
Tugizimana, F., Engel, J., Salek, R., Dubery, I., Piater, L., and Burgess, K. (2020). "The disruptive 4IR in the life sciences: Metabolomics," in The Disruptive Fourth Industrial Revolution. Springer), 227-256.
Tulpan, D. (2020). 311 A brief overview, comparison and practical applications of machine learning models. Journal of Animal Science 98(Supplement_4), 44-45.
Tyug, T.S., Prasad, K.N., and Ismail, A. (2010). Antioxidant capacity, phenolics and isoflavones in soybean by-products. Food chemistry 123(3), 583-589.
Van Der Meij, B., Kooistra, L., Suomalainen, J., Barel, J.M., and De Deyn, G.B. (2017). Remote sensing of plant trait responses to field-based plant-soil feedback using UAV-based optical sensors. Biogeosciences 14(3), 733.
280
Vapnik, V. (2000). The nature of statistical learning theory. SpringerVerlag New York. Inc., New York, NY, USA.
Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media.
Vapnik, V.N. (1995). The nature of statistical learning. Theory. Verma, A.K., Garg, P.K., Hari Prasad, K., and Dadhwal, V.K. (2020). Modelling of
sugarcane yield using LISS-IV data based on ground LAI and yield observations. Geocarto International 35(8), 887-904.
Vigier, B.J., Pattey, E., and Strachan, I.B. (2004). Narrowband vegetation indexes and detection of disease damage in soybeans. IEEE Geoscience and Remote Sensing Letters 1(4), 255-259.
Voldeng, H., Cober, E., Hume, D., Gillard, C., and Morrison, M. (1997). Fifty-eight years of genetic improvement of short-season soybean cultivars in Canada. Crop Science 37(2), 428-431.
Vuong, T., Sonah, H., Meinhardt, C., Deshmukh, R., Kadam, S., Nelson, R., et al. (2015). Genetic architecture of cyst nematode resistance revealed by genome-wide association study in soybean. BMC genomics 16(1), 1-13.
Wan, L., Cen, H., Zhu, J., Zhang, J., Zhu, Y., Sun, D., et al. (2020). Grain yield prediction of rice using multi-temporal UAV-based RGB and multispectral images and model transfer–a case study of small farmlands in the South of China. Agricultural and Forest Meteorology 291, 108096.
Wang, H.-j., and Murphy, P.A. (1994). Isoflavone content in commercial soybean foods. Journal of agricultural and food chemistry 42(8), 1666-1673.
Wang, H., Khoshgoftaar, T.M., and Napolitano, A. (2012). Software measurement data reduction using ensemble techniques. Neurocomputing 92, 124-132.
Wang, J., Zhang, X., Cheng, L., and Luo, Y. (2020). An overview and metanalysis of machine and deep learning-based CRISPR gRNA design tools. RNA biology 17(1), 13-22.
Wang, L., Hu, X., Jiao, C., Li, Z., Fei, Z., Yan, X., et al. (2016). Transcriptome analyses of seed development in grape hybrids reveals a possible mechanism influencing seed size. BMC genomics 17(1), 1-15.
Wang, L., Wang, T.-g., and Luo, Y. (2011). Improved non-dominated sorting genetic algorithm (NSGA)-II in multi-objective optimization studies of wind turbine blades. Applied Mathematics Mechanics 32(6), 739-748.
Wang, S.-C. (2003). "Genetic algorithm," in Interdisciplinary Computing in Java Programming. Springer), 101-116.
Wang, Y., and Gao, W. (2018). Prediction of the water content of biodiesel using ANN-MLP: An environmental application. Energy Sources, Part A: Recovery, Utilization, and Environmental Effects 40(8), 987-993.
281
Wang, Y., Wang, F., Huang, J., Wang, X., and Liu, Z. (2009). Validation of artificial neural network techniques in the estimation of nitrogen concentration in rape using canopy hyperspectral reflectance data. International Journal of Remote Sensing 30(17), 4493-4505.
Watanabe, K., Guo, W., Arai, K., Takanashi, H., Kajiya-Kanegae, H., Kobayashi, M., et al. (2017). High-throughput phenotyping of sorghum plant height using an unmanned aerial vehicle and its application to genomic prediction modeling. Frontiers in plant science 8, 421.
Watanabe, S., Hideshima, R., Xia, Z., Tsubokura, Y., Sato, S., Nakamoto, Y., et al. (2009). Map-based cloning of the gene associated with the soybean maturity locus E3. Genetics 182(4), 1251-1262.
Weber, H., Borisjuk, L., and Wobus, U. (2005). Molecular physiology of legume seed development. Annu. Rev. Plant Biol. 56, 253-279.
Weber, V., Araus, J., Cairns, J., Sanchez, C., Melchinger, A., and Orsini, E. (2012). Prediction of grain yield using reflectance spectra of canopy and leaves in maize plants grown under different water regimes. Field Crops Research 128, 82-90.
Wei, M.C.F., and Molin, J.P. (2020). Soybean Yield Estimation and Its Components: A Linear Regression Approach. Agriculture 10(8), 348.
Wen, Y.-J., Zhang, H., Ni, Y.-L., Huang, B., Zhang, J., Feng, J.-Y., et al. (2018). Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Briefings in bioinformatics 19(4), 700-712.
Wen, Z., Tan, R., Yuan, J., Bales, C., Du, W., Zhang, S., et al. (2014). Genome-wide association mapping of quantitative resistance to sudden death syndrome in soybean. BMC genomics 15(1), 1-11.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., and Vapnik, V. (Year). "Feature selection for SVMs", in: Advances in neural information processing systems), 668-674.
Wickham, H. (2011). ggplot2. Wiley Interdisciplinary Reviews: Computational Statistics 3(2), 180-185.
Wickham, H., Chang, W., and Wickham, M.H. (2016). Package ‘ggplot2’. Create Elegant Data Visualisations Using the Grammar of Graphics. Version 2(1), 1-189.
Wiegand, C.L., and Richardson, A.J. (1990). Use of spectral vegetation indices to infer leaf area, evapotranspiration and yield: I. Rationale. Agronomy Journal 82(3), 623-629.
Wilamowski, B.M., and Jaeger, R.C. (Year). "Implementation of RBF type networks by MLP networks", in: Proceedings of International Conference on Neural Networks (ICNN'96): IEEE), 1670-1675.
Williamson, B.D., Gilbert, P.B., Simon, N.R., and Carone, M. (2020). A unified approach for inference on algorithm-agnostic variable importance. arXiv preprint arXiv:2004.03683.
282
Wilson, R.F. (2012). The role of genomics and biotechnology in achieving global food security for high-oleic vegetable oil. Journal of oleo science 61(7), 357-367.
Woolley, J.T. (1971). Reflectance and transmittance of light by leaves. Plant physiology 47(5), 656-662.
Wu, C., Niu, Z., Tang, Q., Huang, W., Rivard, B., and Feng, J. (2009). Remote estimation of gross primary production in wheat using chlorophyll-related vegetation indices. Agricultural and Forest Meteorology 149(6-7), 1015-1021.
Wu, L., Chang, Y., Wang, L., Wu, J., and Wang, S. (2021). Genetic dissection of drought resistance based on root traits at the bud stage in common bean. Theoretical and Applied Genetics, 1-15.
Wu, Y., and Liu, Y. (2009a). Variable selection in quantile regression. Statistica Sinica, 801-817.
Wu, Y., and Liu, Y. (2009b). Variable selection in quantile regression. Statistica Sinica 19(2), 801.
Wu, Y., Miao, Q., Ma, W., Gong, M., and Wang, S. (2017). PSOSAC: particle swarm optimization sample consensus algorithm for remote sensing image registration. IEEE Geoscience and Remote Sensing Letters 15(2), 242-246.
Xavier, A., Jarquin, D., Howard, R., Ramasubramanian, V., Specht, J.E., Graef, G.L., et al. (2018). Genome-wide analysis of grain yield stability and environmental interactions in a multiparental soybean population. G3: Genes, Genomes, Genetics 8(2), 519-529.
Xavier, A., Muir, W.M., and Rainey, K.M. (2016a). Assessing predictive properties of genome-wide selection in soybeans. G3: Genes, Genomes, Genetics 6(8), 2611-2616.
Xavier, A., Muir, W.M., and Rainey, K.M. (2016b). Impact of imputation methods on the amount of genetic variation captured by a single-nucleotide polymorphism panel in soybeans. BMC bioinformatics 17(1), 1-9.
Xavier, A., and Rainey, K.M. (2020). Quantitative Genomic Dissection of Soybean Yield Components. G3: Genes, Genomes, Genetics 10(2), 665-675.
Xiong, Q., Tang, G., Zhong, L., He, H., and Chen, X. (2018). Response to Nitrogen Deficiency and Compensation on Physiological Characteristics, Yield Formation, and Nitrogen Utilization of Rice. Frontiers in Plant Science 9(1075). doi: 10.3389/fpls.2018.01075.
Xue, L., Cao, W., Luo, W., Dai, T., and Zhu, Y. (2004). Monitoring leaf nitrogen status in rice with canopy spectral reflectance. Agronomy Journal 96(1), 135-142.
Yadav, A.K., and Singla, D. (2020). VacPred: Sequence-based prediction of plant vacuole proteins using machine-learning techniques. Journal of Biosciences 45(1), 1-9.
Yamamoto, K. (2019). Distillation of crop models to learn plant physiology theories using machine learning. PloS one 14(5), e0217075.
283
Yang, J., Ferreira, T., Morris, A.P., Medland, S.E., Madden, P.A., Heath, A.C., et al. (2012). Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature genetics 44(4), 369-375.
Yang, J., Yeh, C.-T.E., Ramamurthy, R.K., Qi, X., Fernando, R.L., Dekkers, J.C., et al. (2018). Empirical comparisons of different statistical models to identify and validate kernel row number-associated variants from structured multi-parent mapping populations of maize. G3: Genes, Genomes, Genetics 8(11), 3567-3575.
Yang, P., Hwa Yang, Y., B Zhou, B., and Y Zomaya, A. (2010). A review of ensemble methods in bioinformatics. Current Bioinformatics 5(4), 296-308.
Yao, L., Li, Y., Ma, C., Tong, L., Du, F., and Xu, M. (2020). Combined genome‐wide association study and transcriptome analysis reveal candidate genes for resistance to Fusarium ear rot in maize. Journal of integrative plant biology 62(10), 1535-1551.
Ye, Y., Chen, L., Wang, D., Li, T., Jiang, Q., and Zhao, M. (2009). SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging. Journal in computer virology 5(4), 283.
Yeom, S., Giacomelli, I., Menaged, A., Fredrikson, M., and Jha, S. (2020). Overfitting, robustness, and malicious algorithms: A study of potential causes of privacy risk in machine learning. Journal of Computer Security 28(1), 35-70.
Yilmaz, I., and Kaynar, O. (2011). Multiple regression, ANN (RBF, MLP) and ANFIS models for prediction of swell potential of clayey soils. Expert systems with applications 38(5), 5958-5966.
Yin, Z., Qi, H., Mao, X., Wang, J., Hu, Z., Wu, X., et al. (2018). QTL mapping of soybean node numbers on the main stem and meta-analysis for mining candidate genes. Biotechnology & Biotechnological Equipment 32(4), 915-922.
Yoosefzadeh-Najafabadi, M., Earl, H.J., Tulpan, D., Sulik, J., and Eskandari, M. (2021a). Application of machine learning algorithms in plant breeding: predicting yield from hyperspectral reflectance in soybean. Frontiers in plant science 11, 2169.
Yoosefzadeh-Najafabadi, M., Earl, H.J., Tulpan, D., Sulik, J., and Eskandari, M. (2021b). Application of Machine Learning Algorithms in Plant Breeding: Predicting Yield From Hyperspectral Reflectance in Soybean. Frontiers in Plant Science 11(2169). doi: 10.3389/fpls.2020.624273.
Yoosefzadeh-Najafabadi, M., Tulpan, D., and Eskandari, M. (2021c). Application of machine learning and genetic optimization algorithms for modeling and optimizing soybean yield using its component traits. Plos one 16(4), e0250665.
You, J., Li, X., Low, M., Lobell, D., and Ermon, S. (Year). "Deep gaussian process for crop yield prediction based on remote sensing data", in: Thirty-First AAAI conference on artificial intelligence).
284
Yu, B., Lam, H., Shao, G., and Liu, Y. (2005). Effects of salinity on activities of H+-ATPase, H+-PPase and membrane lipid composition in plasma membrane and tonoplast vesicles isolated from soybean (Glycine max L.) seedlings. Journal of environmental sciences (China) 17(2), 259-262.
Yu, K., Gnyp, M., Gao, J., Miao, Y., Chen, X., and Bareth, G. (Year). "Using Partial Least Squares (PLS) to Estimate Canopy Nitrogen and Biomass of Paddy Rice in China’s Sanjiang Plain", in: Proceedings of the Workshop on UAV-based Remote Sensing Methods for Monitoring Vegetation, edited by: Bendig, J. and Bareth, G., Kölner Geographische Arbeiten, Cologne), 99-103.
Yu, N., Li, L., Schmitz, N., Tian, L.F., Greenberg, J.A., and Diers, B.W. (2016). Development of methods to improve soybean yield estimation and predict plant maturity with an unmanned aerial vehicle based platform. Remote Sensing of Environment 187, 91-101.
Yuan, J., Njiti, V., Meksem, K., Iqbal, M., Triwitayakorn, K., Kassem, M.A., et al. (2002). Quantitative trait loci in two soybean recombinant inbred line populations segregating for yield and disease resistance. Crop science 42(1), 271-277.
Yun, Y., Chuluunsukh, A., and Gen, M. (2020). Sustainable closed-loop supply chain design problem: A hybrid genetic algorithm approach. Mathematics 8(1), 84.
Zaloga, A., Burakov, S., Yakimov, I., Gusev, K., and Dubinin, P. (Year). "Multi-population evolutionary algorithm SPEA2 for crystal structure determination from X-ray powder diffraction data", in: IOP Conference Series: Materials Science and Engineering: IOP Publishing), 012102.
Zarco‐Tejada, P.J., and Miller, J.R. (1999). Land cover mapping at BOREAS using red edge spectral parameters from CASI imagery. Journal of Geophysical Research: Atmospheres 104(D22), 27921-27933.
Zeng, Q., and Huang, H. (2014). A stable and optimized neural network model for crash injury severity prediction. Accident Analysis & Prevention 73, 351-358.
Zeng, Q., Huang, H., Pei, X., and Wong, S. (2016a). Modeling nonlinear relationship between crash frequency by severity and contributing factors by neural networks. Analytic methods in accident research 10, 12-25.
Zeng, Q., Huang, H., Pei, X., Wong, S., and Gao, M. (2016b). Rule extraction from an optimized neural network for traffic crash frequency modeling. Accident Analysis & Prevention 97, 87-95.
Zha, H., Miao, Y., Wang, T., Li, Y., Zhang, J., Sun, W., et al. (2020). Improving unmanned aerial vehicle remote sensing-based rice nitrogen nutrition index prediction with machine learning. Remote Sensing 12(2), 215.
Zhang, C., Pan, X., Li, H., Gardiner, A., Sargent, I., Hare, J., et al. (2018a). A hybrid MLP-CNN classifier for very fine resolution remotely sensed image classification. ISPRS Journal of Photogrammetry and Remote Sensing 140, 133-144.
285
Zhang, H., Hao, D., Sitoe, H.M., Yin, Z., Hu, Z., Zhang, G., et al. (2015a). Genetic dissection of the relationship between plant architecture and yield component traits in soybean (Glycine max) by association analysis across multiple environments. Plant Breeding 134(5), 564-572.
Zhang, J., Song, Q., Cregan, P.B., and Jiang, G.-L. (2016). Genome-wide association study, genomic prediction and marker-assisted selection for seed weight in soybean (Glycinemax). Theoretical and Applied Genetics 129(1), 117-130.
Zhang, J., Song, Q., Cregan, P.B., Nelson, R.L., Wang, X., Wu, J., et al. (2015b). Genome-wide association study for flowering time, maturity dates and plant height in early maturing soybean (Glycine max) germplasm. BMC genomics 16(1), 1-11.
Zhang, J., Song, Q., Cregan, P.B., Nelson, R.L., Wang, X., Wu, J., et al. (2015c). Genome-wide association study for flowering time, maturity dates and plant height in early maturing soybean (Glycine max) germplasm. BMC genomics 16(1), 217.
Zhang, J., Wang, X., Lu, Y., Bhusal, S.J., Song, Q., Cregan, P.B., et al. (2018b). Genome-wide Scan for Seed Composition Provides Insights into Soybean Quality Improvement and the Impacts of Domestication and Breeding. Molecular Plant 11(3), 460-472. doi: 10.1016/j.molp.2017.12.016.
Zhang, N., Pan, Y., Feng, H., Zhao, X., Yang, X., Ding, C., et al. (2019a). Development of Fusarium head blight classification index using hyperspectral microscopy images of winter wheat spikelets. Biosystems Engineering 186, 83-99.
Zhang, P., Yan, X., Gebrewahid, T.-W., Zhou, Y., Yang, E., Xia, X., et al. (2021a). Genome-wide association mapping of leaf rust and stripe rust resistance in wheat accessions using the 90K SNP array. Theoretical and Applied Genetics, 1-19.
Zhang, P., Yin, Z.-Y., Jin, Y.-F., and Chan, T.H. (2020). A novel hybrid surrogate intelligent model for creep index prediction based on particle swarm optimization and random forest. Engineering Geology 265, 105328.
Zhang, S., Du, H., Ma, Y., Li, H., Kan, G., and Yu, D. (2021b). Linkage and association study discovered loci and candidate genes for glycinin and β-conglycinin in soybean (Glycine max L. Merr.). Theoretical and Applied Genetics, 1-15.
Zhang, X.-D. (2020). "Machine learning," in A Matrix Algebra Approach to Artificial Intelligence. Springer), 223-440.
Zhang, Z., Ersoz, E., Lai, C.-Q., Todhunter, R.J., Tiwari, H.K., Gore, M.A., et al. (2010). Mixed linear model approach adapted for genome-wide association studies. Nature genetics 42(4), 355.
Zhang, Z., Jin, Y., Chen, B., and Brown, P. (2019b). California almond yield prediction at the orchard level with a machine learning approach. Frontiers in plant science 10, 809.
Zhao, B., Ma, B.L., Hu, Y., and Liu, J. (2015). Characterization of nitrogen and water status in oat leaves using optical sensing approach. Journal of the Science of Food and Agriculture 95(2), 367-378.
286
Zhao, F., Lei, W., Ma, W., Liu, Y., and Zhang, C. (2016). An improved SPEA2 algorithm with adaptive selection of evolutionary operators scheme for multiobjective optimization problems. Mathematical Problems in Engineering 2016.
Zhao, Y., Zhu, H., Lu, Z., Knickmeyer, R.C., and Zou, F. (2019). Structured genome-wide association studies with Bayesian hierarchical variable selection. Genetics 212(2), 397-415.
Zheng, W., Zhu, X., Wen, G., Zhu, Y., Yu, H., and Gan, J. (2020). Unsupervised feature selection by self-paced learning regularization. Pattern Recognition Letters 132, 4-11.
Zhou, W., Bellis, E.S., Stubblefield, J., Causey, J., Qualls, J., Walker, K., et al. (2019). Minor QTLs mining through the combination of GWAS and machine learning feature selection. BioRxiv, 712190.
Zhou, X., and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nature genetics 44(7), 821.
Zhou, X., Zheng, H., Xu, X., He, J., Ge, X., Yao, X., et al. (2017). Predicting grain yield in rice using multi-temporal vegetation indices from UAV-based multispectral and digital imagery. ISPRS Journal of Photogrammetry and Remote Sensing 130, 246-255.
Zhou, Z.-H. (2012). Ensemble methods: foundations and algorithms. Boca Raton, FL, USA: CRC press.
Zhu, Y., Yao, X., Tian, Y., Liu, X., and Cao, W. (2008). Analysis of common canopy vegetation indices for indicating leaf nitrogen accumulations in wheat and rice. International Journal of Applied Earth Observation and Geoinformation 10(1), 1-10.
Ziliak, S. (2017). P values and the search for significance. Nature Methods 14(1), 3-4. Zitzler, E., Laumanns, M., and Thiele, L. (2001). SPEA2: Improving the strength Pareto
evolutionary algorithm. TIK-report 103. doi: doi:10.3929/ethz-a-004284029.