Correlation between Genetic Diversity and Fitness in a Predator-Prey Ecosystem Simulation


Marwa Khater, Elham Salehi, Robin Gras

School of Computer Science, University of Windsor, ON, Canada

{khater, salehie, rgras}@uwindsor.ca

Abstract. Biologists are interested in studying the relation between the genetic diversity of a population and its fitness. We adopt the notion of entropy as a measure of genetic diversity and correlate it with fitness in an evolutionary ecosystem simulation. EcoSim is a predator-prey individual-based simulation which models co-evolving sexual individuals evolving in a dynamic environment. The correlation values between entropy and fitness of all the species that ever existed during the whole simulation are presented. We show how entropy strongly correlates with fitness and investigate the factors behind these results using machine learning techniques. We build a classifier based on different features of the species and successfully predict the resulting correlation value between entropy and fitness. The features that most affect the quality of classification are also investigated and selected.

Keywords: artificial life modeling, individual-based modeling, genetic diversity, entropy, fitness

1 Introduction

Genetic diversity serves as a way for populations to adapt to changing environments. With more variation, it is more likely that some individuals in a population will possess variations of alleles that are suited for the environment. Those individuals are more likely to survive to produce offspring bearing that allele. The population will continue for more generations because of the success of these individuals. In summary, genetic diversity strengthens a population by increasing the likelihood that at least some of the individuals will be able to survive major disturbances, and by making the group less susceptible to inherited disorders. Many biological studies have shown that decreased population genetic diversity can be associated with declines in population fitness. Because overall population diversity affects both short-term individual fitness and long-term population adaptive capacity, there is a need to develop an empirical quantitative understanding of the relationship between population genetic diversity and population viability [10] [7] [16].


As in many disciplines, simulation modeling has played a great role in studying evolutionary processes. In this paper we investigate the relation between species fitness and species genetic diversity using EcoSim, an individual-based predator-prey ecosystem simulation. We adapt the notion of Shannon entropy, which is a measure of unpredictability and disorder in information theory, as a measure of genetic diversity and study the correlation between it and species fitness. We present the different correlation values obtained between entropy and fitness and investigate the factors behind these values. The rest of the paper is organized as follows: a brief description of the model is presented in Section 2. Section 3 details entropy as a genetic diversity measure. The correlation results between entropy and fitness are presented in Section 4. Building a classifier for inference and selecting features are described in Section 5, followed by a conclusion in Section 6.

2 The Model

In order to investigate several open theoretical ecological questions, we have designed the individual-based evolving predator-prey ecosystem simulation platform EcoSim, introduced by Gras et al. [4] [5] [3]. Our objective is to study how individual and local events can affect high-level mechanisms such as community formation, speciation or evolution. In this paper, we have used and extended EcoSim to compute and study the relation between genetic diversity and fitness. EcoSim uses a Fuzzy Cognitive Map as a behavior model [6], which combines compactness with a very low computational requirement while having the capacity to represent complex high-level notions. The complex adaptive agents (or individuals) of this simulation are either prey or predators, which act in a dynamic environment of 1000 x 1000 cells. Each cell may contain several individuals and some amount of food from which individuals gain energy. Prey consume grass, which is dynamically distributed, whereas predators predate on prey individuals. An individual consumes some energy each time it performs an action such as evasion, search for food, eating and breeding. Each individual performs one action during a time step based on its perception of the environment.

A Fuzzy Cognitive Map (FCM) [6] is used to model the individual's behavior and to compute the next action to be performed. The individual's genome is encoded in the FCM, through which evolution acts. Each agent possesses its own unique FCM, and the system can still manage several hundreds of thousands of such agents simultaneously in the world with reasonable computational requirements. A typical run lasts several tens of thousands of time steps, during which several hundreds of millions of agents will be born and several thousands of species [1] will be generated, allowing evolutionary processes to take place and new behaviors to emerge in reaction to a constantly changing environment. An FCM is a graph which contains a set of nodes, each node being a concept, and a set of edges, each edge representing the influence of one concept on another. In each FCM, three kinds of concepts are defined: sensitive (such as distance to foe or food, amount of energy, etc.), internal (fear, hunger, curiosity, satisfaction, etc.) and motor (evasion, socialization, exploration, breeding, etc.). The FCM serves as a genome for each individual. The genome length is fixed to 390 sites, where each site corresponds to an edge between two concepts of the FCM. In a breeding event, the FCMs of the two parents are combined and transmitted to their offspring after the possible addition of some mutations. The behavior model of each individual is therefore unique.
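The breeding step described above can be illustrated with a minimal sketch. The text only states that the parents' FCMs are combined and that mutations may be added, so the uniform crossover, the Gaussian mutation, and the parameters `mutation_rate` and `mutation_scale` below are assumptions made for illustration and are not taken from EcoSim.

```python
import random

GENOME_LENGTH = 390  # number of FCM edges (sites), as stated in the text


def breed(parent_a, parent_b, mutation_rate=0.005, mutation_scale=0.1, rng=random):
    """Combine two parental genomes site by site, then mutate a few sites.

    Uniform crossover and Gaussian mutation are assumptions of this sketch;
    the source does not specify the actual operators.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have genomes of equal length")
    child = []
    for site_a, site_b in zip(parent_a, parent_b):
        # Each site (edge weight) is inherited from one of the two parents.
        weight = site_a if rng.random() < 0.5 else site_b
        if rng.random() < mutation_rate:
            # Hypothetical mutation: a small random perturbation of the weight.
            weight += rng.gauss(0.0, mutation_scale)
        child.append(weight)
    return child
```

With `mutation_rate=0.0`, every site of the child is copied unchanged from one of its parents, which makes the crossover part easy to check in isolation.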

3 Entropy as a Measure of Genetic Diversity

Numerous diversity measures and methods exist, depending on the specific problem or representation being used, ranging from the biological domain to genetic programming. The use of information-theoretic measures such as Shannon entropy [13] or mutual information has been controversial in many of the areas of biology that aim to understand how organisms have evolved to deal with information, including behavioral biology, evolutionary ecology and genetics. Sherwin [14] showed that Shannon entropy is well suited to measuring diversity in ecological communities and in genetics. He also highlighted the advantages of using entropy-based genetic diversity measures and surveyed these measures. A close relationship has been found between biological concepts of Darwinian fitness and information-theoretic measures such as Shannon entropy or mutual information. Furthermore, it was shown that in evolving biological systems, the fitness value of information is bounded above by the Shannon entropy [2]. Shannon entropy can be considered as a measure of both diversity and randomness. Shannon information theory defines uncertainty (entropy) as the number of bits needed to fully specify a situation, given a set of probabilities. These probabilities can be estimated by simply counting the abundance of each genotype (the value found at a given site) in the population. Therefore, these probabilities are only meaningful when calculated with respect to a population of individuals. The entropy content of the whole sequence (genome) is approximated by computing the per-site entropy and then summing over all sites in the sequence. This is only an approximation because it ignores interactions between sites (epistasis).
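The per-site estimate described above can be sketched as follows. For one site with observed value frequencies p, the entropy is H = -sum(p * log2(p)), and the genome entropy is the sum of H over all sites. The sketch assumes that site values are discrete (or have been discretized beforehand), which the text does not state explicitly; the function names are illustrative.

```python
from collections import Counter
from math import log2


def site_entropy(values):
    """Shannon entropy (in bits) of the values observed at one genome site."""
    counts = Counter(values)
    total = len(values)
    # Probabilities are estimated from observed frequencies in the population.
    return -sum((c / total) * log2(c / total) for c in counts.values())


def genome_entropy(population):
    """Sum of per-site entropies over all sites; interactions are ignored."""
    if not population:
        return 0.0
    sites = zip(*population)  # transpose: one tuple of values per site
    return sum(site_entropy(site) for site in sites)
```

For example, in a population of four two-site genomes where the first site is identical everywhere and the second site takes two values equally often, the first site contributes 0 bits and the second contributes 1 bit.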

Shannon entropy measures the variation between genomes in a population and mirrors the genomic diversity. The lower the entropy, the less diverse the genomes of a population are, and vice versa. There are limits to the desirable values of entropy: a value approaching the maximum entropy indicates a highly heterogeneous population close to randomness. On the other hand, very low entropy (close to 0) means too much similarity between individual genomes, which need to diverge more in order to learn and survive in their dynamic environment. When the entropy stays within a certain range (it approaches neither extreme, maximum entropy or 0), it can be considered an indication of desirable diversity. A good balance between learning from the environment (low genetic diversity) and increasing the diversity (high genetic diversity) should therefore be met in order to ensure the well-being of a species. Initially, all prey individuals are given the same genome, and all predator individuals are given the same genome. Step after step, as more individuals are created, changes in the FCM occur. After the simulation stabilizes, we estimate the probability of all different values in each site based on the observed frequencies in the species population. At each time step we have a value of entropy for every existing species, the maximum entropy, and also a value for the global entropy of all prey and of all predator individuals, respectively. We also calculate the fitness of every species as the average fitness of its individuals. We define the fitness of an individual as the age of death of the individual plus the ages of death of all its offspring. Accordingly, the fitness value mirrors the individual's capability to survive longer and produce a high number of strong adaptive offspring.
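The fitness definition above can be written down as a short sketch. It assumes that "offspring" means the direct offspring of an individual and that their ages at death are summed; both are readings of the text, not confirmed details of EcoSim. The input layout (pairs of an age and a list of ages) is hypothetical.

```python
from statistics import mean


def individual_fitness(age_at_death, offspring_ages_at_death):
    """Age at death of the individual plus the ages at death of its offspring."""
    return age_at_death + sum(offspring_ages_at_death)


def species_fitness(members):
    """Average individual fitness over a species.

    `members` is a list of (age_at_death, offspring_ages_at_death) pairs,
    a hypothetical layout chosen for this sketch.
    """
    return mean(individual_fitness(age, kids) for age, kids in members)
```

An individual that dies at age 10 with two offspring dying at ages 5 and 7 has fitness 22; together with a childless individual dying at age 4, the species fitness is 13.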

4 Measuring Correlation between Entropy and Fitness

EcoSim gives us the chance to study the relation between species genetic diversity and species fitness, not only in certain environmental conditions and at a specific time, as is done in biological studies [7] [16] [8], but also through evolution. In EcoSim the environment changes from one place to another and from one time step to another. Individuals that evolve in different parts of the world have different information stored in their genome about the environment they evolve in. Furthermore, as we model a predator-prey system, we have co-evolution. This means that the strategies (behaviors) of each kind are continuously changing, trying to adapt to the other kind. Thus there are many factors affecting genetic diversity and fitness and controlling the values of the correlation between them. At every time step of the simulation we calculate entropy and fitness for all existing species, which results in a time series for each. In order to investigate their possible correlations, we first calculate the Spearman cross-correlation [15] between the genetic diversity of all prey species and their fitness. The Spearman measure ranks two sets of variables and tests for a linear relationship between the variables' ranks. A perfect Spearman correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other.
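As a minimal sketch of the Spearman measure described above, the coefficient can be computed by ranking both series (tied values share the mean of their ranks) and taking the Pearson correlation of the ranks. The function names are illustrative, and returning 0.0 for a constant series is a convention of this sketch only.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one (a tie).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # sketch convention: a constant series has no correlation
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, a series and its square (for positive values) have a Spearman correlation of +1 even though their relationship is not linear.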

In our evolutionary ecosystem the effect of the diversity measure on fitness is not immediate. There must be a time shift in the series at which the effect on fitness of having increased or decreased diversity would be noticed. Also, because we did not determine which attribute is the cause of the other, we calculate the correlation in both shift directions. We compute the Spearman correlation coefficient between these two time series for every possible shift between -s and +s time steps. That is, we correlate the entropy at time t with the fitness at time t + s_i, where s_i ranges from -s to +s. Although there are many factors other than genetic diversity that might affect fitness, we found a strong correlation between entropy and fitness for all prey species. We present the cross-correlation charts for some prey species in Fig. 1. The x-axis in these charts represents the different shifts of the time series. The y-axis represents the cross-correlation value at the corresponding shift. From the figure we see that not only do different species have different cross-correlation values, but the same species also correlates differently depending on the time shift.

Fig. 1. Correlation values between entropy and fitness for different prey species. The x-axis represents the different time shifts; the y-axis represents the correlation values.

Note that the dynamic environment, co-evolution and parameters changing with time all affect species behavior. Thus, correlation values for the same species might vary with time and through the course of evolution, which is feasible to study in our model but not in biological experiments. This encouraged us to add a time frame to the two series and measure the correlation within that specific time frame. At each time step t we calculate entropy and fitness for all existing prey species. We then split these time series into sliding windows of 200 time steps, one starting at each time step. Within each window we calculate all possible correlations with the different shifts between -s and +s. We then choose the correlation value with the highest magnitude (whether positive or negative) and assign it to the prey species at that time step.
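The shifted and windowed search described above can be sketched as follows. The correlation measure is passed in as `corr`, any function that returns a coefficient for two equal-length sequences (a Spearman implementation in the setting of this paper). How the shifted series are aligned at the window boundaries is not specified in the text; here the non-overlapping ends are simply dropped, which is an assumption, as are all function names.

```python
def shifted_pairs(entropy, fitness, shift):
    """Pair entropy[t] with fitness[t + shift], dropping out-of-range steps."""
    n = len(entropy)
    if shift >= 0:
        return entropy[: n - shift], fitness[shift:]
    return entropy[-shift:], fitness[: n + shift]


def best_shifted_correlation(entropy, fitness, max_shift, corr):
    """Return (shift, value) with the largest |correlation| over -s..+s."""
    best_shift, best_value = 0, 0.0
    for shift in range(-max_shift, max_shift + 1):
        x, y = shifted_pairs(entropy, fitness, shift)
        if len(x) < 3:
            continue  # too few overlapping points to correlate
        value = corr(x, y)
        if abs(value) > abs(best_value):
            best_shift, best_value = shift, value
    return best_shift, best_value


def windowed_best_correlation(entropy, fitness, window, max_shift, corr):
    """Slide a window over both series; one (shift, value) pair per window."""
    results = []
    for start in range(0, len(entropy) - window + 1):
        e = entropy[start : start + window]
        f = fitness[start : start + window]
        results.append(best_shifted_correlation(e, f, max_shift, corr))
    return results
```

A call such as `windowed_best_correlation(entropy, fitness, 200, 25, corr)` corresponds to the window of 200 and maximum shift of 25 used in the text. A positive best shift means that entropy leads fitness, a negative one that fitness leads entropy.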

We present the results of 5 different runs of the simulation, each running for 16,000 time steps and generating around 110,000 instances on average. We assign three different classes to the correlation values. Correlation values between -0.5 and 0.5 are in class WEEK CORR, which represents the situation where there is either no or weak correlation. Correlation values above 0.5 are high positive (HIGHP) and correlation values below -0.5 are high negative (HIGHN). We calculate these correlation classes for all instances (an instance being one species at one time step) in every run and present the percentage of each class with a window of 200 and a maximum shift of 25 in both directions. On average over the 5 runs, 26.8%, 38.4% and 34.6% of the instances fall in classes HIGHP, HIGHN and WEEK CORR respectively.
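The class assignment above amounts to a simple thresholding rule, sketched below. The class labels are the ones used in the text; the function names and the `threshold` parameter are illustrative. Values exactly at -0.5 or 0.5 are placed in WEEK CORR, following the wording "between -0.5 and 0.5".

```python
from collections import Counter


def correlation_class(value, threshold=0.5):
    """Map a correlation value to HIGHP, HIGHN or WEEK CORR."""
    if value > threshold:
        return "HIGHP"
    if value < -threshold:
        return "HIGHN"
    return "WEEK CORR"


def class_percentages(values, threshold=0.5):
    """Percentage of instances falling in each correlation class."""
    counts = Counter(correlation_class(v, threshold) for v in values)
    total = len(values)
    return {
        label: 100.0 * counts[label] / total
        for label in ("HIGHP", "HIGHN", "WEEK CORR")
    }
```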

We investigate variations in the window and shift values to better tune our model. A window of 200 and a maximum shift of 20 in both directions gave, on average over the 5 runs, 17%, 29.6% and 53.4% for the HIGHP, HIGHN and WEEK CORR correlation classes respectively. Increasing the window and maximum shift to 400 and 50 was also tested. The average percentages were 23.7%, 27.5% and 48.8% for the HIGHP, HIGHN and WEEK CORR classes respectively. Increasing the shift values increases the percentage of high correlation instances, as more time is needed to detect an increase in fitness after an increase in genetic diversity. Also note that increasing the window does not necessarily increase the share of high correlation values, as some fluctuations in the entropy or fitness time series could exist. The values of the shift that lead to the highest correlation values were also examined. We found that, on average over the 5 runs, 37.7% of instances obtained their highest correlation from a positive shift between 10 and 25. In addition, 38.7% of instances obtained their highest correlation from a negative shift between -10 and -25. This shows that in more than 76% of the cases it takes between 10 and 25 time steps to see the effect of genetic diversity on fitness or vice versa. These values correspond roughly to 1 to 3 generations, which seems a reasonable time to observe the effect of genetic variations in a population.

From the above discussion we observe high values for both negative and positive correlations. These results support the claim that genetic diversity has a great influence on the well-being of species. High positive correlation values mean that an increase in genetic diversity results in an increase in species fitness. The effect on fitness is not immediate, so the time shift we defined helps in detecting the correlation between the two values across the time gap. There are many ways to interpret these results. A newly forming species with a small population would gradually tend to increase its genetic diversity, which then correlates positively with fitness. It is worth mentioning that individuals in EcoSim adapt to their constantly changing environment. This adaptation could be mirrored in an increase in the similarity of the species' FCMs (and thus a decrease in entropy), as a new behavior of interest for the current environment is discovered and then diffuses through the population. Negative correlations imply that a species decreases its diversity in order to reach stability, learn from the environment and reduce randomness. Our motivation to validate these results and further investigate the reasons behind these correlation values encouraged us to build a classifier. The interest of building this classifier is first to see whether some specific species properties can predict the current evolutionary behavior of a species (that is, whether it is learning from the environment or increasing its diversity to be able to react to a future change in the environment). It can also help to understand which factors and conditions affect the evolutionary behavior. Therefore, we try to infer the correlation value from some features of the species. If we are able to correctly classify unknown instances with a trained classifier, it would validate our correlation results.

5 Building Classifier for Inference

In order to validate the high correlation values found between entropy and fitness, we make use of machine learning classifiers. We built a classifier to infer the correlation class using decision trees. We use the C4.5 algorithm [9] with pruning, as implemented in the WEKA data mining environment [17], to build a decision tree that serves as the classifier. C4.5 is a powerful tool which also provides decision rules that can help in the interpretation of the classifier.
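To give an idea of how a C4.5 decision tree chooses its tests, the following sketch computes the gain ratio of a binary split on a numeric feature, the criterion C4.5 uses to compare candidate splits. It is a simplified illustration only: it omits pruning and the other refinements of the full algorithm, and it is not the WEKA implementation used in this work. The function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    n = len(labels)
    # Expected class entropy after the split, weighted by branch size.
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    # Split information: entropy of the branch sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info
```

For instance, with feature values 1, 2, 3, 4 and labels HIGHP, HIGHP, HIGHN, HIGHN, the threshold 2 separates the classes perfectly and obtains a higher gain ratio than the threshold 1.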


In order to build a classifier we had to choose the features that would best describe the species and have a direct effect on the species fitness. We choose features from both internal and physical concepts. These features are: the number of individuals in the species, the time step, the average age of the individuals in the species, the average speed of the individuals and their average energy level. The average number of reproduction events, the average number of failed reproduction events and the spatial dispersal (the spatial diversity of the species distribution) are also included. In addition, we include the average activation levels of fear, hunger, satisfaction, nuisance and curiosity (which encourages individuals to move). Finally, we include the entropy and fitness of each species. In total we have 16 features including the class variable, which is the correlation with values HIGHN, HIGHP and WEEK CORR. Our next step was to select the best features from these 16 in order both to simplify the model and to extract knowledge from the selection of specific features.

5.1 Feature Selection

To increase the quality of the classifier we use feature selection in order to extract the most important features from the above list. This step provides more insight into which features most influence the value of the correlation. We use a wrapper feature selection method [18] [11] based on an estimation of distribution algorithm (EDA) called CMSS-EDA [12]. Since CMSS-EDA does not impose a small fixed upper bound on the number of variables on which each variable depends, we are able to find the most relevant features using this approach even when there are many dependencies between them. Each subset of features is encoded as a bit-string, and we search for the subset of variables which maximizes the AUC (Area Under the ROC Curve) obtained by a Bayesian network classifier.
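The bit-string encoding of feature subsets in a wrapper method can be sketched as follows. This sketch replaces CMSS-EDA with plain exhaustive search, which is only practical for a small number of features, and it leaves the evaluation to a caller-supplied `score` function (hypothetical here; in the setting above it would train a Bayesian network classifier on the selected features and return its AUC). The function names are illustrative.

```python
from itertools import product


def decode(bits, feature_names):
    """Turn a bit-string such as (1, 0, 1) into the selected feature names."""
    return [name for bit, name in zip(bits, feature_names) if bit]


def best_subset(feature_names, score):
    """Exhaustively search all non-empty subsets and keep the best-scoring one.

    `score` is a caller-supplied function that evaluates a classifier trained
    on the given features, e.g. by returning its AUC. Exhaustive search stands
    in for the EDA and only scales to a small number of features.
    """
    best_bits, best_score = None, float("-inf")
    for bits in product((0, 1), repeat=len(feature_names)):
        if not any(bits):
            continue  # skip the empty subset
        value = score(decode(bits, feature_names))
        if value > best_score:
            best_bits, best_score = bits, value
    return decode(best_bits, feature_names), best_score
```

An EDA explores the same bit-string space but samples candidate subsets from a probabilistic model that is re-estimated from the best candidates of each generation, instead of enumerating every subset.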

The best features chosen are population size, entropy, fitness, spatial dispersal, age and reproduction failure. We ran the feature selection algorithm on all 5 runs and they all yielded the same best features. This shows the stability of the simulation, which is important in order to discover meaningful generic rules. Clearly, entropy and fitness are chosen among the best features as they are the two features being correlated and consequently have a direct effect on the correlation class variable. The fitness and entropy values also determine the sign of the correlation, positive or negative. This is not a bias in our analysis, because what is measured here is how a specific value of either entropy or fitness, at a given time step, affects the future (or is affected by the past) correlation between fitness and entropy. The effect of population size, which was among the selected features, on fitness is a major subject of study in biology. Some studies showed that population size and genetic variation are strongly positively correlated with fitness [8]. Also, loss in fitness and genetic diversity was accompanied by a drop in population size in [10]. Furthermore, a positive correlation between genetic diversity, fitness and population size was shown in [16]. Another selected feature was spatial dispersal. It was also found that spatial dispersal is a very important factor in maintaining genetic diversity and consequently fitness [16]. The last two selected features are the average age and the average reproduction failure. Given the fitness definition we used, these two features clearly have a direct effect on the fitness value: the higher the average age of the species population, the higher its fitness. Also, a decrease in reproduction failure is accompanied by an increase in fitness. The similarity between the best features discovered by our system and the most significant biological features affecting genetic diversity and fitness is noticeable. Furthermore, the significance of the best features chosen highlights the validity of our calculations and of our finding of a strong correlation between genetic diversity and fitness in our system.

Table 1. Percentage of high positive, high negative, weak correlation and high correlation prey instances for five different runs.

Run        HIGHP    HIGHN    WEEK CORR    HIGH CORR
Run 1      13%      15.5%    71.5%        28.5%
Run 2      13.3%    17.6%    69.1%        30.9%
Run 3      11.3%    15.9%    72.8%        27.2%
Run 4      9.8%     11.4%    78.8%        21.2%
Run 5      11.8%    14.9%    73.3%        26.7%
Average    11.8%    14.9%    73.3%        26.7%

Table 2. Accuracy percentages for training and testing with the C4.5 classifier for 5 runs of the simulation.

Run        Train       Test accuracy    Average test accuracy    STD of test accuracy    Number
           accuracy    on same run      on other 4 runs          on other runs           of rules
Run 1      79.3%       80.3%            60.1%                    4.9                     294
Run 2      74.7%       75.3%            66.8%                    0.8                     307
Run 3      77.2%       78.1%            63.2%                    3.2                     280
Run 4      80.2%       80.2%            69.1%                    2.6                     181
Run 5      78%         78%              66.9%                    3.8                     263
Average    77.9%       78.4%            65.2%                    3.1                     265

5.2 Classification Results

We build a classifier using the C4.5 algorithm implemented in the WEKA environment. We use a window of 400 and fix the shift to 25 time steps for calculating the correlations. The reason is to have all instances on the same scale and thus comparable. Also, increasing the window beyond 400 subjects the fitness and entropy series to fluctuations, whereas decreasing the window tends to bias the results toward higher correlations. We choose 25 as the shift value based on the analysis of which shifts lead to the highest correlations. Table 1 presents the percentages for HIGHP, HIGHN, WEEK CORR and the sum of HIGHP and HIGHN, called HIGH CORR, for 5 runs of 16,000 time steps each, producing around 110,000 instances. The 6 features used for the model are the ones selected by the feature selection process. For each of the 5 runs, we split the instances into 80% for training the classifier with 10-fold cross validation and 20% for testing the pruned C4.5 model. Table 2 presents the training and testing accuracy on data from the same run. We also trained the classifier on data from one run and tested it on data from the other runs of the simulation to assess the generality of the model. The confusion matrix showed high true positive rates for training and testing on the same run. When testing on other runs, high true positive rates were obtained only where the accuracy was above 65%. This is due to the fact that each run has variations in terms of attribute values and ranges, and to possible overfitting. Thus, the model was able to discover some rules that make good predictions on unclassified instances. These accuracies are also much higher than those of random classification. The good classification accuracy on the same run shows the validity of our calculation of entropy as a measure of genetic diversity and of its high correlation with fitness. It also shows that there exist specific conditions of a species that lead to a positive or negative correlation between fitness and genetic diversity.
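The evaluation quantities mentioned above (accuracy, confusion matrix and true positive rate per class) can be computed as in the following sketch. It operates on lists of true and predicted class labels and is independent of the classifier; the function names are illustrative and not taken from WEKA.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def confusion_matrix(true_labels, predicted_labels):
    """Counts indexed by (true class, predicted class)."""
    return Counter(zip(true_labels, predicted_labels))


def true_positive_rate(matrix, label):
    """Share of instances of `label` that were predicted as `label`."""
    actual = sum(n for (t, _), n in matrix.items() if t == label)
    return matrix[(label, label)] / actual if actual else 0.0
```

Testing across runs then means computing these quantities with predictions from a classifier trained on one run and true labels taken from another run.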

6 Conclusion

In this paper we introduced the use of Shannon entropy as a measure of genetic diversity in an individual-based evolutionary ecosystem simulation. We found very high correlations, both negative and positive, between entropy and fitness. In order to validate our correlation results and further understand the reasons behind them, we built a classifier to predict the correlation class variable based on training and testing sets. We found high accuracy for the classification step, which proves the correctness of our genetic diversity measure and its correlation with fitness. In addition, we used feature selection to find the features that most affect the correlation values. We showed how these selected features are similar to the factors affecting genetic diversity and fitness in community ecology. The similarity between the results of five different runs of the simulation proves the stability of the simulation and the generality of our findings. This study allows us to examine the relation between genetic diversity and fitness and the factors behind this correlation. It also shows that this relation changes with time and with other features such as reproduction rate, population size and spatial dispersal. In the future we shall further investigate the values of these features and which values lead to negative or positive correlation, which could have a great impact on the community ecology domain.


Acknowledgments. This work is supported by the NSERC grant ORGPIN341854, the CRC grant 950-2-3617 and the CFI grant 203617, and is made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET: www.sharcnet.ca).

References

1. Aspinall, A., Gras, R.: K-means clustering as a speciation method within an individual-based evolving predator-prey ecosystem simulation. Proc. of the Active Media Technology, Lecture Notes in Comput. Sci. pp. 318–329 (2010)

2. Bergstrom, C., Lachmann, M.: Shannon information and biological fitness. In: IEEE Information Theory Workshop. pp. 50–54 (2004)

3. EcoSim: An Ecosystem Simulation: http://sites.google.com/site/ecosimgroup/research/ecosystem-simulation

4. Gras, R., Devaurs, D., Wozniak, A., Aspinall, A.: An individual-based evolving predator-prey ecosystem simulation using fuzzy cognitive map as behavior model. Artificial Life 15(4), 423–463 (2009)

5. Gras, R., Golestani, A., Hosseini, M., Khater, M., Farahani, Y.M., Mashayekhi, M., Ibne, S.M., Sajadi, A., Salehi, E., Scott, R.: EcoSim: an individual-based platform for studying evolution. European Conference on Artificial Life (2011), in press

6. Kosko, B.: Fuzzy cognitive maps. Int. Journal of Man-Machine Studies pp. 65–75 (1986)

7. Markert, J., Champlin, D., Gutjahr-Gobell, R., Grear, J., Kuhn, A., McGreevy, T., Roth, A., Bagley, M., Nacci, D.: Population genetic diversity and fitness in multiple environments. BMC Evolutionary Biology 10, 1471–2148–10–205 (2010)

8. Oostermeijer, J., van Eijck, M., den Nijs, J.: Offspring fitness in relation to population size and genetic variation in the rare perennial plant species Gentiana pneumonanthe (Gentianaceae). Oecologia 97, 289–296 (1994)

9. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers (1993)

10. Reed, D., Frankham, R.: Correlation between fitness and genetic diversity. Conserv. Biology 17, 230–237 (2003)

11. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)

12. Salehi, E., Gras, R.: Efficient EDA for large optimization problems via constraining the search space of models. In: Genetic and Evolutionary Computation Conference. Dublin, Ireland (2011), in press

13. Shannon, C.: A mathematical theory of communication. Bell System Technical Journal pp. 379–423 (1948)

14. Sherwin, W.B.: Entropy and information approaches to genetic diversity and its expression: Genomic geography. Entropy 12, 1765–1798 (2010)

15. Siegel, S.: Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York (1956)

16. Vandewoestijne, S., Schtickzelle, N., Baguette, M.: Positive correlation between genetic diversity and fitness in a large, well-connected metapopulation. BMC Biology 6, 1741–7007–6–46 (2008)

17. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, USA (2000)

18. Yang, Q., Salehi, E., Gras, R.: Using feature selection approaches to find the dependent features. In: 10th International Conf. on Artificial Intelligence and Soft Computing. pp. 487–494. LNAI (2010)



