Semantic search-based genetic programming and the effect of intron deletion



This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CYBERNETICS 1

Semantic Search-Based Genetic Programming and the Effect of Intron Deletion

Mauro Castelli, Leonardo Vanneschi, and Sara Silva

Abstract—The concept of semantics (in the sense of input–output behavior of solutions on training data) has been the subject of a noteworthy interest in the genetic programming (GP) research community over the past few years. In this paper, we present a new GP system that uses the concept of semantics to improve search effectiveness. It maintains a distribution of different semantic behaviors and biases the search toward solutions that have similar semantics to the best solutions that have been found so far. We present experimental evidence of the fact that the new semantics-based GP system outperforms the standard GP and the well-known bacterial GP on a set of test functions, showing particularly interesting results for noncontinuous (i.e., generally harder to optimize) test functions. We also observe that the solutions generated by the proposed GP system often have a larger size than the ones returned by standard GP and bacterial GP and contain an elevated number of introns, i.e., parts of code that do not have any effect on the semantics. Nevertheless, we show that the deletion of introns during the evolution does not affect the performance of the proposed method.

Index Terms—Generalization, genetic programming (GP), introns, semantics.

I. INTRODUCTION

Genetic programming (GP) [1] belongs to the artificial intelligence research area which is called evolutionary computation, and consists in the automated learning of computer programs mimicking the Darwinian principles of evolution. Its theoretical properties have been widely investigated in the past two decades, even though many open issues persist (see, for instance, [2]). In particular, the study of the GP dynamics has often been based on the genotypic properties (i.e., the syntactical structure) of the evolving solutions (also called individuals). While the study of genotypes can be useful to capture particular phenomena that are related to the evolutionary process, it is often not able to describe its entire dynamics. For this reason, over the past few years, the idea to incorporate semantic awareness in the GP process became popular (see, for instance, [3]–[7]).

Manuscript received April 19, 2012; revised July 13, 2012 and October 16, 2012; accepted February 12, 2013. This work was supported by National Funds through FCT under Contract Pest-OE/EEI/LA0021/2011 and Project PTDC/EIA-CCO/103363/2008. This paper was recommended by Associate Editor N. O. Attoh-Okine of the former IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews.

M. Castelli and L. Vanneschi are with the Instituto de Engenharia de Sistemas e Computadores-Investigação e Desenvolvimento em Lisboa, IST/UTL, Lisboa 1000-029, Portugal, and also with the Instituto Superior de Estatística e Gestão de Informação (ISEGI), Universidade Nova de Lisboa, Lisboa 1070-312, Portugal (e-mail [email protected]; [email protected]).

S. Silva is with the Instituto de Engenharia de Sistemas e Computadores-Investigação e Desenvolvimento em Lisboa, IST/UTL, Lisboa 1000-029, Portugal (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSMCC.2013.2247754

In these studies, the concept of solutions’ semantics refers to the set of their input–output pairs (or input–output behavior) on the training data (a terminology that will be adopted in this study too), and it is a common belief that integrating this kind of knowledge in the GP process can improve performance, extending the applicability of GP to problems that are difficult with purely syntactic approaches.

In this paper, a new GP system which is based on semantics (SEM-GP) is proposed. The idea consists in building, maintaining, and updating generation by generation a semantical distribution, biasing the search toward areas of the semantic space characterized by good fitness values. Although original, the idea is inspired by the recently proposed bloat control method which is called operator equalization (OpEq) [8]. In fact, OpEq maintains a distribution of size and biases the search toward sizes associated with good fitness values, among the ones evolved so far. The similarities between SEM-GP and OpEq, however, are limited to the acceptance criteria of new individuals produced by the genetic operators. Beyond this, the basic idea that characterizes the two methods is different: While OpEq builds and uses a distribution based on the syntax of the individuals, the study proposed here focuses on semantics.

In order to experimentally validate the proposed system, we consider a set of well-known noncontinuous test functions, comparing the performances of SEM-GP with the ones of standard GP (ST-GP) and a well-known GP variant called bacterial GP (BA-GP), which was introduced in [9]. Both ST-GP and BA-GP use genetic operators that are completely based on the solutions syntax.

Besides studying the effect of SEM-GP on the quality of the generated solutions (in terms of fitness), we also study their size, and we investigate the presence of introns. Finally, we present a study of SEM-GP with systematic intron deletion, showing its effects on the quality of the generated solutions.

This paper is organized as follows. In Section II, we review the state of the art, discussing previous and related research. In Section III, we present the SEM-GP method, describing and discussing its functioning in detail. Section IV presents our experimental study; in particular, it describes the test problems that we have used, it discusses in detail the experimental setting in order to make our experimental study completely replicable, and it discusses the obtained results. Finally, Section V concludes this paper.

II. PREVIOUS AND RELATED WORK

A. Previous Attempts of Improving Genetic Programming

The success obtained by GP since its early years is undeniable [1], [2]. Nevertheless, it is nowadays a common opinion that, in order to tackle complex applications, the original definition of GP (often called vanilla GP for the simplicity of its functioning, or standard GP like in this paper) has to be improved. The great versatility of the evolutionary process that governs GP makes the task of refining it a particularly viable, although challenging, one [10].

2168-2267/$31.00 © 2013 IEEE

One of the first efforts in this direction was inspired by the idea that what makes a problem complex and, thus, difficult to solve is usually the huge amount of information available as data (large dimensions of the feature space). Typically, not all the available information gives insight into the problem, and choosing the useful one (feature selection) is an extremely important task. The idea to select relevant features using GP was presented in [11]. The idea of simultaneously evolving solutions and selecting the most promising features by means of GP was further investigated, with extremely interesting results, in [12].

The authors of [13] go even beyond this idea. In fact, they present a GP system in which not only important features are selected and irrelevant ones disregarded, but also available features are combined to generate brand new, and potentially more insightful, ones (feature generation). The authors of [13] state that GP is a reasonable candidate for feature generation, given its ability to synthesize complex functions using simple input data. This fact is confirmed in [14], where feature generation by means of GP is used to tackle complex object recognition applications.

Another straightforward idea to improve GP comes from the intrinsically parallel nature of evolution, where large sets of individuals are processed at the same time. Parallel and distributed versions of GP have been defined so far. For instance, in [15] a distributed system for knowledge discovery in data mining was presented. This idea was pursued in [16], where a distributed GP system was used to specialize the search into several subtasks. In particular, a large amount of information is managed by partitioning huge datasets into hierarchic, dynamic subsets that are processed in parallel. In the same vein, in [17] the entire evolutionary process is partitioned into two main stages, specialized for solving particular problem parts or instances, and executed in parallel.

Another interesting characteristic of GP, that distinguishes it from many other machine-learning methods, is the possibility to define a fitness function. This task, even though complex in some cases, clearly contributes to GP versatility, allowing us to direct the search in several different directions. In [18], GP is used to dynamically develop new fitness functions to improve the power of the evolutionary search.

Besides the definition of the fitness, another contribution to GP versatility is given by the possibility to modify the representation used to code candidate solutions, adapting it to the problem characteristics. It is, then, natural to introduce new and potentially more descriptive representations. In [19], the traditional tree-based representation which was used by GP is extended to a more general and sophisticated graph-based one. The authors show that this new representation is able to solve more complex problems faster than the traditional one.

Another important component of GP is represented by the genetic operators that are used to explore the search space. The idea of improving genetic operators was developed in [20], where a distance measure was defined and used to model the behavior of several genetic operators.

The main difference between the approaches discussed so far and the one presented in this paper is that all the approaches discussed so far explore the search space using operators that are based on the syntactic structure of the evolving individuals, like in standard GP. Over the past few years, several approaches aimed to exploit semantics to improve GP have been presented, and this paper is framed in this perspective. Previous studies on semantics in GP are discussed in the next section.

B. Semantics in Genetic Programming

Semantic analysis methods made their appearance for the study and modeling of crossover. McPhee et al. [4] used truth tables to analyze behavioral changes in crossover for Boolean problems. As an alternative to representing behavior using truth tables, reduced ordered binary decision diagrams (ROBDDs) were used in [19] to measure behavioral difference between parents and offspring of crossover. In the same vein, in [21] semantics is used to define an algorithm called semantically driven crossover (SDC). The SDC algorithm has been developed based on the analysis of the behavioral changes caused by crossover. The key feature of this method is the use of a canonical representation of members of the population (ROBDDs) to check for semantic equivalence without having to access the fitness function. This is used to determine which participating individuals are copied to the next generation. If the offspring are semantically equivalent to their parents, the children are discarded, and the crossover is restarted. The authors argue that this results in increased semantic diversity in the evolving population, and a consequent improvement in the GP performance. Since the previous approaches work only on problems characterized by discrete values, in [5] a method to incorporate semantic information into GP crossover operators for real-valued symbolic regression problems is proposed. In particular, the authors aim to incorporate semantics into the design of new crossover operators so as to maintain greater semantic diversity, and provide higher locality (continuity, i.e., small changes in genotype corresponding to small changes in phenotype) than standard crossover. Recently, Krawiec and Lichocki proposed a way to measure the semantics of an individual based on fitness cases [7]. The semantics of an individual was defined as a vector in which each element is the output of the individual at the corresponding input fitness case. This definition, which we adopt in this paper as well, was used to guide crossover in a method which is known as approximating geometric crossover (AGC). In AGC, a number of children are generated at each crossover, the children most similar to their parents (in terms of semantics) being added to the next generation. In [22], semantics is used to test the effects of behavioral control at the point of the mutation operator. Using semantic analysis, the authors present a technique which is known as semantically driven mutation (SDM).

The contributions mentioned so far differ from the one presented here mainly because they have been introduced with the explicit task of studying or improving standard crossover or mutation. On the other hand, the approach that we propose in this paper is more general, since it can be applied to any kind of genetic operator.

III. SEMANTIC NICHING METHOD

The main idea of the proposed “semantic niching” method is to use a semantic distribution to guide the evolutionary process. This distribution is used to direct the algorithm toward solutions that are semantically close to the best solution found so far by GP. Before explaining how the method works, we introduce some preliminary concepts.

A. Preliminary Concepts

Although there is no firm consensus around the formalization of this term, we may roughly define the expression semantics as a concise and minimal (irreducible) description of what the program is doing, expressed in simpler terms than the original GP tree (for the sake of this study, we identify a GP program with a tree, like in [1]). As GP programs usually process some input data, semantics is often defined with respect to them. In this paper, we rely on the definition of program semantics inspired by the work in [23] and [4]. Strengths (and weaknesses) of this definition are reported in [7]. In the proposed SEM-GP method, program semantics is captured using an $n$-dimensional vector $w$ of program outputs on the given set of training input data (where $n$ is the size of the training dataset); this is called the semantic vector. If we consider a training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of input–output pairs, and $p(x_i)$ the output from the execution of program $p$ with input $x_i$, then $w_i = p(x_i)$. The function $F$ maps a semantic vector $w$ into a scalar value via the use of an error-based fitness function that takes the form of average Canberra distance [24]:

$$F(w, T) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{abs}(y_i - w_i)}{\mathrm{abs}(y_i) + \mathrm{abs}(w_i)} \tag{1}$$

where abs returns the absolute value of its argument, and $w_i$ and $y_i$ are the predicted and target values, respectively, for input $x_i$.

Many other distance measures could be used alternatively, but in this paper we have chosen to use the Canberra distance in order to have fitness values always contained in the range [0, 1]. This simplifies our experimental study and makes the results more intuitive.
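As a minimal sketch of (1), the following Python fragment computes a semantic vector and its average Canberra error. The names `semantic_vector` and `canberra_fitness`, the toy data, and the choice to count a term with $y_i = w_i = 0$ as zero error are assumptions made for illustration; the text does not say how that 0/0 case is handled.

```python
from __future__ import annotations

from typing import Callable, Sequence


def semantic_vector(program: Callable[[float], float], inputs: Sequence[float]) -> list[float]:
    """Return w, the outputs of `program` on every training input (w_i = p(x_i))."""
    return [program(x) for x in inputs]


def canberra_fitness(w: Sequence[float], targets: Sequence[float]) -> float:
    """Average Canberra distance between outputs w and targets y, as in (1).

    Assumption: a term with y_i = w_i = 0 contributes zero error.
    """
    if len(w) != len(targets):
        raise ValueError("w and targets must have the same length")
    total = 0.0
    for w_i, y_i in zip(w, targets):
        denom = abs(y_i) + abs(w_i)
        if denom > 0.0:
            total += abs(y_i - w_i) / denom
    return total / len(w)


# Hypothetical toy data: target y = 2x, candidate program p(x) = x + 1.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = semantic_vector(lambda x: x + 1.0, xs)
error = canberra_fitness(w, ys)  # every term lies in [0, 1], so the mean does too
```

Because each term of the sum is bounded by 1, the returned value always falls in [0, 1], which is the property the text cites as the reason for choosing this distance.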

Sampling of program semantics in an evolving population of computer programs can be associated with a probability density function (pdf). This pdf determines which search space areas are sampled throughout the evolutionary run, and is adaptive as new programs are sampled in every generation and more information is gathered about the merit of certain program semantic areas in the semantic space. A histogram is used to represent the semantic pdf. This is obtained by dividing the program semantic space into bins called semantic niches ($sn$) and approximating the density at each semantic niche using a number called the capacity, which is inversely proportional to the average training error of the programs belonging to a semantic niche. A semantic niche is itself represented by an $n$-dimensional vector $w$. In order to define the “boundaries” of a semantic niche $sn$ represented by vector $w_{sn}$ and allocate a program $p$ with semantics $w_p$ to $sn$, a threshold value $\varepsilon$ on the average Canberra distance between vectors $w_{sn}$ and $w_p$ is used, that is

$$d(w_{sn}, w_p) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{abs}(w_{sn_i} - w_{p_i})}{\mathrm{abs}(w_{sn_i}) + \mathrm{abs}(w_{p_i})} < \varepsilon. \tag{2}$$

This determines the semantic similarity between the niche and the program. As we show later, a threshold-based semantic similarity allows programs of different fitness to belong to the same semantic niche. In the case of error minimization, the lower the average training error of the individuals belonging to a semantic niche, the higher the niche’s capacity. The purpose of the target semantic distribution is to impose a probability to sample programs that are similar to certain semantic niches. The probability to sample a program similar to a semantic niche is given by the capacity of the semantic niche divided by the total number of programs in the population. This exerts a bias toward sampling of programs that are semantically similar to “highly probable” semantic niches. At the same time, the distribution takes care of the spread and diversity of semantics in the population, and further acts as a mechanism of semantic niching.
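The threshold test of (2) and the sampling probability described above can be sketched as follows. This is only an illustration: the function names, the threshold value 0.05, and the example vectors are hypothetical, and a 0/0 term is again assumed to contribute zero.

```python
from __future__ import annotations

from typing import Sequence


def canberra_distance(w_a: Sequence[float], w_b: Sequence[float]) -> float:
    """Average Canberra distance between two semantic vectors, as in (2)."""
    total = 0.0
    for a, b in zip(w_a, w_b):
        denom = abs(a) + abs(b)
        if denom > 0.0:  # assumption: a 0/0 term counts as zero
            total += abs(a - b) / denom
    return total / len(w_a)


def is_similar(w_sn: Sequence[float], w_p: Sequence[float], epsilon: float) -> bool:
    """True if program semantics w_p falls inside the niche represented by w_sn."""
    return canberra_distance(w_sn, w_p) < epsilon


def sampling_probability(capacity: int, population_size: int) -> float:
    """Niche capacity divided by the total number of programs in the population."""
    return capacity / population_size


# Hypothetical niche and two candidate programs.
w_niche = [1.0, 2.0, 3.0]
close = is_similar(w_niche, [1.1, 2.0, 2.9], epsilon=0.05)   # small deviations
far = is_similar(w_niche, [3.0, 2.0, 1.0], epsilon=0.05)     # reversed outputs
prob = sampling_probability(25, 100)
```

Two programs that both pass the threshold can still have different errors against the targets, which is why a single niche may hold programs of different fitness.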

B. SEM-GP: The Algorithm

In the beginning of the evolutionary search, a new semantic niche $sn_i$ is created for every program $p_i$ of the initial population. Each $sn_i$ is represented by the same vector $w_{sn_i}$ that represents the semantics $w_{p_i}$ of $p_i$. In order to enforce semantic diversity, it is required that no semantically similar individuals are generated during the initialization process. For this purpose, an iterative process is implemented that discards every newly created individual if it is found to be semantically similar to one of the already generated programs. As discussed earlier, semantic similarity is governed by a threshold value on the average Canberra distance between two semantic vectors $w_{p_1}$ and $w_{p_2}$, where $p_1$ and $p_2$ are their respective programs. The process iterates as many times as necessary to create the initial population, in light of offspring rejections that take place in the attempt to enforce semantic diversity. At the end of the initialization process, each semantic niche has a capacity equal to one.
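The rejection loop described in this paragraph might look like the sketch below. To keep it self-contained, random linear functions stand in for trees built with ramped half-and-half; `random_linear_program`, `init_population`, the `max_attempts` guard, and the value of the threshold are all hypothetical choices, not part of the described system.

```python
from __future__ import annotations

import random
from typing import Callable, Sequence

Program = Callable[[float], float]


def canberra_distance(w_a: Sequence[float], w_b: Sequence[float]) -> float:
    """Average Canberra distance; a 0/0 term is assumed to count as zero."""
    total = 0.0
    for a, b in zip(w_a, w_b):
        denom = abs(a) + abs(b)
        if denom > 0.0:
            total += abs(a - b) / denom
    return total / len(w_a)


def random_linear_program(rng: random.Random) -> Program:
    """Stand-in for a random GP tree: p(x) = a*x + b with random a and b."""
    a, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    return lambda x: a * x + b


def init_population(pop_size, inputs, epsilon, make_program, rng, max_attempts=100_000):
    """Create programs until pop_size semantically distinct ones are found."""
    population = []  # list of (program, semantic vector); one niche per entry
    attempts = 0
    while len(population) < pop_size:
        if attempts >= max_attempts:
            raise RuntimeError("could not build a semantically diverse population")
        attempts += 1
        p = make_program(rng)
        w_p = [p(x) for x in inputs]
        # Discard the newcomer if it is similar to any program kept so far.
        if all(canberra_distance(w_p, w_q) >= epsilon for _, w_q in population):
            population.append((p, w_p))
    return population


rng = random.Random(0)
inputs = [0.1 * i for i in range(1, 11)]
population = init_population(20, inputs, 0.01, random_linear_program, rng)
capacities = [1] * len(population)  # every initial niche has capacity one
```

The attempt limit is a practical safeguard added here: with a large threshold the loop could otherwise run for a very long time before finding enough distinct semantics.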

SEM-GP uses a generational replacement strategy by which an intermediate population is created through the application of variation operators, and at the end of this process the intermediate population replaces the current population. The creation of an intermediate population is described in the pseudocode of Algorithm 1. The main constituent elements are as follows.

1) A procedure that updates the semantic distribution by reallocating individuals in semantic niches and calculating the capacities for the next generation (depicted in Algorithm 3).

2) A procedure that handles an offspring’s acceptance in the intermediate population, forcing the sampling of new programs to be governed by the semantic distribution (depicted in Algorithm 2).


Prior to the creation of the intermediate population, the semantic distribution is updated. The update consists of two steps.

1) In the current generation $g$, reallocate individuals to semantic niches in the case where new semantic niches were created during generation $g - 1$. In fact, an individual $p_i$ with semantic vector $w_{p_i}$ initially belonging to semantic niche $sn_i$ with vector $w_{sn_i}$ may be allocated to the newly created niche $sn_k$ with vector $w_{sn_k}$ in the case where $\varepsilon > d(w_{p_i}, w_{sn_k}) > d(w_{p_i}, w_{sn_i})$, where $d(w_1, w_2)$ computes the average Canberra distance (2) between semantic vectors $w_1$ and $w_2$, and where $\varepsilon$ is the semantic similarity threshold.

2) Set the capacities of semantic niches to a number that is inversely proportional to the average training error of the programs that belong to certain niches. The training error $e_p$ of a program $p$ is calculated using the error-based fitness function of (1). Given that $e_p$ is normalized within the interval [0, 1], we define training fitness $f_p = 1.0 - e_p$, and the average training fitness of programs belonging to a niche is $f_{sn}$. The capacity of a semantic niche can now be set proportional to $f_{sn}$. That is, the capacity $c_{sn}$ of a niche $sn$ is calculated as $c_{sn} = \lfloor n \times f_{sn} / \sum_{i=1}^{M} f_{sn_i} \rfloor$, where $f_{sn_i}$ is the average training fitness of semantic niche $sn_i$, $f_{sn}$ is the average training fitness of individuals in semantic niche $sn$, $n$ is the population size, and $M$ is the number of semantic niches. Alternatively, it can be calculated as $c_{sn} = \lfloor n \times (1 - e_{sn} / \sum_{i=1}^{M} e_{sn_i}) \rfloor$, where $e_{sn}$ is the average training error of individuals in niche $sn$ and $e_{sn_i}$ is the average training error of the individuals allocated to $sn_i$.
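The fitness-based capacity formula of step 2 can be illustrated with a short sketch. It assumes that the bracket around the expression denotes a floor, and the function name and the example fitness values are hypothetical.

```python
import math
from typing import List, Sequence


def niche_capacities(avg_fitness: Sequence[float], population_size: int) -> List[int]:
    """Capacity of each niche, proportional to its average training fitness.

    avg_fitness[i] is f_sn for niche i (f = 1 - average Canberra error).
    Assumption: the result is rounded down to an integer.
    """
    total = sum(avg_fitness)
    if total <= 0.0:
        raise ValueError("at least one niche must have positive fitness")
    return [math.floor(population_size * f / total) for f in avg_fitness]


# Hypothetical example: three niches, population of 100 programs.
capacities = niche_capacities([0.9, 0.7, 0.5], population_size=100)
```

Under the floor assumption the capacities can add up to slightly less than the population size (here 42 + 33 + 23 = 98), so an implementation would need some rule for the remaining slots; the text does not specify one.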

The offspring acceptance criterion biases the creation of every offspring according to the current state of the semantic distribution. The procedure is given as pseudocode in Algorithm 2. The first step is to extract the program semantics using vector $w_p$, and determine semantic similarities with any of the existing semantic niches. Given a set $SN = \{sn_1, sn_2, \ldots, sn_k\}$ of already created semantic niches represented by vectors $\{w_{sn_1}, w_{sn_2}, \ldots, w_{sn_k}\}$, $\forall sn_i \in SN$ we calculate the average Canberra distance $d(w_p, w_{sn_i})$. Considering the set of distances $D = \{d(w_p, w_{sn_1}), d(w_p, w_{sn_2}), \ldots, d(w_p, w_{sn_k})\}$, let $d^{*}$ be the minimal distance in $D$ and $sn^{*}$ the semantic niche for which $d(w_p, w_{sn^{*}}) = d^{*}$. The offspring belongs to semantic niche $sn^{*}$ if $d^{*}$ is within a threshold $\varepsilon$ set a priori. In the other case, where $d^{*}$ is greater than the acceptance threshold, the offspring does not belong to any of the existing semantic niches of the set $SN$. The procedure is depicted in the pseudocode of Algorithm 5 and Algorithm 4.
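The nearest-niche lookup described in this paragraph can be sketched as follows. This is not a transcription of Algorithms 4 and 5, which are not reproduced in the text; `nearest_niche`, the example vectors, and the threshold 0.05 are hypothetical, and "within the threshold" is read as a strict inequality, as in (2).

```python
from __future__ import annotations

from typing import Optional, Sequence, Tuple


def canberra_distance(w_a: Sequence[float], w_b: Sequence[float]) -> float:
    """Average Canberra distance; a 0/0 term is assumed to count as zero."""
    total = 0.0
    for a, b in zip(w_a, w_b):
        denom = abs(a) + abs(b)
        if denom > 0.0:
            total += abs(a - b) / denom
    return total / len(w_a)


def nearest_niche(
    w_p: Sequence[float],
    niche_vectors: Sequence[Sequence[float]],
    epsilon: float,
) -> Tuple[Optional[int], float]:
    """Return (index of the closest niche or None, minimal distance)."""
    if not niche_vectors:
        raise ValueError("at least one niche is required")
    distances = [canberra_distance(w_p, w_sn) for w_sn in niche_vectors]
    k = min(range(len(distances)), key=distances.__getitem__)
    # The offspring belongs to niche k only if the minimal distance is below epsilon.
    return (k if distances[k] < epsilon else None), distances[k]


niches = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
index, d_min = nearest_niche([1.1, 2.0, 2.9], niches, epsilon=0.05)
outsider, _ = nearest_niche([10.0, 10.0, 10.0], niches, epsilon=0.05)
```

Returning `None` for the niche index signals the case in which the offspring matches no existing niche, so the caller can decide whether a new niche should be created.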

In the case where the offspring is not semantically similar to any of the existing niches, a new semantic niche representative of this offspring’s semantics is created if and only if the offspring is the best individual of the run; otherwise, the offspring is rejected. If, on the other hand, the offspring is found to be semantically similar to one of the existing niches $sn_i$, then the offspring is accepted if one of the following two conditions holds.

1) $sn_{size} < sn_{capacity}$: In order to manage the allocations to niches, every niche is associated with a niche size that is incremented when a program is allocated to the niche. All niche sizes are initialized to zero after the update of each niche’s capacity that takes place in the beginning of the process of intermediate population creation.

2) $sn_{size} > sn_{capacity}$, and the offspring is the best performing program of the niche.
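The acceptance rule can be summarized in a short sketch. It is not Algorithm 2 itself: the `Niche` fields, the function name `try_accept`, the use of strict comparisons for "best", the treatment of a full niche (size equal to capacity) like an over-full one, and the decision to count an over-capacity acceptance in the niche size are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Niche:
    capacity: int                     # c_sn, set when the distribution is updated
    size: int = 0                     # reset to zero after each capacity update
    best_error: float = float("inf")  # lowest training error allocated so far


def try_accept(error: float, niche: Optional[Niche], best_error_of_run: float) -> bool:
    """Return True if the offspring enters the intermediate population."""
    if niche is None:
        # Not similar to any niche: keep it only if it is the best of the run
        # (the caller would then create a new niche around its semantics).
        return error < best_error_of_run
    # Accept while there is room, or when the offspring beats the niche's best.
    if niche.size < niche.capacity or error < niche.best_error:
        niche.size += 1
        niche.best_error = min(niche.best_error, error)
        return True
    return False


niche = Niche(capacity=1)
first = try_accept(0.30, niche, best_error_of_run=0.10)    # room available
second = try_accept(0.40, niche, best_error_of_run=0.10)   # full, not the best
third = try_accept(0.20, niche, best_error_of_run=0.10)    # full, but best of niche
new_best = try_accept(0.05, None, best_error_of_run=0.10)  # opens a new niche
rejected = try_accept(0.50, None, best_error_of_run=0.10)  # no niche, not the best
```

Errors are used instead of fitness values here, so "best" means lowest average Canberra error.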

In summary, SEM-GP starts with a semantically diverse population of programs. Every application of a variation operator has the potential of an exploratory move in the search space given that no assumptions are made about the causality of program representation, and no special variation operators are employed other than the ones defined in standard GP practice. SEM-GP’s mechanism of population replacement based on the offspring acceptance criterion of Algorithm 2 ensures that an exploratory move in the space is only performed if an offspring maps to the point of highest fitness attained so far in fitness space $F$. In any other case, a bias toward local exploitations of program semantics is exerted, and this bias is governed by an adaptive distribution that attempts to allocate the program sampling effort around semantic state space areas in proportion to their fitness.

IV. EXPERIMENTAL STUDY

A. Test Problems

In this section, we report the problems that are used in the experimental phase. Five regression problems have been chosen. The test functions used are reported in Fig. 1. The choice of test functions was based on their previous use as benchmark problems in the study of GP systems (see, for example, [25] and [26]). We are aware that these test functions are bidimensional and, thus, do not represent real-life applications (typically characterized by many features and, thus, multidimensional); nevertheless, as reported in [26] at page 8: “The aim of this set of experiments is to demonstrate the practical implications of the use of the [method] studied here. Being of low dimensionality does not make the problems easy however. Many of the problems above mix trigonometry with polynomials, or make the problems in other ways highly nonlinear.” Ranges are denoted using (start : step : stop). Random numbers are denoted using rand(min, max), which defines uniform random sampling in the range. In the second phase of our experiments, we also considered real-life problems that are characterized by a high number (greater than 100) of features. Results obtained with these problems are similar to the ones obtained with the test functions that are shown in Fig. 1; hence, we do not report them here.

B. Experimental Settings

For every test problem, 30 runs of the considered techniques have been performed. All the runs used populations of 100 individuals for ST-GP and SEM-GP and 25 individuals for BA-GP. Different settings for the different studied GP variants have been obtained by means of a preliminary experimental phase aimed at tuning the parameters of each one of them independently. The evolution stopped after 40 000 fitness evaluations for all the GP variants. Tree initialization was performed with the ramped half-and-half method [1] with a maximum initial depth of 6. The function set contained the four binary operators +, −, ∗, and / protected as in [1]. The terminal set contained 1 variable and 100 random constants in the range [0, 1]. Tournament selection was used with a tournament size of 4. The reproduction (replication) rate was 0.1, meaning that each selected parent had a 10% chance of being copied to the next generation instead of being engaged in breeding. Standard tree mutation and standard crossover (with uniform selection of crossover and mutation points among different tree levels) were used with probabilities of 0.1 and 0.9, respectively. The new random branch created for mutation had a maximum depth of 6. Selection for survival was elitist, and only the best individual was inserted in the new population. No maximum tree depth was imposed. Regarding the mutation and crossover for BA-GP (which does not apply any selection method to candidates), the parameters are the number of clones $N_{clones}$ and the number of infections $N_{inf}$. For a description of these parameters, see [9]. In our experiments, we used $N_{clones} = 4$ and $N_{inf} = 5$. The fitness of the individuals is the Canberra distance [24] between outputs and targets normalized between 0 and 1, where 0 is the optimal fitness. The median was preferred over the mean in all the evolution plots shown in the next section because the median is more robust to outliers.
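For replication purposes, the settings listed above can be gathered in one place, as in the sketch below. The class and field names are hypothetical, and `protected_div` assumes the usual convention of [1], in which division by zero returns 1; the text only says that division is "protected as in [1]".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TreeGPSettings:
    """Settings reported for ST-GP and SEM-GP (field names are illustrative)."""
    runs: int = 30
    population_size: int = 100
    max_fitness_evaluations: int = 40_000
    max_initial_depth: int = 6          # ramped half-and-half initialization
    n_variables: int = 1
    n_random_constants: int = 100       # drawn from [0, 1]
    tournament_size: int = 4
    reproduction_rate: float = 0.1
    crossover_probability: float = 0.9
    mutation_probability: float = 0.1
    max_mutation_branch_depth: int = 6


# BA-GP uses its own operators, so only the values given in the text are listed.
BA_GP_SETTINGS = {
    "runs": 30,
    "population_size": 25,
    "max_fitness_evaluations": 40_000,
    "n_clones": 4,
    "n_infections": 5,
}


def protected_div(a: float, b: float) -> float:
    """Division that returns 1.0 when the denominator is zero (assumed convention)."""
    return 1.0 if b == 0 else a / b


settings = TreeGPSettings()
```

Freezing the dataclass keeps one run from silently altering the parameters shared by the other runs.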

C. Experimental Results

In this section, the obtained experimental results are reported. In particular, Section IV-C1 presents results that are obtained considering the SEM-GP method, ST-GP, and BA-GP. The experimental phase has been performed to check whether or not


Fig. 1. Test functions.

TABLE I. SUMMARY OF CANBERRA ERROR STATISTICS FOR THE CONSIDERED GP SYSTEMS

Statistics based on 30 independent runs.

the proposed semantic niching method is able to produce better results than ST-GP and BA-GP. In particular, we are interested in checking the generalization ability of the proposed technique. To do so, we consider the fitness on unseen data reported in the column labeled “Test Data” in Fig. 1. In the second part of the experiments, we analyze whether or not the deletion of introns (which are widely present in the individuals produced by SEM-GP) affects the performance of the best solutions produced by the SEM-GP system. We decided to perform this study because our proposed semantic method produces bigger trees than ST-GP and BA-GP (as reported in Section IV-C2), and these trees are characterized by a large number of introns. The size of a tree is defined as the number of its nodes. Results of this second part of the experimental phase are reported in Section IV-C3.

1) Semantic Genetic Programming Versus Standard Genetic Programming and Bacterial Genetic Programming: The test fitness of the individual that is best on the training data is reported, for all the considered test functions, in Fig. 2. As can be seen, SEM-GP is the best performer on all the considered test functions. Table I reports a summary of Canberra error statistics for the considered GP systems. A statistical validation of the results has been performed. In particular, the statistical significance of the results has been evaluated with the Mann–Whitney test [27], at a 95% confidence level with a Bonferroni correction for the value of α. According to the Mann–Whitney test, the results produced by the SEM-GP method are statistically different from the ones produced by ST-GP and BA-GP for all the considered test functions. Results of the Mann–Whitney test are reported in Table II. In the experimental phase, we have also analyzed the convergence of the proposed SEM-GP method with respect to the other considered techniques. To do that, we have used the method proposed in [28], which considers the entropy of the population and the diversity of its individuals. Results of this analysis are not reported here, but the SEM-GP method has shown no premature convergence.
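The statistical validation described above can be illustrated with a small, self-contained Python sketch of a two-sided Mann–Whitney test followed by a Bonferroni correction. It is only a sketch under stated assumptions: the p-value uses the normal approximation without tie or continuity correction (reasonable for samples of 30 runs, but not necessarily the procedure followed in [27]), the number of comparisons is a hypothetical parameter, and all function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_p(a, b):
    """Two-sided p-value of the U statistic via the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def significant(p_value, alpha=0.05, comparisons=2):
    """Bonferroni: compare each p-value against alpha / number of comparisons."""
    return p_value < alpha / comparisons
```

For example, with two comparisons per test function (SEM-GP against ST-GP and against BA-GP), each p-value would be compared against 0.05 / 2 = 0.025 rather than 0.05.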

2) Size of the Solutions: While SEM-GP produces better solutions than ST-GP and BA-GP, it has an important drawback.


Fig. 2. Test fitness of the best individuals with respect to the number of evaluations for test functions NC1, NC2, NC3, NC4, and NC5. The median over 30 runs is reported.

TABLE II. p-VALUES OBTAINED BY COMPARING THE DIFFERENCES IN THE MEDIAN TEST ERROR (TABLE I) OF SEM-GP AGAINST ST-GP AND BA-GP ON THE CONSIDERED TEST PROBLEMS USING THE MANN–WHITNEY U-TEST

Values lower than 10⁻¹⁰ are reported as 0.

In fact, the individuals it produces are significantly larger than those returned by both ST-GP and BA-GP. This difference in size is reported in Fig. 3 and is also reflected in the execution time: an SEM-GP run is 1.8 times slower than a run of ST-GP and 1.6 times slower than a run of BA-GP. Given the huge difference in solution size between the considered techniques, this difference in execution time is relatively modest. An analysis of the individuals allowed us to discover that SEM-GP’s individuals are characterized by a large number of introns. This analysis motivates the study performed in Section IV-C3.

3) Intron Deletion: Ever since the first experiments on GP were carried out, it has been observed that regions of nonfunctional code are created, even in the fittest programs. Regions of redundant code in genetic programs have often been referred to as “introns,” a term used in biology to indicate sections of DNA that do not code for proteins and appear to have no useful function. The effect of introns in GP has been widely investigated (see, for instance, [29] and [30]). In this section, we analyze whether or not the deletion of introns from the


Fig. 3. Size of the best individuals for test functions NC1, NC2, NC3, NC4, and NC5 for ST-GP, BA-GP, and SEM-GP. Thirty runs have been considered.

individuals has any effect on the performance of the fittest individual. To do that, we compare the results obtained from SEM-GP (and presented in the previous section) with the ones produced by SEM-GP with an intron deletion mechanism. In particular, in this latter GP system, we detect and remove introns from all the individuals at the end of every fitness evaluation. We refer to this system as SEM-GP-DEL. The test fitness of the best elements is reported, for all the considered test functions, in Fig. 4. Also in this part of the experimental phase, a statistical validation of the results has been performed. As before, the statistical significance of the results has been evaluated with the Mann–Whitney test at a 95% confidence level. Results of the Mann–Whitney test are reported in Table III. While for the test function NC5 the performances of SEM-GP and SEM-GP-DEL are statistically different (with the former method producing the best solutions), for the remaining test functions the two methods produce solutions whose fitness values are not statistically different. Moreover, the effect of intron deletion on these latter

TABLE III. p-VALUES OBTAINED BY COMPARING THE DIFFERENCES IN THE MEDIAN TEST ERROR OF BEST INDIVIDUALS OF SEM-GP AGAINST SEM-GP-DEL ON THE CONSIDERED TEST PROBLEMS USING THE MANN–WHITNEY U-TEST

30 runs have been considered.

test functions is difficult to interpret: considering test functions NC2 and NC4, it is possible to note that the deletion of the introns increases the standard deviation of the fitness of the best solutions with respect to SEM-GP. On the other hand, considering test function NC4, we observe the opposite behavior, where SEM-GP-DEL is the method that minimizes the standard deviation of the fitness of the best individuals.
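The text does not detail how introns are detected in SEM-GP-DEL. As a hedged illustration of what removing code with no effect on the semantics could look like, the sketch below replaces a node by one of its children whenever that child already has the same outputs as the node on the training inputs. The tuple-based tree encoding, the function names, and this particular detection rule are all assumptions made for the example; other forms of redundant code (such as x − x) are left untouched by it.

```python
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: 1.0 if b == 0 else a / b,  # protected division
}


def evaluate(tree, x):
    """Evaluate a tree given as 'x', a constant, or a tuple (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else float(tree)


def semantics(tree, inputs):
    """Semantics of a tree: its vector of outputs on the training inputs."""
    return [evaluate(tree, x) for x in inputs]


def size(tree):
    """Size of a tree, defined as the number of its nodes."""
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1


def same(u, v, tol=1e-12):
    """True when two semantics vectors agree on every training input."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


def remove_introns(tree, inputs):
    """Bottom-up pass: replace a node by a child with identical semantics."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    node = (op, remove_introns(left, inputs), remove_introns(right, inputs))
    target = semantics(node, inputs)
    for child in node[1:]:
        if same(semantics(child, inputs), target):
            return child
    return node
```

For instance, on the tree x + 0 ∗ (x + 1), encoded as `("+", "x", ("*", 0.0, ("+", "x", 1.0)))`, this pass first collapses the product to the constant 0 and then the sum to the variable x, reducing the size from 7 nodes to 1 without changing the outputs on the training inputs.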


Fig. 4. Fitness of the best individuals for test functions NC1, NC2, NC3, NC4, and NC5 for SEM-GP and SEM-GP-DEL. Thirty runs have been considered.

In summary, the deletion of introns does not produce a statistically significant degradation of performance with respect to SEM-GP. This could be explained by the fact that the proposed method is mainly based on semantics: the search is biased toward areas of the search space where good solutions lie. The deletion of introns only causes a greater number of rejections and a slowdown of the convergence toward the best solutions. Hence, the negative effect of the crossover is handled by the SEM-GP method with the acceptance criterion depicted in Algorithm 2. Regarding test function NC5, the difference in terms of fitness between the considered methods could be explained by the fact that a high number of rejections occurs. Hence, SEM-GP-DEL is not able to find (for this test problem) solutions of the same quality as the ones produced by SEM-GP before the imposed limit of 40 000 evaluations. It is worth pointing out that a further time overhead is introduced in SEM-GP-DEL by the intron deletion operation. An SEM-GP-DEL run is 2.3 times slower than an ST-GP run and 2.0 times slower than a run of BA-GP. In conclusion, this final part of the experiments does not contradict the hypothesis proposed in [29] regarding the protective role of introns against the destructive effect of crossover; simply, this effect is mitigated by the acceptance criterion that characterizes the proposed SEM-GP algorithm.

V. CONCLUSION

In this paper, we have presented a new semantic-based GP system which is called SEM-GP. SEM-GP creates a bias toward local exploitation of program semantics, and this bias is governed by an adaptive distribution that attempts to allocate the program sampling effort around semantic state space areas in proportion to their fitness. Experimental results show that SEM-GP performs significantly better than standard GP (ST-GP) and bacterial GP (BA-GP), at least for all the considered test functions. A further analysis has been performed, with the objective


of understanding whether or not the deletion of introns from the individuals created by SEM-GP affects the performance of the best solutions produced by the proposed method. The starting point of this analysis was the observation that SEM-GP produces individuals that are up to ten times bigger than the ones produced by ST-GP and BA-GP. Experimental results show that the deletion of introns does not result in a reduction of the quality of the best individuals, when compared with the ones obtained without the deletion. These results may seem counterintuitive in light of [29], where the authors hypothesized that introns play a global protection role, enabling an individual to protect itself against the destructive effect of crossover. However, our study suggests that the hypothesis formulated in [29] is not contradicted: simply, the proposed semantic search algorithm is able to counteract the destructive effect of the crossover by rejecting individuals that do not satisfy the acceptance criterion.

REFERENCES

[1] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA, USA: MIT Press, 1992.

[2] R. Poli, L. Vanneschi, W. Langdon, and N. McPhee, “Theoretical results in genetic programming: The next ten years?” Genet. Program. Evolvable Mach., vol. 11, no. 3, pp. 285–320, Sep. 2010.

[3] U. N. Quang, M. O’Neill, X. H. Nguyen, B. McKay, and E. G. Lopez, “Semantic similarity based crossover in GP: The case for real-valued function regression,” in Artificial Evolution (Lecture Notes in Computer Science, vol. 5975). Berlin, Germany: Springer-Verlag, Oct. 26–28, 2009, pp. 170–181.

[4] N. F. McPhee, B. Ohs, and T. Hutchison, “Semantic building blocks in genetic programming,” in Genetic Programming (Lecture Notes in Computer Science, vol. 4971). Berlin, Germany: Springer-Verlag, 2008, pp. 134–145.

[5] N. Q. Uy, N. X. Hoai, M. O’Neill, R. I. McKay, and E. Galvan-Lopez, “Semantically-based crossover in genetic programming: Application to real-valued symbolic regression,” Genet. Program. Evolvable Mach., vol. 12, no. 2, pp. 91–119, Jun. 2011.

[6] L. Beadle and C. G. Johnson, “Semantic analysis of program initialisation in genetic programming,” Genet. Program. Evolvable Mach., vol. 10, no. 3, pp. 307–337, Sep. 2009.

[7] K. Krawiec and P. Lichocki, “Approximating geometric crossover in semantic space,” in Proc. 11th Annu. Conf. Genetic Evolutionary Comput., Jul. 8–12, 2009, pp. 987–994.

[8] S. Silva and S. Dignum, “Extending operator equalisation: Fitness based self adaptive length distribution for bloat free GP,” in Genetic Programming (Lecture Notes in Computer Science, vol. 5481). Berlin, Germany: Springer-Verlag, 2009, pp. 159–170.

[9] J. Botzheim, C. Cabrita, L. T. Koczy, and A. E. Ruano, “Genetic and bacterial programming for B-spline neural networks design,” J. Adv. Comput. Intell. Intell. Inf., vol. 11, no. 2, pp. 220–231, Feb. 2007.

[10] P. G. Espejo, S. Ventura, and F. Herrera, “A survey on the application of genetic programming to classification,” IEEE Trans. Syst., Man, Cybern. C, vol. 40, no. 2, pp. 121–144, Mar. 2010.

[11] Y. Lin and B. Bhanu, “Evolutionary feature synthesis for object recognition,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 2, pp. 156–171, May 2005.

[12] D. P. Muni, N. R. Pal, and J. Das, “Genetic programming for simultaneous feature selection and classifier design,” IEEE Trans. Syst., Man, Cybern. B, vol. 36, no. 1, pp. 106–117, Feb. 2006.

[13] H. Guo, L. B. Jack, and A. K. Nandi, “Feature generation using genetic programming with application to fault classification,” IEEE Trans. Syst., Man, Cybern. B, vol. 35, no. 1, pp. 89–99, Feb. 2005.

[14] Y. Lin and B. Bhanu, “Evolutionary feature synthesis for object recognition,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 2, pp. 156–171, May 2005.

[15] K. C. Tan, Q. Yu, and T. H. Lee, “A distributed evolutionary classifier for knowledge discovery in data mining,” IEEE Trans. Syst., Man, Cybern. C, vol. 35, no. 2, pp. 131–142, May 2005.

[16] R. Curry, P. Lichodzijewski, and M. I. Heywood, “Scaling genetic programming to large datasets using hierarchical dynamic subset selection,” IEEE Trans. Syst., Man, Cybern. B, vol. 37, no. 4, pp. 1065–1073, Aug. 2007.

[17] M. Alam, M. Islam, X. Yao, and K. Murase, “Recurring two-stage evolutionary programming: A novel approach for numeric optimization,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 5, pp. 1352–1365, Oct. 2011.

[18] U. Bhowan, M. Johnston, and M. Zhang, “Developing new fitness functions in genetic programming for classification with unbalanced data,” IEEE Trans. Syst., Man, Cybern. B, vol. 42, no. 2, pp. 406–421, Apr. 2012.

[19] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation,” IEEE Trans. Comput., vol. 35, no. 8, pp. 677–691, Aug. 1986.

[20] U.-M. O’Reilly, “Using a distance metric on genetic programs to understand genetic operators,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., Comput. Cybern. Simul., Orlando, FL, USA, Oct. 12–15, 1997, vol. 5, pp. 4092–4097.

[21] L. Beadle and C. Johnson, “Semantically driven crossover in genetic programming,” in Proc. IEEE World Congr. Comput. Intell., Jun. 1–6, 2008, pp. 111–116.

[22] L. Beadle and C. G. Johnson, “Semantically driven mutation in genetic programming,” in Proc. IEEE Congr. Evolutionary Comput., May 18–21, 2009, pp. 1336–1342.

[23] R. Poli and J. Page, “Solving high-order Boolean parity problems with smooth uniform crossover, sub-machine code GP and demes,” Genet. Program. Evolvable Mach., vol. 1, no. 1–2, pp. 37–56, Apr. 2000.

[24] G. N. Lance and W. T. Williams, “Mixed-data classificatory programs I: Agglomerative systems,” Austral. Comput. J., vol. 1, no. 1, pp. 15–20, 1967.

[25] E. J. Vladislavleva, G. F. Smits, and D. den Hertog, “Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming,” IEEE Trans. Evol. Comput., vol. 13, no. 2, pp. 333–349, Apr. 2009.

[26] M. Keijzer, “Improving symbolic regression with interval arithmetic and linear scaling,” in Genetic Programming (Lecture Notes in Computer Science). Berlin, Germany: Springer-Verlag, 2003, pp. 70–82.

[27] G. Corder and D. Foreman, Nonparametric Statistics for Non-Statisticians. New York, NY, USA: Wiley, 2009.

[28] E. K. Burke, S. Gustafson, and G. Kendall, “Diversity in genetic programming: An analysis of measures and correlation with fitness,” IEEE Trans. Evol. Comput., vol. 8, no. 1, pp. 47–62, Feb. 2004.

[29] P. Nordin, F. Francone, and W. Banzhaf, “Explicitly defined introns and destructive crossover in genetic programming,” in Proc. Workshop Genetic Program.: From Theory to Real-World Appl., Tahoe City, CA, USA, Jul. 9, 1995, pp. 6–22.

[30] S. Garcia Carbajal and F. G. Martinez, “Evolutive introns: A non-costly method of using introns in GP,” Genet. Program. Evolvable Mach., vol. 2, no. 2, pp. 111–122, Jun. 2001.

Mauro Castelli received the Master’s degree (Laurea) in computer science from the University of Milano Bicocca, Milan, Italy, in 2008 (“summa cum laude”), and the Ph.D. degree from the University of Milano Bicocca in 2012. His Ph.D. thesis presented contributions in the field of evolutionary computation and, in particular, genetic programming.

From February 2012 to February 2013, he had a Postdoctoral position at INESC-ID, Lisbon, Portugal. Since March 2013, he has been an Invited Assistant Professor with Instituto Superior de Estatística e Gestão de Informação (ISEGI), Universidade Nova de Lisboa, Lisbon, Portugal. His main research interests are in the field of artificial intelligence (in particular, evolutionary computation and genetic programming) and in the application of machine-learning techniques to solve complex real-life problems, especially in the field of biology and medicine.


Leonardo Vanneschi received the Master’s degree (“Laurea”) in computer science from the University of Pisa, Pisa, Italy, in 1996 (“summa cum laude”), and the Ph.D. degree from the University of Lausanne, Lausanne, Switzerland, in 2004. His Ph.D. thesis was honored with the excellence award of the Science Faculty of the University of Lausanne.

Since September 2011, he has been an Assistant Professor with Instituto Superior de Estatística e Gestão de Informação (ISEGI), Universidade Nova de Lisboa, Lisbon, Portugal, and an invited Associate Researcher at INESC-ID, Lisbon, Portugal. His main research interests include computational intelligence and the study of complex systems, with particular focus on evolutionary computation and genetic programming.

Dr. Vanneschi is an Associate Editor of a scientific journal in the area, and a member of the editorial board of two scientific journals and of the steering committee and program committee of various international conferences. As of February 2013, he had published 130 scientific contributions, among which nine have been honored with international awards.

Sara Silva received the B.Sc. and M.Sc. degrees in informatics from the University of Lisbon, Lisbon, Portugal, and the Ph.D. degree in informatics engineering from the University of Coimbra, Coimbra, Portugal, in 2008.

She is a Senior Researcher with the Knowledge Discovery and Bioinformatics (KDBIO) group at INESC-ID Lisboa, Portugal, and an Invited Researcher with the Evolutionary and Complex Systems (ECOS) group at CISUC, Portugal. Her main research interests include bioinspired machine-learning methods, in particular genetic programming, which she has applied in several interdisciplinary projects ranging from remote sensing and forest science to epidemiology and medical informatics.

Dr. Silva is a member of the editorial board of Genetic Programming and Evolvable Machines, a member of the steering committee and program committee of various international conferences, and the designer and developer of GPLAB, a genetic programming toolbox for MATLAB. She has around 40 peer-reviewed scientific publications, four of which have received international awards.